Histograms and Box Plots

About Histograms

To evaluate a distribution visually, build a histogram. If the distribution is far from normal, that should be noticeable right away.

Picture time! Let's plot the distributions of both groups on one graph.


# Import the libraries 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read .csv files 
df_control = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/c3b98ad3-420d-403f-908d-6ab8facc3e28/ab_control.csv', delimiter=';')
df_test = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/c3b98ad3-420d-403f-908d-6ab8facc3e28/ab_test.csv', delimiter=';')

# Plot histograms of the Impression column for each group
sns.histplot(df_control['Impression'], color='#1e2635', label='Group A')
sns.histplot(df_test['Impression'], color='#ff8a00', label='Group B')

# Add the legend, axis labels, and title
plt.legend(title='Groups')
plt.xlabel('Impression')
plt.ylabel('Frequency')
plt.title('Distribution of Impressions')

# Show the graph
plt.show()

In this code, we use the sns.histplot function from the seaborn library. We call it once for each group, passing the column df_control['Impression'] and then df_test['Impression'], so both histograms are drawn on the same axes for comparison.
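When two histograms are drawn with separate calls, each call picks its own bins, so the bars of the two groups may not line up. One way to make the counts directly comparable is to compute a common set of bin edges first. The sketch below illustrates the idea with NumPy only; the helper name shared_bin_counts and the synthetic data are assumptions made for illustration and are not part of the lesson's dataset.

```python
import numpy as np

def shared_bin_counts(a, b, bins=10):
    """Count the values of a and b using one common set of bin edges."""
    # Derive the edges from both samples together so they cover the full range
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    counts_a, _ = np.histogram(a, bins=edges)
    counts_b, _ = np.histogram(b, bins=edges)
    return edges, counts_a, counts_b

# Synthetic stand-ins for the two Impression columns (made-up values)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=10, size=200)
group_b = rng.normal(loc=110, scale=10, size=200)

edges, counts_a, counts_b = shared_bin_counts(group_a, group_b, bins=8)
print(len(edges), counts_a.sum(), counts_b.sum())  # 9 200 200
```

The resulting edges array could then be passed as the bins argument of both sns.histplot calls, so that both groups are counted over identical intervals.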

Are these distributions normal? Hard to tell...

Let's look at box plots:

About Boxplots


# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read .csv files 
df_control = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/c3b98ad3-420d-403f-908d-6ab8facc3e28/ab_control.csv', delimiter=';')
df_test = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/c3b98ad3-420d-403f-908d-6ab8facc3e28/ab_test.csv', delimiter=';')

# Add a 'group' column that labels each row as belonging to the control or the test group
df_control['group'] = 'Control group'
df_test['group'] = 'Test group'

# Concatenate the control and test dataframes
df_combined = pd.concat([df_control, df_test], ignore_index=True)

# Draw one box per group
sns.boxplot(data=df_combined, x='group', y='Impression', palette=['#1e2635', '#ff8a00'],
            medianprops={'color': 'red'})

# Label the axes
plt.xlabel('')
plt.ylabel('Impression')
plt.title('Comparison of Impressions')

# Show the results
plt.show()

To display two boxplots on the same chart, we combine the data frames using the pd.concat function.

Next, we use the sns.boxplot function, passing the combined data frame df_combined to it. The x-axis shows the control and test groups, and the y-axis shows the values of the 'Impression' column. With the help of the matplotlib library, we label the plot and axes.
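To see what the combined data frame looks like and which numbers a boxplot summarizes, here is a minimal sketch on toy data. The frames toy_control and toy_test and their values are made up for illustration; they only stand in for the lesson's data frames.

```python
import pandas as pd

# Toy stand-ins for df_control and df_test (values are made up)
toy_control = pd.DataFrame({'Impression': [10, 12, 14, 16, 18]})
toy_test = pd.DataFrame({'Impression': [20, 22, 24, 26, 28]})

# Label each row with its group before combining
toy_control['group'] = 'Control group'
toy_test['group'] = 'Test group'

# Stack the rows; ignore_index=True gives the result a fresh 0..n-1 index
toy_combined = pd.concat([toy_control, toy_test], ignore_index=True)

# Quartiles per group: the box spans Q1 to Q3, and the line inside is the median
quartiles = (
    toy_combined.groupby('group')['Impression']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Each row of quartiles corresponds to one box on the chart, which is why a single 'group' column is enough for seaborn to split the data.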

Even after looking at the boxplots, it is not clear whether the distributions are normal. But we need to be sure about normality.

How do we do that? Statistical tests come to the rescue; we will discuss them in the next chapter.
