The Art of A/B Testing

Histograms and Box Plots

About Histograms

To evaluate a distribution visually, start with a histogram. If the distribution is far from normal, the histogram should make that obvious right away.

Picture time! Let's plot the distributions of both groups on one graph.

# Import the libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read .csv files
df_control = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/c3b98ad3-420d-403f-908d-6ab8facc3e28/ab_control.csv', delimiter=';')
df_test = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/c3b98ad3-420d-403f-908d-6ab8facc3e28/ab_test.csv', delimiter=';')

# Plot histograms of the Impression columns
sns.histplot(df_control['Impression'], color='#1e2635', label='Group A')
sns.histplot(df_test['Impression'], color='#ff8a00', label='Group B')

# Add the legend, axis labels, and title
plt.legend(title='Groups')
plt.xlabel('Impression')
plt.ylabel('Frequency')
plt.title('Distribution of Impressions')

# Show the graph
plt.show()

In this code, we use the sns.histplot function from the seaborn library. We call it once with df_control['Impression'] and once with df_test['Impression'], so both histograms are drawn on the same axes and can be compared directly.
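One detail worth keeping in mind: each sns.histplot call picks its own bins, so the bars of the two groups may not line up. The sketch below shows one possible way to compute shared bin edges with NumPy. The data is synthetic (the means, spreads, and sample sizes are assumptions, not the course dataset), and the variable names are only illustrative.

```python
import numpy as np

# Synthetic stand-ins for the two 'Impression' columns (assumed values, not the course data)
rng = np.random.default_rng(seed=0)
control = rng.normal(loc=100_000, scale=20_000, size=40)
test = rng.normal(loc=75_000, scale=30_000, size=40)

# Compute bin edges from the pooled data so both groups use identical bins
bin_edges = np.histogram_bin_edges(np.concatenate([control, test]), bins=10)

# Count how many values of each group fall into each shared bin
counts_control, _ = np.histogram(control, bins=bin_edges)
counts_test, _ = np.histogram(test, bins=bin_edges)

# Print the two histograms side by side as plain text
for left, right, a, b in zip(bin_edges[:-1], bin_edges[1:], counts_control, counts_test):
    print(f"{left:>10.0f} to {right:>10.0f}: A={a:2d}  B={b:2d}")
```

The same bin_edges array could then be passed to both sns.histplot calls through the bins parameter, so the two histograms are drawn on a common grid.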

Are these distributions normal? Hard to tell...

Let's look at box plots:

About Boxplots

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read .csv files
df_control = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/c3b98ad3-420d-403f-908d-6ab8facc3e28/ab_control.csv', delimiter=';')
df_test = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/c3b98ad3-420d-403f-908d-6ab8facc3e28/ab_test.csv', delimiter=';')

# Add a label column marking each row as belonging to the control or the test group
df_control['group'] = 'Control group'
df_test['group'] = 'Test group'

# Concatenate the control and test dataframes
df_combined = pd.concat([df_control, df_test])

# Draw one box per group, with the median line in red
sns.boxplot(data=df_combined, x='group', y='Impression', palette=['#1e2635', '#ff8a00'],
            medianprops={'color': 'red'})

# Label the axes and add a title
plt.xlabel('')
plt.ylabel('Impression')
plt.title('Comparison of Impressions')

# Show the results
plt.show()

In order to display two boxplots on the same chart, we combine the data frames using the pd.concat function.
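As a small illustration of what the concatenation step produces, here is a sketch with two tiny, made-up data frames (df_a and df_b are hypothetical stand-ins for df_control and df_test). It also uses ignore_index=True, which the lesson code does not; that is an optional choice to avoid repeated index values in the combined frame.

```python
import pandas as pd

# Tiny made-up frames standing in for df_control and df_test
df_a = pd.DataFrame({'Impression': [100, 120, 90]})
df_b = pd.DataFrame({'Impression': [80, 150]})

# Label each row with its group before stacking the frames
df_a['group'] = 'Control group'
df_b['group'] = 'Test group'

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each frame's own index
combined = pd.concat([df_a, df_b], ignore_index=True)

print(combined)
print(combined['group'].value_counts())
```

The result is a single "long" table with one row per observation and a group column, which is the shape sns.boxplot expects when x is set to the group column.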

Next, we use the sns.boxplot function, passing the combined data frame df_combined to it. The x-axis shows the two groups (Control and Test), and the y-axis shows the values of the 'Impression' column. With the help of the matplotlib library, we label the axes and add a title.
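To make it easier to read a box plot, the sketch below computes the numbers a box plot typically displays: the quartiles, the median, and whiskers based on the common 1.5 × IQR rule. The helper box_stats and the sample values are hypothetical and only serve as an illustration; the plotting library may differ in small details of how it computes these values.

```python
import pandas as pd

def box_stats(values: pd.Series) -> dict:
    """Return the summary numbers that a box plot typically displays."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Points beyond these fences are usually drawn as individual outliers
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        'q1': q1,
        'median': median,
        'q3': q3,
        # Whiskers end at the most extreme data points that are still inside the fences
        'whisker_low': inside.min(),
        'whisker_high': inside.max(),
        'outliers': outliers.tolist(),
    }

# Made-up sample with one clearly extreme value
sample = pd.Series([10, 12, 13, 14, 15, 16, 18, 40])
stats = box_stats(sample)
print(stats)
```

For roughly normal data, the median tends to sit near the middle of the box and the whiskers are of similar length. A strongly off-center median or many points on one side hints at skew, but this remains a visual hint rather than proof.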

Even after boxplots, it is not clear whether the distributions are normal. But we need to be sure about normality.

How do we do that? Statistical tests come to the rescue; we will discuss them in the next chapter.
