Box Plot | More Statistical Plots
Ultimate Visualization with Python

Box Plot

A box plot is another very common statistical plot. It uses quartiles to visualize the central tendency, spread, and potential outliers of the data.

Quartiles

Quartiles are three values that divide the data points (sorted in ascending order) into four equal-sized parts:

  • The first quartile (Q1) is the middle value between the smallest value of the sample and the median (25% of the data lies below Q1);
  • The second quartile (Q2) is the median itself (50% of the data lies below the median);
  • The third quartile (Q3) is the middle value between the median and the highest value of the sample (75% of the data lies below Q3).
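The quartile definitions above can be checked on a small sample. The sketch below is only an illustration: the sample values are made up, and it uses the standard library's statistics.quantiles() with method="inclusive", which interpolates linearly between neighboring data points.

```python
import statistics

# Hypothetical sample, used only for illustration
data = sorted([7, 1, 3, 9, 5, 11, 13, 15])

# n=4 splits the data into four parts and returns the three cut points
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Share of the data points lying below each quartile
share_below = [sum(value < q for value in data) / len(data) for q in (q1, q2, q3)]

print(q1, q2, q3)    # 4.5 8.0 11.5
print(share_below)   # [0.25, 0.5, 0.75]
```

Note that Q2 matches statistics.median(data), as expected from the second bullet.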

Let’s have a look at an example of a box plot:

This box plot is based on GDP per capita data for different countries.

Box Plot Elements

  • The upper side of the blue rectangle represents the third (upper) quartile, and the lower side represents the first (lower) quartile;
  • Q3 - Q1 is called the interquartile range (IQR). It is represented by the height of the rectangle, and the green line inside it is the median;
  • The black lines outside the rectangle are called whiskers. By default, the lower one extends to the smallest data point that is not below Q1 - 1.5 * IQR, and the upper one extends to the largest data point that is not above Q3 + 1.5 * IQR;
  • The data points that lie outside the whiskers are called outliers (in this example there are quite a lot of them).
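To make these elements concrete, here is a minimal sketch that computes the interquartile range, the whisker ends, and the outliers by hand. The sample is hypothetical, and the calculation assumes the default 1.5 multiplier and linearly interpolated quartiles; it is meant to mirror the description above, not to reproduce any library's internals.

```python
import statistics

# Hypothetical sample with one unusually large value
data = sorted([1, 3, 5, 7, 9, 11, 13, 15, 40])

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Limits beyond which a point is treated as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are still inside the limits
lower_whisker = min(value for value in data if value >= lower_limit)
upper_whisker = max(value for value in data if value <= upper_limit)

outliers = [value for value in data if value < lower_limit or value > upper_limit]

print(q1, median, q3, iqr)           # 5.0 9.0 13.0 8.0
print(lower_whisker, upper_whisker)  # 1 15
print(outliers)                      # [40]
```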

Now it's time to create a box plot with the help of matplotlib:

```python
import pandas as pd
import matplotlib.pyplot as plt

url = 'https://content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
# Loading the dataset with the average yearly temperatures in Boston and Seattle
weather_df = pd.read_csv(url, index_col=0)

# Creating a box plot for the Seattle temperatures
plt.boxplot(weather_df['Seattle'])
plt.show()
```

Box Plot Data

As you can see, the code is rather simple. You call the boxplot() function from the pyplot module, passing your data as the first (and only required) parameter, x. The data can be a 1D array-like (here, a Series), a 2D array (a box plot is drawn for each column), or a sequence of 1D arrays (a box plot is drawn for each array).
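The three accepted input forms can be compared side by side. The following is a rough sketch with randomly generated data (the seed, sample sizes, and variable names are assumptions made for illustration). It selects the non-interactive Agg backend so that it runs without a display, and it counts the boxes that each call draws.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)

one_sample = rng.normal(size=50)                      # 1D array: a single box
table = rng.normal(size=(50, 3))                      # 2D array: one box per column
ragged = [rng.normal(size=30), rng.normal(size=80)]   # arrays of different lengths

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
result_1d = axes[0].boxplot(one_sample)
result_2d = axes[1].boxplot(table)
result_seq = axes[2].boxplot(ragged)

# boxplot() returns a dictionary of the drawn artists; "boxes" holds one entry per box
box_counts = [len(result["boxes"]) for result in (result_1d, result_2d, result_seq)]
print(box_counts)  # [1, 3, 2]
```

A sequence of 1D arrays is the form to reach for when the samples have different lengths, since a 2D array requires equally long columns. To view the figure interactively, drop the matplotlib.use("Agg") line and call plt.show() at the end.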

Optional Parameters

There are also quite a few optional parameters for customizing the box plot, which you can explore in the boxplot() documentation, yet in practice you will rarely need them.

The tick_labels parameter (named labels before matplotlib 3.9) is an exception. It is useful not only for labeling a single box plot, but also for labeling each box plot when there is more than one array:

```python
import pandas as pd
import matplotlib.pyplot as plt

url = 'https://content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Creating two box plots for Boston and Seattle temperatures
plt.boxplot(weather_df, tick_labels=['Boston', 'Seattle'])
plt.show()
```

Here we slightly modified our example by passing the entire DataFrame, which has 2 columns, and labeling each box plot appropriately.

Task

Your task is to create two box plots using two samples from the standard normal distribution:

  1. Use the correct function to create the box plots.
  2. Use a list containing normal_sample_1 and normal_sample_2 (in this order, from left to right) as the data.
  3. Label the left box plot as First sample and the right one as Second sample by passing a list of labels.
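One possible way to approach this task is sketched below. The sample size, the seed, and the way the samples are generated are assumptions, since the task's starter code is not shown here. Because the labeling parameter is called tick_labels in matplotlib 3.9 and later and labels in earlier versions, the sketch picks whichever name the installed version accepts.

```python
import inspect

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.axes import Axes

# Assumed stand-ins for the samples provided by the task
rng = np.random.default_rng(seed=0)
normal_sample_1 = rng.standard_normal(100)
normal_sample_2 = rng.standard_normal(100)

# Use tick_labels when available, otherwise fall back to the older name
boxplot_params = inspect.signature(Axes.boxplot).parameters
label_kw = "tick_labels" if "tick_labels" in boxplot_params else "labels"

fig, ax = plt.subplots()
result = ax.boxplot(
    [normal_sample_1, normal_sample_2],
    **{label_kw: ["First sample", "Second sample"]},
)
```

On a recent matplotlib version, the call reduces to plt.boxplot([normal_sample_1, normal_sample_2], tick_labels=['First sample', 'Second sample']) followed by plt.show().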


Section 4. Chapter 2