Learn Box Plot | More Statistical Plots
Ultimate Visualization with Python

Box Plot

Note
Definition

A box plot is another extremely common statistical plot. It visualizes the central tendency, spread, and potential outliers of the data by means of its quartiles.

Quartiles

Quartiles divide the data points (sorted in ascending order) into four equal-sized parts. There are three of them:

  • The first quartile (Q1) is the middle number between the smallest value of the sample and the median (25% of the data lies below Q1);

  • The second quartile (Q2) is the median itself (50% of the data lies below the median);

  • The third quartile (Q3) is the middle number between the median and the highest value of the sample (75% of the data lies below Q3).
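
As a minimal sketch of these definitions, the quartiles of a small sample can be computed with Python's standard library. The sample values are made up for illustration, and the "inclusive" method is an assumption of this example; other quartile conventions exist and may give slightly different values.

```python
import statistics

# Made-up sample, used only for illustration
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 requests the three cut points that split the sorted data into four parts;
# method="inclusive" is an assumption of this sketch (other conventions exist)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)  # 3.0 5.0 7.0
```

Here Q2 coincides with the median of the sample, as the definition states.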

Box Plot Elements

  • The right side of the red rectangle represents the third quartile (Q3), and the left side represents the first quartile (Q1);

  • Q3 - Q1 is called the interquartile range (IQR); it is represented by the width of the rectangle, and the yellow line inside the rectangle is the median;

  • The black lines outside the rectangle are called whiskers. The left one extends towards $\text{Q1} - 1.5 \cdot \text{IQR}$, and the right one towards $\text{Q3} + 1.5 \cdot \text{IQR}$ (by default, matplotlib ends each whisker at the most extreme data point that still lies within these limits);

  • The data points that lie outside the whiskers are called outliers.
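
These elements can also be worked out numerically. The sketch below uses a made-up sample and the standard library (both are assumptions for illustration, not the lesson's dataset) to compute the IQR, the limits at 1.5 times the IQR beyond the quartiles, the points where the whiskers would end, and the outliers.

```python
import statistics

# Made-up sample with one suspiciously large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles with the "inclusive" method (an assumption of this sketch)
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Limits located 1.5 * IQR below Q1 and above Q3
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points beyond the limits are outliers; each whisker ends at the most
# extreme point that is still inside the limits
outliers = [x for x in data if x < lower_limit or x > upper_limit]
inside = [x for x in data if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

print(iqr, lower_limit, upper_limit)  # 4.5 -3.5 14.5
print(whisker_low, whisker_high)      # 1 9
print(outliers)                       # [30]
```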

The next step is to generate a box plot using the matplotlib library:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Loading the dataset with the average yearly temperatures in Boston and Seattle
    url = 'https://content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
    weather_df = pd.read_csv(url, index_col=0)

    # Creating a box plot for the Seattle temperatures
    plt.boxplot(weather_df['Seattle'])
    plt.show()

Box Plot Data

The boxplot() function from the pyplot module has a single required parameter, x, which holds the data. This data can be a 1D array-like object (e.g., a Series), a 2D array (a box plot is drawn for each column), or a sequence of 1D arrays (a box plot is drawn for each array).
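
The three accepted input shapes can be compared side by side. This is a rough sketch on synthetic NumPy data, not the lesson's dataset; the variable names and the non-interactive Agg backend are assumptions made so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
single = rng.normal(size=100)                        # 1D array: one box
matrix = rng.normal(size=(100, 3))                   # 2D array: one box per column
ragged = [rng.normal(size=50), rng.normal(size=80)]  # sequence of 1D arrays: one box per array

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
one = axes[0].boxplot(single)
per_column = axes[1].boxplot(matrix)
per_array = axes[2].boxplot(ragged)

# boxplot() returns a dict of the drawn artists; "boxes" holds one entry per box
box_counts = [len(one["boxes"]), len(per_column["boxes"]), len(per_array["boxes"])]
print(box_counts)  # [1, 3, 2]
plt.close(fig)
```

Passing a sequence of 1D arrays is the only form of the three that allows samples of different lengths.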

Optional Parameters

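Most optional parameters of boxplot() are rarely needed, but tick_labels is an exception. It is useful not only for labeling a single box plot, but also for labeling each box plot when more than one array is passed (in Matplotlib versions older than 3.9, this parameter is named labels):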

    import pandas as pd
    import matplotlib.pyplot as plt

    # Loading the dataset with the average yearly temperatures in Boston and Seattle
    url = 'https://content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
    weather_df = pd.read_csv(url, index_col=0)

    # Creating two box plots for Boston and Seattle temperatures
    plt.boxplot(weather_df, tick_labels=['Boston', 'Seattle'])
    plt.show()

In this example, the entire DataFrame with two columns was passed to boxplot(), which drew a separate box plot for each column. The labels listed in tick_labels were applied to the box plots in column order.

Note
Study More

There are also quite a few optional parameters for customizing the box plot, which you can explore in the boxplot() documentation, although in practice you may rarely need them.

Task

Create two box plots using two samples from the standard normal distribution:

  1. Use the correct function to create the box plots.
  2. Use the list of normal_sample_1 and normal_sample_2 (in this order from left to right) as the data.
  3. Label the left box plot as First sample and the right one as Second sample by passing a list of these labels.
