Box Plot
A box plot is another extremely common plot in statistics. It visualizes the central tendency, spread, and potential outliers of the data by means of its quartiles.
Quartiles
Quartiles divide the data points (sorted in ascending order) into four equal-sized parts. There are three of them:
The first quartile (Q1) is the middle number between the smallest value of the sample and the median (25% of the data lies below Q1);
The second quartile (Q2) is the median itself (50% of the data lies below the median);
The third quartile (Q3) is the middle number between the median of the sample and the highest value of the sample (75% of the data lies below Q3).
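The definitions above can be sketched with the standard library alone. This is a minimal illustration, assuming the "median of each half" rule and leaving the median out of both halves when the sample size is odd; the helper name `quartiles` and the sample values are made up for this example.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule (an assumption)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # values before the median position
    upper = data[half + n % 2:]  # values after the median position
    return median(lower), median(data), median(upper)


# Hypothetical sample, used only for illustration
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3)  # 4 7.5 11
```

Libraries such as NumPy and matplotlib compute quartiles by interpolating between neighboring values, so their results can differ slightly from this rule on small samples.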
Box Plot Elements
The right side of the red rectangle represents the third quartile and the left side represents the first quartile;
Q3 - Q1 is called the interquartile range (IQR). It is represented by the width of the rectangle, and the yellow line inside the rectangle is the median;
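The black lines outside the rectangle are called whiskers. With the default settings, the left one extends down to Q1 - 1.5 × IQR and the right one up to Q3 + 1.5 × IQR (more precisely, to the most extreme data points within those limits);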
The data points which are outside the whiskers are called outliers.
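To make these elements concrete, here is a small sketch that computes the IQR, the whisker limits, and the outliers by hand. It assumes the common 1.5 × IQR rule (matplotlib's default `whis=1.5`) and uses a made-up sample. Quartiles come from `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points.

```python
from statistics import quantiles

# Hypothetical sample with one obviously extreme value
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 40]

q1, q2, q3 = quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1

# Limits beyond which a point counts as an outlier (1.5 * IQR rule)
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the limits
low_whisker = min(x for x in sample if x >= lower_limit)
high_whisker = max(x for x in sample if x <= upper_limit)

outliers = [x for x in sample if x < lower_limit or x > upper_limit]

print(q1, q3, iqr)                # 4.25 10.5 6.25
print(low_whisker, high_whisker)  # 2 12
print(outliers)                   # [40]
```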
The next step is to generate a box plot using the matplotlib library:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Loading the dataset with the average yearly temperatures in Boston and Seattle
url = 'https://content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Creating a box plot for the Seattle temperatures
plt.boxplot(weather_df['Seattle'])
plt.show()
```
Box Plot Data
The boxplot() function from the pyplot module is used with its first and only required parameter, x, which represents the data. This data can be a 1D array-like object (e.g., a Series), a 2D array (a box plot is drawn for each column), or a sequence of 1D arrays (a box plot is drawn for each array).
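The three accepted input shapes can be compared side by side. The sketch below uses randomly generated data (an assumption, not the course dataset) and counts the boxes that `boxplot()` reports in its returned dictionary.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seeded so the figure is reproducible

one_d = rng.normal(size=100)       # 1D array: a single box
two_d = rng.normal(size=(100, 3))  # 2D array: one box per column
seq = [rng.normal(size=50), rng.normal(size=80)]  # sequence of 1D arrays: one box per array

fig, axes = plt.subplots(1, 3, figsize=(10, 3))
result_1d = axes[0].boxplot(one_d)
result_2d = axes[1].boxplot(two_d)
result_seq = axes[2].boxplot(seq)

# The returned dictionary holds the drawn artists, including the boxes
print(len(result_1d["boxes"]), len(result_2d["boxes"]), len(result_seq["boxes"]))  # 1 3 2
plt.show()
```

Note that the arrays in a sequence do not need to have the same length, which is why a list is convenient for samples of different sizes.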
Optional Parameters
The tick_labels parameter is one optional parameter worth knowing. It is useful not only for labeling a single box plot but also for labeling each box plot when more than one array is passed:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Loading the dataset with the average yearly temperatures in Boston and Seattle
url = 'https://content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Creating two box plots for Boston and Seattle temperatures
plt.boxplot(weather_df, tick_labels=['Boston', 'Seattle'])
plt.show()
```
In this example, the entire DataFrame with two columns was passed to boxplot(), resulting in a separate box plot for each column, labeled with the names given in tick_labels.
There are also quite a few other optional parameters for customizing the box plot, which you can explore in the boxplot() documentation, although in practice you will rarely need them.
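For a rough idea of what such customization looks like, here is a sketch using a few of the optional parameters (`whis`, `showmeans`, `showfliers`, `patch_artist`). The data is randomly generated and the chosen values are arbitrary, for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Hypothetical samples, not the course dataset
samples = [rng.normal(loc=0, size=200), rng.normal(loc=1, size=200)]

result = plt.boxplot(
    samples,
    whis=1.5,           # whisker reach in multiples of the IQR (1.5 is the default)
    showmeans=True,     # mark the mean of each sample
    showfliers=False,   # hide the outlier points
    patch_artist=True,  # draw the boxes as filled patches
)

# With patch_artist=True each box is a patch, so it can be filled with a color
for box in result["boxes"]:
    box.set_facecolor("lightblue")

plt.show()
```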
Task
Create two box plots using two samples from the standard normal distribution:
- Use the correct function to create the box plots.
- Use the list of normal_sample_1 and normal_sample_2 (in this order, from left to right) as the data.
- Label the left box plot as First sample and the right one as Second sample using a list.
Solution