Ultimate Visualization with Python
Box Plot
A box plot is another extremely common plot in statistics. It visualizes the central tendency, spread, and potential outliers of the data through its quartiles.
Quartiles
Quartiles are three values that divide the data points (sorted in ascending order) into four equal-sized parts:
- The first quartile (Q1) is the middle value between the smallest value of the sample and the median (25% of the data lies below Q1);
- The second quartile (Q2) is the median itself (50% of the data lies below the median);
- The third quartile (Q3) is the middle value between the median and the highest value of the sample (75% of the data lies below Q3).
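As a quick check of these definitions, the quartiles of a small sample can be computed with NumPy. This is a minimal sketch on made-up numbers, not part of the lesson's dataset. Note that `np.percentile` uses linear interpolation by default, so on some samples its result can differ slightly from the "middle value" rule described above.

```python
import numpy as np

# Hypothetical sample; it does not need to be sorted beforehand
data = np.array([9, 1, 8, 2, 7, 3, 6, 4, 5])

# The 25th, 50th and 75th percentiles are Q1, Q2 (the median) and Q3
q1, q2, q3 = np.percentile(data, [25, 50, 75])

print(q1, q2, q3)  # 3.0 5.0 7.0
```

Here two of the nine values lie below Q1, four below the median, and six below Q3, which is as close to 25%, 50%, and 75% as a sample of nine allows.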
Let’s have a look at an example of a box plot:
This box plot is based on GDP per capita data for different countries.
Box Plot Elements
- The upper side of the blue rectangle represents the third (upper) quartile, and the lower side represents the first (lower) quartile;
- Q3 - Q1 is called the interquartile range (IQR). It is represented by the height of the rectangle, and the green line inside it is the median;
- The black lines outside the rectangle are called whiskers. By default, the lower one extends to the smallest data point that is not below Q1 - 1.5 * IQR, and the upper one extends to the largest data point that is not above Q3 + 1.5 * IQR;
- The data points outside the whiskers are called outliers (in this example there are quite a lot of them).
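The elements listed above can be reproduced by hand. The following is a rough sketch on a made-up sample that contains one obvious outlier. It assumes the default whisker factor of 1.5, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical sample with one value far from the rest
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Limits beyond which a point counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the limits
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the limits is drawn as an individual point
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, iqr)        # 3.25 5.5 7.75 4.5
print(lower_fence, upper_fence)   # -3.5 14.5
print(whisker_low, whisker_high)  # 1 9
print(outliers)                   # [30]
```

In this sample the whiskers stop at 1 and 9 rather than at the limits -3.5 and 14.5, because they are drawn only as far as actual data points reach.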
Now it's time to create a box plot with the help of `matplotlib`:
```python
import pandas as pd
import matplotlib.pyplot as plt

url = 'https://content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'

# Loading the dataset with the average yearly temperatures in Boston and Seattle
weather_df = pd.read_csv(url, index_col=0)

# Creating a box plot for the Seattle temperatures
plt.boxplot(weather_df['Seattle'])
plt.show()
```
Box Plot Data
As you can see, everything is rather simple here. You only need to call the `boxplot()` function from the `pyplot` module, passing your data as the first and only required parameter, `x`. The data can be a 1D array-like (here a `Series`), a 2D array (a box plot is drawn for each column), or a sequence of 1D arrays (a box plot is drawn for each array).
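To see how the three input forms behave, here is a small sketch that uses synthetic, seeded random data instead of the weather dataset. It relies on the fact that `boxplot()` returns a dictionary of the artists it drew, so the length of its `'boxes'` entry tells how many box plots were created. All variable names are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (hypothetical values, seeded for reproducibility)
rng = np.random.default_rng(0)
one_sample = rng.normal(size=100)                    # 1D array-like
matrix = rng.normal(size=(100, 3))                   # 2D array with 3 columns
ragged = [rng.normal(size=50), rng.normal(size=80)]  # sequence of 1D arrays

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Each call returns a dict of artists; 'boxes' holds one entry per box
single = axes[0].boxplot(one_sample)
per_column = axes[1].boxplot(matrix)
per_array = axes[2].boxplot(ragged)

axes[0].set_title('1D array')
axes[1].set_title('2D array (one box per column)')
axes[2].set_title('Sequence of 1D arrays')

box_counts = [len(single['boxes']), len(per_column['boxes']), len(per_array['boxes'])]
print(box_counts)  # [1, 3, 2]
```

Add `plt.show()` at the end to display the figure. The sequence form is the one to reach for when the samples have different lengths, since a 2D array requires equally long columns.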
Optional Parameters
There are also quite a few optional parameters for customizing the box plot, which you can explore in the `boxplot()` documentation, although in practice you will rarely need them.
The `tick_labels` parameter is an exception. It is useful not only for labeling a single box plot, but also for labeling each box plot when there is more than one array:
```python
import pandas as pd
import matplotlib.pyplot as plt

url = 'https://content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Creating two box plots for Boston and Seattle temperatures
plt.boxplot(weather_df, tick_labels=['Boston', 'Seattle'])
plt.show()
```
Here we slightly modified our example by passing the entire `DataFrame`, which has two columns, and labeling each box plot appropriately.
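One caveat: the name `tick_labels` was introduced in Matplotlib 3.9, and earlier releases call the same parameter `labels`. If the installed version is uncertain, a possible alternative is to relabel the ticks on the axes after plotting, which does not depend on either name. The sketch below uses seeded random numbers as hypothetical stand-ins for the two temperature columns; they are not the real data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the Boston and Seattle columns
rng = np.random.default_rng(1)
boston = rng.normal(loc=52, scale=1.5, size=40)
seattle = rng.normal(loc=53, scale=1.0, size=40)

fig, ax = plt.subplots()
ax.boxplot([boston, seattle])

# Boxes are drawn at x positions 1, 2, ...; relabel those ticks in order
ax.set_xticklabels(['Boston', 'Seattle'])
ax.set_ylabel('Temperature')

labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(labels)
```

Add `plt.show()` at the end to display the figure. On Matplotlib 3.9 or later, passing `tick_labels` directly to `boxplot()` remains the shorter option.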
Your task is to create two box plots using two samples from the standard normal distribution:
- Use the correct function to create the box plots.
- Use the list of `normal_sample_1` and `normal_sample_2` (in this order from left to right) as the data.
- Label the left box plot as `First sample` and the right one as `Second sample` using a `list`.