
Histogram

Definition

Histograms represent the frequency or probability distribution of a variable using vertical bars of equal width. Each bar corresponds to an interval of values, called a bin.

The pyplot module provides the hist function to create histograms. The required parameter is the data (x), which can be an array or a sequence of arrays. If multiple arrays are passed, each is shown in a different color.


import pandas as pd
import matplotlib.pyplot as plt

# Loading the dataset with the average yearly temperatures in Boston and Seattle
url = 'https://content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Creating a histogram
plt.hist(weather_df['Seattle'])
plt.show()
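As mentioned earlier, hist also accepts a sequence of arrays. The following is a minimal sketch of that case, assuming two made-up samples (city_a and city_b are hypothetical and not taken from the course dataset). The Agg backend is selected only so the sketch runs without a display; in an interactive session that line can be dropped and plt.close() replaced with plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical samples standing in for two temperature columns
rng = np.random.default_rng(42)
city_a = rng.normal(52.0, 1.0, size=26)
city_b = rng.normal(51.0, 1.5, size=26)

# Passing a list of arrays draws one set of bars per array over shared bin edges
counts, edges, containers = plt.hist([city_a, city_b], label=["City A", "City B"])
plt.legend()

print(len(containers))  # one bar container per input array
plt.close()
```

Both samples share the same bin edges, so the bars of each interval can be compared side by side.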

Intervals and Height

A Series object containing average yearly temperatures in Seattle was passed to the hist() function. By default, the data is divided into 10 equal intervals ranging from the minimum to the maximum value. However, only 9 bins are visible because the second interval contains no data points.

By default, the height of each bar equals the frequency of its interval, that is, the number of values that fall within it.
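The default binning can also be inspected numerically. The sketch below uses np.histogram, which matplotlib's hist relies on for binning, on a small made-up sample (the values in temps are invented for illustration and are not the course dataset).

```python
import numpy as np

# Hypothetical sample; illustrative values only
temps = np.array([50.1, 50.4, 51.0, 51.2, 51.3, 51.9,
                  52.0, 52.2, 52.8, 53.5, 54.0, 55.0])

# With no bins argument, np.histogram uses 10 equal-width bins
counts, edges = np.histogram(temps)

print(len(counts))          # number of bins
print(edges[0], edges[-1])  # outer edges: the sample minimum and maximum
print(counts)               # frequency per bin; an empty bin has a count of 0
```

A bin with a count of 0 is drawn with zero height, which is why fewer bars than bins can be visible in a plot.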

Number of Bins

Another important but optional parameter is bins, which accepts the number of bins (an integer), a sequence of numbers specifying the bin edges, or a string naming a binning strategy. In most cases, passing the number of bins is sufficient.
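The three accepted forms of bins can be compared with np.histogram, which takes the same kinds of values. This is only a sketch on evenly spaced, made-up data; the edge values are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical sample of 26 evenly spaced values
data = np.linspace(48.0, 56.0, 26)

# 1) Integer: the number of equal-width bins
counts_int, edges_int = np.histogram(data, bins=5)

# 2) Sequence: explicit bin edges (here 4 bins, each 2 units wide)
counts_seq, edges_seq = np.histogram(data, bins=[48.0, 50.0, 52.0, 54.0, 56.0])

# 3) String: a named rule that chooses the bins from the data
counts_str, edges_str = np.histogram(data, bins="sturges")

print(len(counts_int), len(counts_seq), len(counts_str))
```

Note that a library's built-in rule may round differently from a hand-written formula, so the string form does not necessarily produce the same number of bins as a manual calculation.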

There are several methods for choosing the number, and therefore the width, of histogram bins. This example uses Sturges' formula, which derives a suitable number of bins from the sample size:

$$\text{bins} = 1 + \lfloor \log_2 n \rfloor$$

Here, $n$ is the size of the data array.

Study More

You can explore additional methods for bin calculation here.


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

url = 'https://content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Specifying the number of bins
plt.hist(weather_df['Seattle'], bins=1 + int(np.log2(len(weather_df))))
plt.show()

The DataFrame has 26 rows (the size of the Series). Since $\log_2 26 \approx 4.7$, int() truncates it to 4, and adding 1 gives 5 bins.
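The same calculation can be checked for other sample sizes. The helper name sturges_bins below is hypothetical, introduced here only for illustration; it mirrors the expression used in the code above.

```python
import math

def sturges_bins(n):
    """Number of bins as computed in this chapter: 1 + int(log2(n))."""
    return 1 + int(math.log2(n))

for n in (26, 100, 1000):
    print(n, sturges_bins(n))
# 26 5
# 100 7
# 1000 10
```

The number of bins grows slowly with the sample size: going from 26 to 1000 values only doubles it.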

Probability Density Approximation

To view an approximation of the probability density, set the density parameter to True in the hist function.

Now, each bin's height is calculated using:

$$\text{Height} = \frac{m}{n \times w}$$

where:

$n$ - the total number of values in the dataset;
$m$ - the number of values in the bin;
$w$ - the width of the bin.

This ensures that the total area under the histogram is 1, which matches the key property of a probability density function (PDF).
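The height formula and the unit-area property can be verified numerically. The sketch below applies the formula by hand to a made-up sample (the values are illustrative only) and compares the result with np.histogram called with density=True.

```python
import numpy as np

# Hypothetical sample; illustrative values only
data = np.array([49.5, 50.2, 50.8, 51.1, 51.4, 51.6,
                 52.0, 52.3, 52.9, 53.4, 54.1, 55.0])

# Raw counts (m) and bin edges for 5 equal-width bins
counts, edges = np.histogram(data, bins=5)
widths = np.diff(edges)  # w for each bin
n = data.size            # total number of values

# Height = m / (n * w), applied to every bin at once
heights_manual = counts / (n * widths)

# The same heights as computed by NumPy's density normalization
heights_numpy, _ = np.histogram(data, bins=5, density=True)

# Total area: sum over bins of height * width
area = np.sum(heights_manual * widths)
print(area)
```

The area equals 1 up to floating-point rounding, because summing m / n over all bins gives n / n.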


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

url = 'https://content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Normalizing the histogram so that it approximates a probability density function
plt.hist(weather_df['Seattle'], bins=1 + int(np.log2(len(weather_df))), density=True)
plt.show()

This provides an approximation of the probability density function for the temperature data.

Study More

To learn more about the hist() parameters, refer to the hist() documentation.

Task

Create an approximation of a probability density function using a sample from the standard normal distribution:

Use the correct function for creating a histogram.
Use normal_sample as the data for the histogram.
Specify the number of bins as the second argument, using Sturges' formula.
Turn the histogram into an approximation of a probability density function by setting the rightmost argument correctly.
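One possible way to approach these steps is sketched below. It assumes that normal_sample is a one-dimensional sample drawn from the standard normal distribution; the sample size of 1000 and the random seed are arbitrary choices, not values from the course. The Agg backend is used only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Assumption: normal_sample comes from the standard normal distribution
rng = np.random.default_rng(0)
normal_sample = rng.standard_normal(1000)

# Number of bins from the formula used in this chapter
n_bins = 1 + int(np.log2(len(normal_sample)))

# bins is passed as the second argument; density=True normalizes the heights
heights, edges, _ = plt.hist(normal_sample, n_bins, density=True)
plt.close()  # use plt.show() instead in an interactive session

print(n_bins)
```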

