Course Content
Advanced Probability Theory
Central Limit Theorem
The central limit theorem is a fundamental theorem of statistics. It states that the sum of a large number of independent, identically distributed random variables is approximately normally distributed, regardless of the underlying distribution of the individual variables.
Theorem formulation
Formally, the theorem can be stated as follows. Let X_1, X_2, ..., X_n be independent, identically distributed random variables with mean μ and finite variance σ², and let S_n = X_1 + X_2 + ... + X_n. Then

(S_n − n·μ) / (σ·√n)  →ᵈ  N(0, 1)  as n → ∞
As in the law of large numbers, the formulation of the central limit theorem contains a letter 'd' above the arrow. It denotes the so-called convergence in distribution. In simple words, it can be interpreted as follows: the more terms we add, the closer the PDF of their sum gets to the PDF of a Gaussian distribution.
Instead of the last line in the formulation above, another form is often used:

S_n ≈ N(n·μ, n·σ²)

In this formulation, we no longer talk about convergence. Instead, we state directly that the sum follows a Gaussian distribution with specific parameters. However, it's important to note that this approximation only holds for large values of n.
For each specific distribution the required value of n differs, but generally, if n is at least 35, the approximation works with reasonably high accuracy.
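This rule of thumb is easy to probe numerically. The sketch below (assuming NumPy; the exponential distribution with rate 1 and n = 35 are arbitrary choices for illustration) standardizes sums of 35 exponential variables and checks what fraction of the standardized values falls within ±1.96, which for a Gaussian would be about 95%:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 35            # number of terms per sum
trials = 100_000  # number of simulated sums
lam = 1.0         # rate of the exponential distribution (arbitrary choice)

# Each row is one sample of n exponential variables; sum along rows
sums = rng.exponential(1 / lam, size=(trials, n)).sum(axis=1)

# Standardize using the exact mean and variance of the sum:
# E[S_n] = n/lam, Var[S_n] = n/lam^2
z = (sums - n / lam) / np.sqrt(n / lam**2)

# If the approximation is good, the fraction should be close to 0.95
coverage = np.mean(np.abs(z) < 1.96)
print(round(coverage, 2))
```

Even though the exponential distribution is heavily skewed, at n = 35 the standardized sums already behave almost like a standard Gaussian.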
Illustration of the theorem
Take a look at the illustration below: we'll calculate the PDF of the sum of uniformly distributed variables. As shown in the illustration, the resulting PDF becomes more similar to a Gaussian PDF as we use more and more terms to calculate the sum.
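The same effect can be quantified without a plot. One simple measure is the excess kurtosis, which is 0 for a Gaussian and −1.2 for a single uniform variable. The sketch below (assuming NumPy; the sample sizes are arbitrary choices) shows the excess kurtosis of the sum moving toward 0 as the number of terms grows:

```python
import numpy as np

rng = np.random.default_rng(42)
trials = 200_000  # number of simulated sums per setting

def excess_kurtosis(x):
    """Sample excess kurtosis: 0 for a Gaussian distribution."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3

# Sum of n Uniform(0, 1) variables for increasing n
kurt = {}
for n in (1, 2, 5, 30):
    sums = rng.uniform(0, 1, size=(trials, n)).sum(axis=1)
    kurt[n] = excess_kurtosis(sums)
    print(n, round(kurt[n], 2))
```

In theory the excess kurtosis of a sum of n uniforms is −1.2/n, so each additional term brings the distribution of the sum closer to Gaussian.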
Now let's look at the PMF of the sum of Binomial variables:
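For binomial variables the comparison can even be made exactly: the sum of n independent Binomial(m, p) variables is itself Binomial(n·m, p), so we can compare its exact PMF against the matching Gaussian PDF. Below is a sketch using only the standard library (the parameters m = 10 and p = 0.3 are arbitrary choices):

```python
import math

def binom_pmf(k, n, p):
    """Exact PMF of a Binomial(n, p) variable at k."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def gauss_pdf(x, mean, var):
    """PDF of a Gaussian with the given mean and variance at x."""
    return math.exp(-(x - mean)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

m, p = 10, 0.3  # parameters of one binomial term (arbitrary choice)

# The sum of n iid Binomial(m, p) variables is Binomial(n*m, p), so we can
# measure the worst-case gap between its exact PMF and the Gaussian PDF
errs = {}
for n in (1, 5, 50):
    N = n * m
    mean, var = N * p, N * p * (1 - p)
    errs[n] = max(abs(binom_pmf(k, N, p) - gauss_pdf(k, mean, var))
                  for k in range(N + 1))
    print(n, round(errs[n], 4))
```

The maximum gap shrinks steadily as n grows, which is exactly what the theorem predicts.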
CLT implementation
We'll create 500 samples, each containing 100 random variables from a Poisson distribution. For each of these 500 samples, we'll calculate the sum of its random variables and create a histogram from the resulting 500 values. Then, we'll compare this histogram with the PDF plot of a Gaussian random variable.
```python
import numpy as np
import matplotlib.pyplot as plt

# List to store the sum of samples from each iteration
hist_samples = []

# Generate 500 samples and calculate the sum of random variables in each sample
for i in range(500):
    # Generate 100 random variables from a Poisson distribution with mean 4
    generated_samples = np.random.poisson(4, 100)
    # Calculate the sum and append it to hist_samples
    hist_samples.append(generated_samples.sum())

# Plot a histogram of the samples and the PDF of the Gaussian distribution
fig, axes = plt.subplots(1, 2)  # Create subplots
fig.set_size_inches(10, 5)      # Set the size of the figure

# Plot the histogram on the first subplot
axes[0].hist(hist_samples, bins=10, alpha=0.5, edgecolor='black', density=True)
axes[0].set_title('Histogram of Sum of Poisson Values')

# Parameters of the matching Gaussian distribution:
# the mean of one Poisson variable is 4, so the mean of the sum is 400;
# the variance of one Poisson variable is 4, so the variance of the sum is 400
# and the standard deviation is 20
mean = 400
std = 20

# Define the range of x values for the plot
x = np.linspace(mean - 3 * std, mean + 3 * std, 500)

# Calculate the PDF of the Gaussian distribution
pdf = (1 / (std * np.sqrt(2 * np.pi))) * np.exp(-((x - mean)**2) / (2 * std**2))

# Plot the PDF on the second subplot
axes[1].plot(x, pdf)
axes[1].set_title('Gaussian Distribution with Mean = {} and Variance = {}'.format(mean, std**2))

plt.show()  # Display the plot
```
We can observe that the resulting histogram closely matches the PDF of the Gaussian distribution. This illustrates the theorem in action and demonstrates its applicability in real-world scenarios!