Comparing Means of Two Different Datasets

A common applied task is to compare the mathematical expectations (means) of two independent numerical datasets.
In the general case this problem is non-trivial, but under certain conditions it can be solved relatively simply. Let's consider the following conditions:

We have two independent numerical datasets, X and Y, both with Gaussian distributions and equal variances (we may not know the actual value of the variance, but we have to be sure that the variances are equal). We want to test the following hypothesis:

Main hypothesis: expectations of these datasets are equal.

Alternative hypothesis: the expectation of the X dataset is greater than that of the Y dataset.

Statistical criterion

If the conditions described above are met, we can test this hypothesis with the two-sample Student's t-test with pooled variance.
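
As a sketch, the standard pooled two-sample t-statistic (the one stats.ttest_ind computes by default) can be written as follows. The notation is an assumption made here for illustration: n and m are the sizes of X and Y, \bar{X} and \bar{Y} are the sample means, and s_X^2 and s_Y^2 are the unbiased sample variances.

```latex
t = \frac{\bar{X} - \bar{Y}}{s_p \sqrt{\dfrac{1}{n} + \dfrac{1}{m}}},
\qquad
s_p^2 = \frac{(n - 1)\, s_X^2 + (m - 1)\, s_Y^2}{n + m - 2}
```

Under the main hypothesis, t follows Student's t-distribution with n + m - 2 degrees of freedom. For the one-sided alternative stated above, the main hypothesis is rejected at significance level \alpha when t exceeds the (1 - \alpha) quantile of that distribution.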

Python implementation

Let's generate two independent datasets with different mean values and try to check the hypothesis:


import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate two independent Gaussian datasets with different means and equal variances
np.random.seed(123)  # Set seed for reproducibility
data1 = np.random.normal(loc=5, scale=2, size=500)  # Generate first dataset
data2 = np.random.normal(loc=4.7, scale=2, size=500)  # Generate second dataset

# Compute the two-sample t-test with pooled variance. By default, the test is two-sided
t_stat, p_value = stats.ttest_ind(data1, data2)

# Define the significance level
alpha = 0.05

# Compare the p-value with the significance level and print the result
if p_value < alpha:
    print('Reject the null hypothesis that the means are equal')
else:
    print('Fail to reject the null hypothesis that the means are equal')

# Plot the critical regions of the t-test
fig, ax = plt.subplots()  # Create figure and axis objects
x = np.linspace(-4, 4, 1000)  # Generate x values for plotting
y = stats.t.pdf(x, df=len(data1)+len(data2)-2)  # Compute t-distribution PDF
ax.plot(x, y, label='t-distribution')  # Plot t-distribution
t_crit_left = stats.t.ppf(alpha/2, len(data1)+len(data2)-2)  # Compute left critical value
t_crit_right = stats.t.ppf(1-alpha/2, len(data1)+len(data2)-2)  # Compute right critical value
ax.axvline(t_crit_left, color='r', linestyle='--', label='t-critical left')  # Plot left critical value
ax.axvline(t_crit_right, color='g', linestyle='--', label='t-critical right')  # Plot right critical value
ax.axvline(t_stat, color='k', linestyle='--', label='t-statistic')  # Plot t-statistic
ax.legend()  # Add legend to the plot
plt.show()  # Show the plot

We see that the value of the test statistic fell into the right critical region, so we reject the main hypothesis and conclude that the mathematical expectation of the first dataset is greater than that of the second.
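
The alternative hypothesis stated earlier is one-sided, while the default call to stats.ttest_ind is two-sided. The following is a minimal sketch of the one-sided version. It computes the pooled t-statistic by hand and compares it with SciPy's result. The names pooled_var, t_manual and p_manual are illustrative, and the alternative argument assumes SciPy 1.6 or newer.

```python
import numpy as np
import scipy.stats as stats

# Same synthetic data as in the example above
np.random.seed(123)
data1 = np.random.normal(loc=5, scale=2, size=500)
data2 = np.random.normal(loc=4.7, scale=2, size=500)

n, m = len(data1), len(data2)
df = n + m - 2  # Degrees of freedom of the pooled test

# Pooled variance estimate (ddof=1 gives the unbiased sample variances)
pooled_var = ((n - 1) * data1.var(ddof=1) + (m - 1) * data2.var(ddof=1)) / df

# Pooled two-sample t-statistic computed by hand
t_manual = (data1.mean() - data2.mean()) / np.sqrt(pooled_var * (1 / n + 1 / m))

# One-sided p-value: probability of a value at least this large under the main hypothesis
p_manual = stats.t.sf(t_manual, df)

# The same one-sided test through SciPy (assumes SciPy >= 1.6 for `alternative`)
t_scipy, p_scipy = stats.ttest_ind(data1, data2, alternative='greater')

print(f't (manual) = {t_manual:.4f}, t (SciPy) = {t_scipy:.4f}')
print(f'p (manual) = {p_manual:.4f}, p (SciPy) = {p_scipy:.4f}')
```

When the t-statistic is positive, the one-sided p-value is half of the two-sided one, so the one-sided test rejects the main hypothesis more readily in the direction we care about.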

Datasets with different variances

This criterion can be generalized to the case where the variances of the datasets are different; this version is known as Welch's t-test. Let's look at an example of how it can be implemented in code:


import numpy as np
import scipy.stats as stats

# Generate two independent Gaussian datasets with different means and variances
np.random.seed(123)
data1 = np.random.normal(loc=5, scale=2, size=100)
data2 = np.random.normal(loc=4.95, scale=4, size=100)

# Compute the two-sample t-test
t_stat, p_value = stats.ttest_ind(data1, data2, equal_var=False)

# Define the significance level
alpha = 0.05

# Compare the p-value with the significance level and print the result
if p_value < alpha:
    print('Reject the null hypothesis that the means are equal')
else:
    print('Fail to reject the null hypothesis that the means are equal')

In the code above, we passed equal_var=False to stats.ttest_ind so that the test does not assume equal variances.
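
To make it clearer what changes when equal_var=False, here is a hedged sketch that computes the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with SciPy's output. The variable names (se1, se2, t_welch, df_welch, p_welch) are illustrative, not part of SciPy.

```python
import numpy as np
import scipy.stats as stats

# Same synthetic data as in the example above
np.random.seed(123)
data1 = np.random.normal(loc=5, scale=2, size=100)
data2 = np.random.normal(loc=4.95, scale=4, size=100)

n1, n2 = len(data1), len(data2)

# Estimated variance of each sample mean: s^2 / n
se1 = data1.var(ddof=1) / n1
se2 = data2.var(ddof=1) / n2

# Welch t-statistic: the variances are not pooled
t_welch = (data1.mean() - data2.mean()) / np.sqrt(se1 + se2)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))

# Two-sided p-value from the t-distribution with df_welch degrees of freedom
p_welch = 2 * stats.t.sf(abs(t_welch), df_welch)

# Reference values from SciPy
t_scipy, p_scipy = stats.ttest_ind(data1, data2, equal_var=False)

print(f't (manual) = {t_welch:.4f}, t (SciPy) = {t_scipy:.4f}')
print(f'p (manual) = {p_welch:.4f}, p (SciPy) = {p_scipy:.4f}')
print(f'Approximate degrees of freedom: {df_welch:.1f}')
```

The degrees of freedom are generally not an integer here and are at most n1 + n2 - 2, which makes the test more conservative than the pooled version when the variances differ.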
