Advanced Probability Theory
Comparing Means of Two Different Datasets
A common and important applied task is comparing the mathematical expectations of two independent numerical datasets.
In the general case this problem is non-trivial, but under certain conditions it can be solved relatively simply.
Let's consider the following conditions:
We have two independent numerical datasets, each following a Gaussian distribution, with equal variances (we may not know the actual value of the variance, but we must be sure the variances are equal). We want to test the following hypotheses:
Main hypothesis (H0): the expectations of these datasets are equal.
Alternative hypothesis (H1): the expectation of the X dataset is greater than that of the Y dataset.
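Written formally:

H_0 : \mathbb{E}[X] = \mathbb{E}[Y], \qquad H_1 : \mathbb{E}[X] > \mathbb{E}[Y]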
Statistical criterion
If the conditions described above are met, we can use the following criterion to test this hypothesis:
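This is the standard two-sample Student's t-statistic with pooled variance:

T = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{\tfrac{1}{n} + \tfrac{1}{m}}},
\qquad
S_p^2 = \frac{(n - 1) S_X^2 + (m - 1) S_Y^2}{n + m - 2}

where n and m are the sample sizes and S_X^2, S_Y^2 are the sample variances. Under the main hypothesis, T follows Student's t-distribution with n + m - 2 degrees of freedom, and large positive values of T speak in favor of the alternative hypothesis.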
Python implementation
Let's generate two independent datasets with different mean values and test the hypothesis:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate two independent Gaussian datasets with different means and equal variances
np.random.seed(123)  # Set seed for reproducibility
data1 = np.random.normal(loc=5, scale=2, size=500)  # Generate first dataset
data2 = np.random.normal(loc=4.7, scale=2, size=500)  # Generate second dataset

# Compute the two-sample t-test. By default, it checks the two-tailed hypothesis
t_stat, p_value = stats.ttest_ind(data1, data2)

# Define the significance level
alpha = 0.05

# Compare the p-value with the significance level and print the result
if p_value < alpha:
    print('Reject the null hypothesis that the means are equal')
else:
    print('Fail to reject the null hypothesis that the means are equal')

# Plot the critical regions of the t-test
fig, ax = plt.subplots()  # Create figure and axis objects
x = np.linspace(-4, 4, 1000)  # Generate x values for plotting
y = stats.t.pdf(x, df=len(data1) + len(data2) - 2)  # Compute t-distribution PDF
ax.plot(x, y, label='t-distribution')  # Plot t-distribution
t_crit_left = stats.t.ppf(alpha / 2, len(data1) + len(data2) - 2)  # Compute left critical value
t_crit_right = stats.t.ppf(1 - alpha / 2, len(data1) + len(data2) - 2)  # Compute right critical value
ax.axvline(t_crit_left, color='r', linestyle='--', label='t-critical left')  # Plot left critical value
ax.axvline(t_crit_right, color='g', linestyle='--', label='t-critical right')  # Plot right critical value
ax.axvline(t_stat, color='k', linestyle='--', label='t-statistic')  # Plot t-statistic
ax.legend()  # Add legend to the plot
plt.show()  # Show the plot
We see that the value of the criterion falls into the right critical region, so we conclude that the mathematical expectation of the first dataset is greater than that of the second.
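Note that stats.ttest_ind performs a two-tailed test by default. To test our one-sided alternative directly, SciPy versions 1.6 and later accept an alternative argument; a minimal sketch using the same data:

import numpy as np
import scipy.stats as stats

np.random.seed(123)
data1 = np.random.normal(loc=5, scale=2, size=500)
data2 = np.random.normal(loc=4.7, scale=2, size=500)

# One-sided test of the alternative E[X] > E[Y] (requires SciPy >= 1.6)
t_stat, p_value = stats.ttest_ind(data1, data2, alternative='greater')
print(t_stat, p_value)  # a small p-value supports the one-sided alternative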
Datasets with different variances
There is also a generalization of this criterion for the case when the variances of the datasets are different (known as Welch's t-test). Let's look at an example of how this can be implemented in code:
import numpy as np
import scipy.stats as stats

# Generate two independent Gaussian datasets with different means and variances
np.random.seed(123)
data1 = np.random.normal(loc=5, scale=2, size=100)
data2 = np.random.normal(loc=4.95, scale=4, size=100)

# Compute the two-sample t-test without assuming equal variances
t_stat, p_value = stats.ttest_ind(data1, data2, equal_var=False)

# Define the significance level
alpha = 0.05

# Compare the p-value with the significance level and print the result
if p_value < alpha:
    print('Reject the null hypothesis that the means are equal')
else:
    print('Fail to reject the null hypothesis that the means are equal')
In the code above, we passed equal_var=False to the stats.ttest_ind method to perform hypothesis testing for datasets with different variances.
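To make it clearer what equal_var=False computes, here is a minimal sketch that reproduces Welch's t-statistic and the Welch–Satterthwaite degrees of freedom by hand, mirroring the example above:

import numpy as np
import scipy.stats as stats

np.random.seed(123)
data1 = np.random.normal(loc=5, scale=2, size=100)
data2 = np.random.normal(loc=4.95, scale=4, size=100)

# Welch's t-statistic: each sample keeps its own variance (no pooling)
n1, n2 = len(data1), len(data2)
v1, v2 = data1.var(ddof=1), data2.var(ddof=1)
t_stat = (data1.mean() - data2.mean()) / np.sqrt(v1 / n1 + v2 / n2)

# Welch-Satterthwaite approximation for the degrees of freedom
df = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

# Two-tailed p-value from the t-distribution
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(t_stat, p_value)  # matches stats.ttest_ind(data1, data2, equal_var=False)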