Comparing Means of Two Different Datasets | Testing of Statistical Hypotheses
Advanced Probability Theory

Comparing Means of Two Different Datasets

An important applied task is comparing the mathematical expectations of two independent numerical datasets.
In the general case this problem is non-trivial, but under certain conditions it can be solved relatively simply. Consider the following conditions:

We have two independent numerical datasets that follow Gaussian distributions with equal variances (the actual value of the variance may be unknown, but we must be sure that the variances are equal). We want to test the following hypothesis:

Main hypothesis: expectations of these datasets are equal.

Alternative hypothesis: the expectation of the X dataset is greater than that of the Y dataset.

Statistical criterion

If the conditions described above are met, then we can use the following criterion to check this hypothesis:
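
Under these conditions the standard choice is the pooled two-sample Student's t-statistic, which is also what stats.ttest_ind computes with its default settings. A sketch of the criterion follows; the notation (sample sizes n and m, sample means \bar{X} and \bar{Y}, unbiased sample variances S_X^2 and S_Y^2, significance level \alpha) is assumed here for illustration:

```latex
\[
T = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{\dfrac{1}{n} + \dfrac{1}{m}}},
\qquad
S_p^2 = \frac{(n-1)\,S_X^2 + (m-1)\,S_Y^2}{n + m - 2}.
\]
% Under the main hypothesis, T follows Student's t-distribution
% with n + m - 2 degrees of freedom.
% For the one-sided alternative E[X] > E[Y], reject the main hypothesis when
\[
T > t_{1-\alpha}(n + m - 2),
\]
% where t_{1-\alpha}(k) is the (1 - \alpha)-quantile of the t-distribution
% with k degrees of freedom.
```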

Python implementation

Let's generate two independent datasets with different mean values and try to check the hypothesis:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate two independent Gaussian datasets with different means and equal variances
np.random.seed(123)  # Set seed for reproducibility
data1 = np.random.normal(loc=5, scale=2, size=500)  # Generate first dataset
data2 = np.random.normal(loc=4.7, scale=2, size=500)  # Generate second dataset

# Compute the two-sample t-test. By default, it checks the two-tailed hypothesis
t_stat, p_value = stats.ttest_ind(data1, data2)

# Define the significance level
alpha = 0.05

# Compare the p-value with the significance level and print the result
if p_value < alpha:
    print('Reject the null hypothesis that the means are equal')
else:
    print('Fail to reject the null hypothesis that the means are equal')

# Plot the critical regions of the t-test
df = len(data1) + len(data2) - 2  # Degrees of freedom of the pooled t-test
fig, ax = plt.subplots()  # Create figure and axis objects
x = np.linspace(-4, 4, 1000)  # Generate x values for plotting
y = stats.t.pdf(x, df=df)  # Compute t-distribution PDF
ax.plot(x, y, label='t-distribution')  # Plot t-distribution
t_crit_left = stats.t.ppf(alpha / 2, df)  # Compute left critical value
t_crit_right = stats.t.ppf(1 - alpha / 2, df)  # Compute right critical value
ax.axvline(t_crit_left, color='r', linestyle='--', label='t-critical left')  # Plot left critical value
ax.axvline(t_crit_right, color='g', linestyle='--', label='t-critical right')  # Plot right critical value
ax.axvline(t_stat, color='k', linestyle='--', label='t-statistic')  # Plot t-statistic
ax.legend()  # Add legend to the plot
plt.show()  # Show the plot

The test statistic falls into the right critical region, so we reject the main hypothesis and conclude that the mathematical expectation of the first dataset is greater than that of the second.
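
The alternative hypothesis stated earlier is one-sided, while the code above runs the default two-tailed test. The following is a minimal sketch of the one-sided version, assuming SciPy 1.6 or newer (where stats.ttest_ind accepts the alternative argument). It also recomputes the pooled statistic by hand as a cross-check; the variable names and the use of np.random.default_rng are illustrative choices, not part of the original example.

```python
import numpy as np
import scipy.stats as stats

# Illustrative data: same parameters as the example above, different generator
rng = np.random.default_rng(123)
x = rng.normal(loc=5, scale=2, size=500)
y = rng.normal(loc=4.7, scale=2, size=500)

n, m = len(x), len(y)

# Pooled variance estimate (valid only under the equal-variance assumption)
pooled_var = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2)

# Pooled two-sample t-statistic
t_manual = (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / n + 1 / m))

# One-sided p-value: probability of a value at least this large under the main hypothesis
p_manual = stats.t.sf(t_manual, df=n + m - 2)

# The same test via SciPy; alternative='greater' means mean(x) > mean(y)
t_scipy, p_scipy = stats.ttest_ind(x, y, alternative='greater')

print(f't (manual) = {t_manual:.4f}, t (scipy) = {t_scipy:.4f}')
print(f'p (manual) = {p_manual:.4f}, p (scipy) = {p_scipy:.4f}')
```

With a one-sided alternative the whole significance level sits in the right tail, so the critical value is the (1 - alpha)-quantile rather than the (1 - alpha/2)-quantile used in the plot above.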

Datasets with different variances

This criterion can be generalized to the case where the variances of the datasets differ. Let's look at an example of how this can be implemented in code:

import numpy as np
import scipy.stats as stats

# Generate two independent Gaussian datasets with different means and variances
np.random.seed(123)
data1 = np.random.normal(loc=5, scale=2, size=100)
data2 = np.random.normal(loc=4.95, scale=4, size=100)

# Compute the two-sample t-test without assuming equal variances
t_stat, p_value = stats.ttest_ind(data1, data2, equal_var=False)

# Define the significance level
alpha = 0.05

# Compare the p-value with the significance level and print the result
if p_value < alpha:
    print('Reject the null hypothesis that the means are equal')
else:
    print('Fail to reject the null hypothesis that the means are equal')

In the code above, we passed equal_var=False to the stats.ttest_ind function so that the test does not assume equal variances (this variant is known as Welch's t-test).
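
To make the effect of equal_var=False more concrete, here is a minimal sketch that computes the unequal-variance statistic by hand and compares it with SciPy's result. It assumes the usual Welch formulation, in which each sample contributes its own variance estimate and the degrees of freedom come from the Welch-Satterthwaite approximation; the variable names and the random generator are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Illustrative data: same parameters as the example above, different generator
rng = np.random.default_rng(123)
x = rng.normal(loc=5, scale=2, size=100)
y = rng.normal(loc=4.95, scale=4, size=100)

# Estimated variance of each sample mean (no pooling)
vx = x.var(ddof=1) / len(x)
vy = y.var(ddof=1) / len(y)

# Welch t-statistic
t_manual = (x.mean() - y.mean()) / np.sqrt(vx + vy)

# Welch-Satterthwaite approximation of the degrees of freedom
df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))

# Two-tailed p-value, matching the default of stats.ttest_ind
p_manual = 2 * stats.t.sf(abs(t_manual), df=df)

t_scipy, p_scipy = stats.ttest_ind(x, y, equal_var=False)

print(f't (manual) = {t_manual:.4f}, t (scipy) = {t_scipy:.4f}')
print(f'p (manual) = {p_manual:.4f}, p (scipy) = {p_scipy:.4f}')
print(f'approximate degrees of freedom = {df:.2f}')
```

The degrees of freedom here are generally not an integer and never exceed n + m - 2, which makes the test somewhat more conservative than the pooled version when the variances really do differ.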

Can we use Student's t-tests with non-Gaussian data?

Section 4. Chapter 3