Comparing Means of Two Different Datasets | Testing of Statistical Hypotheses

An important applied task is to compare the mathematical expectations (means) of two independent numerical datasets.
In the general case this problem is non-trivial, but under certain conditions it becomes relatively simple. Consider the following setting:

We have two independent numerical datasets, X and Y, drawn from Gaussian distributions with equal variances (the actual value of the variance may be unknown, but the variances must be equal). We want to test the following hypothesis:

Main hypothesis: expectations of these datasets are equal.

Alternative hypothesis: the expectation of the X dataset is greater than that of the Y dataset.

Statistical criterion

If the conditions described above are met, we can test this hypothesis with the two-sample Student's t-test.
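As a reference sketch, the statistic usually associated with this test is the pooled two-sample t statistic. The symbols n and m (sample sizes of X and Y), s_X^2 and s_Y^2 (unbiased sample variances), and s_p (pooled standard deviation) are notation assumed here, not taken from the original text:

```latex
t = \frac{\bar{X} - \bar{Y}}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}},
\qquad
s_p^2 = \frac{(n-1)\, s_X^2 + (m-1)\, s_Y^2}{n + m - 2}
```

Under the main hypothesis, t follows Student's t-distribution with n + m - 2 degrees of freedom. For the one-sided alternative above, the main hypothesis is rejected at significance level alpha when t exceeds the (1 - alpha) quantile of that distribution.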

Python implementation

Let's generate two independent datasets with different mean values and test the hypothesis:


import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate two independent Gaussian datasets with different means and equal variances
np.random.seed(123)  # Set seed for reproducibility
data1 = np.random.normal(loc=5, scale=2, size=500)  # Generate first dataset
data2 = np.random.normal(loc=4.7, scale=2, size=500)  # Generate second dataset

# Run the two-sample t-test; by default, ttest_ind is two-tailed and assumes equal variances
t_stat, p_value = stats.ttest_ind(data1, data2)

# Define the significance level
alpha = 0.05

# Compare the p-value with the significance level and print the result
if p_value < alpha:
    print('Reject the null hypothesis that the means are equal')
else:
    print('Fail to reject the null hypothesis that the means are equal')

# Plot the critical regions of the t-test
fig, ax = plt.subplots()  # Create figure and axis objects
df = len(data1) + len(data2) - 2  # Degrees of freedom of the pooled t-test
x = np.linspace(-4, 4, 1000)  # Generate x values for plotting
y = stats.t.pdf(x, df=df)  # Compute t-distribution PDF
ax.plot(x, y, label='t-distribution')  # Plot t-distribution
t_crit_left = stats.t.ppf(alpha/2, df)  # Compute left critical value
t_crit_right = stats.t.ppf(1-alpha/2, df)  # Compute right critical value
ax.axvline(t_crit_left, color='r', linestyle='--', label='t-critical left')  # Plot left critical value
ax.axvline(t_crit_right, color='g', linestyle='--', label='t-critical right')  # Plot right critical value
ax.axvline(t_stat, color='k', linestyle='--', label='t-statistic')  # Plot t-statistic
ax.legend()  # Add legend to the plot
plt.show()  # Show the plot

The test statistic falls into the right critical region, so we reject the main hypothesis of equal means. Since the statistic is positive, we conclude that the mathematical expectation of the first dataset is greater than that of the second.
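The alternative hypothesis stated earlier is one-sided, while the default call to `stats.ttest_ind` is two-tailed. Below is a minimal sketch of the one-sided version, assuming SciPy 1.6 or newer (where the `alternative` parameter is available). The pooled statistic is also computed by hand as a cross-check; the names `sp2`, `t_manual`, and `p_manual` are illustrative only.

```python
import numpy as np
import scipy.stats as stats

# Same data as in the example above
np.random.seed(123)
data1 = np.random.normal(loc=5, scale=2, size=500)
data2 = np.random.normal(loc=4.7, scale=2, size=500)

n, m = len(data1), len(data2)

# Pooled variance estimate (valid under the equal-variance assumption)
sp2 = ((n - 1) * data1.var(ddof=1) + (m - 1) * data2.var(ddof=1)) / (n + m - 2)

# Pooled two-sample t statistic and its one-sided (right-tail) p-value
t_manual = (data1.mean() - data2.mean()) / np.sqrt(sp2 * (1 / n + 1 / m))
p_manual = stats.t.sf(t_manual, df=n + m - 2)

# The same test via SciPy; alternative='greater' tests mean(data1) > mean(data2)
t_scipy, p_scipy = stats.ttest_ind(data1, data2, alternative='greater')

print('manual:', t_manual, p_manual)
print('scipy: ', t_scipy, p_scipy)
```

The two printed pairs should agree up to floating-point error. When the statistic is positive, the one-sided p-value is half of the two-tailed one, so the one-sided test rejects more readily in the direction of the alternative.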

Datasets with different variances

This criterion can be generalized to the case where the variances of the two datasets differ (Welch's t-test). Here is an example of how it can be implemented in code:


import numpy as np
import scipy.stats as stats

# Generate two independent Gaussian datasets with different means and variances
np.random.seed(123)
data1 = np.random.normal(loc=5, scale=2, size=100)
data2 = np.random.normal(loc=4.95, scale=4, size=100)

# Run Welch's two-sample t-test (equal_var=False drops the equal-variance assumption)
t_stat, p_value = stats.ttest_ind(data1, data2, equal_var=False)

# Define the significance level
alpha = 0.05

# Compare the p-value with the significance level and print the result
if p_value < alpha:
    print('Reject the null hypothesis that the means are equal')
else:
    print('Fail to reject the null hypothesis that the means are equal')

In the code above, we passed equal_var=False to the stats.ttest_ind function so that the test does not assume equal variances.
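To make the effect of `equal_var=False` concrete, here is a hedged sketch that recomputes the Welch statistic by hand and compares it with SciPy's result. The degrees of freedom use the Welch-Satterthwaite approximation; the names `a`, `b`, `df_welch`, `t_manual`, and `p_manual` are illustrative and not part of the original example.

```python
import numpy as np
import scipy.stats as stats

# Same data as in the example above
np.random.seed(123)
data1 = np.random.normal(loc=5, scale=2, size=100)
data2 = np.random.normal(loc=4.95, scale=4, size=100)

n, m = len(data1), len(data2)

# Per-sample variance of the mean: s^2 / sample size
a = data1.var(ddof=1) / n
b = data2.var(ddof=1) / m

# Welch statistic: the variances are not pooled
t_manual = (data1.mean() - data2.mean()) / np.sqrt(a + b)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = (a + b) ** 2 / (a ** 2 / (n - 1) + b ** 2 / (m - 1))

# Two-tailed p-value, matching the default of ttest_ind
p_manual = 2 * stats.t.sf(abs(t_manual), df=df_welch)

t_scipy, p_scipy = stats.ttest_ind(data1, data2, equal_var=False)

print('manual:', t_manual, p_manual, 'df =', df_welch)
print('scipy: ', t_scipy, p_scipy)
```

The approximate degrees of freedom are generally not an integer and never exceed n + m - 2, which makes the test somewhat more conservative than the pooled version when the variances differ.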


Section 4. Chapter 3
