Comparing Means of Two Different Datasets

An important applied task is comparing the mathematical expectations (means) of two independent numerical datasets.
In the general case this problem is non-trivial, but under certain conditions it can be solved relatively simply. Let's consider the following conditions:

We have two independent numerical datasets, each following a Gaussian distribution, and the two distributions have equal variances (we may not know the actual value of the variance, but we must be sure that the variances are equal). We want to test the following hypothesis:

Main (null) hypothesis: the expectations of the two datasets are equal.

Alternative hypothesis: the expectation of the first dataset (X) is greater than that of the second dataset (Y).
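
Since the approach relies on the variances being equal, it can help to check that assumption on the data first. The snippet below is only a rough sketch: it uses Levene's test from SciPy (`stats.levene`), which is one common choice and not necessarily the check intended here. The sample data, the seed, and the 0.05 threshold are illustrative assumptions. A large p-value does not prove that the variances are equal; it only means the data give no strong evidence against it.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical sample data: two Gaussian datasets with the same scale
rng = np.random.default_rng(0)
x = rng.normal(loc=5, scale=2, size=500)
y = rng.normal(loc=4.7, scale=2, size=500)

# Levene's test: the null hypothesis is that the variances are equal
levene_stat, levene_p = stats.levene(x, y)

# Illustrative decision rule (0.05 is an assumed threshold)
no_evidence_against_equal_var = levene_p >= 0.05
print('Levene p-value:', levene_p)
print('No evidence against equal variances:', no_evidence_against_equal_var)
```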

Statistical criterion

If the conditions described above are met, then we can use the following criterion to check this hypothesis:
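
For reference, the standard pooled-variance (Student's) two-sample t-statistic, which is what `stats.ttest_ind` computes by default, can be written as below. The symbols are notation introduced here for illustration: $n$ and $m$ are the sample sizes, $\bar{X}$ and $\bar{Y}$ the sample means, $s_X^2$ and $s_Y^2$ the unbiased sample variances, and $\alpha$ the significance level.

```latex
t = \frac{\bar{X} - \bar{Y}}{s_p \sqrt{\dfrac{1}{n} + \dfrac{1}{m}}},
\qquad
s_p^2 = \frac{(n-1)\, s_X^2 + (m-1)\, s_Y^2}{n + m - 2}

% Under the main hypothesis, t follows Student's t-distribution
% with n + m - 2 degrees of freedom.
% For the one-sided alternative E[X] > E[Y], reject the main hypothesis when
t > t_{1-\alpha,\; n+m-2}
```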

Python implementation

Let's generate two independent datasets with different mean values and try to check the hypothesis:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate two independent Gaussian datasets with different means and equal variances
np.random.seed(123)  # Set seed for reproducibility
data1 = np.random.normal(loc=5, scale=2, size=500)  # First dataset (X)
data2 = np.random.normal(loc=4.7, scale=2, size=500)  # Second dataset (Y)

# Compute the two-sample t-test. By default it is two-tailed, so we pass
# alternative='greater' (available in SciPy 1.6+) to test "mean of X > mean of Y"
t_stat, p_value = stats.ttest_ind(data1, data2, alternative='greater')

# Define the significance level
alpha = 0.05

# Compare the p-value with the significance level and print the result
if p_value < alpha:
    print('Reject the null hypothesis that the means are equal')
else:
    print('Fail to reject the null hypothesis that the means are equal')

# Plot the critical region of the one-sided t-test
df = len(data1) + len(data2) - 2  # Degrees of freedom
fig, ax = plt.subplots()  # Create figure and axis objects
x = np.linspace(-4, 4, 1000)  # Generate x values for plotting
y = stats.t.pdf(x, df=df)  # Compute t-distribution PDF
ax.plot(x, y, label='t-distribution')  # Plot t-distribution
t_crit_right = stats.t.ppf(1 - alpha, df)  # Right critical value
ax.axvline(t_crit_right, color='g', linestyle='--', label='t-critical right')  # Plot critical value
ax.axvline(t_stat, color='k', linestyle='--', label='t-statistic')  # Plot t-statistic
ax.legend()  # Add legend to the plot
plt.show()  # Show the plot

We see that the value of the test statistic falls into the right critical region, so we reject the main hypothesis and conclude that the mathematical expectation of the first dataset is greater than that of the second.
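
To make the computation behind `stats.ttest_ind` more concrete, here is a minimal sketch that calculates the pooled-variance t-statistic and the one-sided p-value by hand and compares them with SciPy's result. The variable names, the seed, and the use of `np.random.default_rng` are illustrative choices, so the generated numbers differ from those in the example above. The `alternative` argument requires SciPy 1.6 or newer.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical data with the same parameters as in the example above
rng = np.random.default_rng(123)
x = rng.normal(loc=5, scale=2, size=500)
y = rng.normal(loc=4.7, scale=2, size=500)
n, m = len(x), len(y)

# Pooled estimate of the common variance (ddof=1 gives unbiased sample variances)
pooled_var = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2)

# t-statistic and one-sided p-value P(T > t) with n + m - 2 degrees of freedom
t_manual = (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / n + 1 / m))
p_manual = stats.t.sf(t_manual, df=n + m - 2)

# The same test computed by SciPy
t_scipy, p_scipy = stats.ttest_ind(x, y, alternative='greater')

print('t-statistics match:', np.isclose(t_manual, t_scipy))
print('p-values match:', np.isclose(p_manual, p_scipy))
```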

Datasets with different variances

This criterion can also be generalized to the case where the variances of the two datasets differ (Welch's t-test). Let's look at an example of how this can be implemented in code:

import numpy as np
import scipy.stats as stats

# Generate two independent Gaussian datasets with different means and variances
np.random.seed(123)  # Set seed for reproducibility
data1 = np.random.normal(loc=5, scale=2, size=100)
data2 = np.random.normal(loc=4.95, scale=4, size=100)

# Compute the two-sample t-test without assuming equal variances (Welch's t-test)
t_stat, p_value = stats.ttest_ind(data1, data2, equal_var=False)

# Define the significance level
alpha = 0.05

# Compare the p-value with the significance level and print the result
if p_value < alpha:
    print('Reject the null hypothesis that the means are equal')
else:
    print('Fail to reject the null hypothesis that the means are equal')

In the code above, we passed equal_var=False to the stats.ttest_ind function so that the test is carried out without assuming that the two datasets have equal variances.
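
As a rough illustration of what `equal_var=False` changes, the sketch below computes Welch's t-statistic by hand, with degrees of freedom from the Welch-Satterthwaite approximation, and compares the result with SciPy. The variable names and the seed are illustrative assumptions. The test here is two-sided, matching the default of `stats.ttest_ind`.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical data with the same parameters as in the example above
rng = np.random.default_rng(123)
x = rng.normal(loc=5, scale=2, size=100)
y = rng.normal(loc=4.95, scale=4, size=100)

# Estimated variance of each sample mean (no pooling of variances)
var_mean_x = x.var(ddof=1) / len(x)
var_mean_y = y.var(ddof=1) / len(y)

# Welch's t-statistic
t_manual = (x.mean() - y.mean()) / np.sqrt(var_mean_x + var_mean_y)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = (var_mean_x + var_mean_y) ** 2 / (
    var_mean_x ** 2 / (len(x) - 1) + var_mean_y ** 2 / (len(y) - 1)
)

# Two-sided p-value
p_manual = 2 * stats.t.sf(abs(t_manual), df=df_welch)

# The same test computed by SciPy
t_scipy, p_scipy = stats.ttest_ind(x, y, equal_var=False)

print('t-statistics match:', np.isclose(t_manual, t_scipy))
print('p-values match:', np.isclose(p_manual, p_scipy))
```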
