Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Testing the Hypothesis of Independence of Two Random Variables | Testing of Statistical Hypotheses
Probability Theory Mastering
course content

Course Content

Probability Theory Mastering

Probability Theory Mastering

1. Additional Statements From The Probability Theory
2. The Limit Theorems of Probability Theory
3. Estimation of Population Parameters
4. Testing of Statistical Hypotheses

bookTesting the Hypothesis of Independence of Two Random Variables

In real-life tasks, it is often needed to analyze the dependence between different features. For example:

  1. Gender and political party affiliation: We can test whether there is a relationship between gender and political party affiliation;
  2. Education level and job satisfaction: We can test whether there is a relationship between education level and job satisfaction;
  3. Age and voting behavior: We can test whether there is a relationship between age and voting behavior;
  4. Income level and preferred mode of transportation: We can test whether there is a relationship between income level and preferred mode of transportation.

But how can we prove that the variables are independent if we are not dealing with the entire population but only with small samples of the corresponding variables? For this, we can use the chi-square independence criterion.

Hypothesis formulation

We can use this criterion to test the following hypothesis:
Main hypothesis: corresponding random variables are independent of each other.
Alternative hypothesis: there are some relationships between the considered random variables

Contingency table

To use the chi-square independence test we have to provide some data preprocessing - create a contingency table. A contingency table, also known as a cross-tabulation table, is a table used to summarize the categorical data from two or more variables. The table presents the joint distribution of the variables, including the frequency or count of each combination of categories for the variables. Let's look at the example:

1234567891011
import pandas as pd # Example data data = {'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'], 'Smoker': ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']} df = pd.DataFrame(data) # Create contingency table using pandas crosstab cont_table = pd.crosstab(df['Gender'], df['Smoker']) print(cont_table)
copy

The contingency matrix for continuous random variables is built a little differently. We first split our values into several discrete subsets and only then build the contingency matrix, for example:

1234567891011121314151617
import pandas as pd # Create a sample dataset data = pd.DataFrame({'age': [22, 45, 32, 19, 28, 57, 39, 41, 36, 24], 'income': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000]}) # Define the bins for age bins = [18, 25, 35, 45, 55, 65, 75] # Create a new column with the age bins data['age_group'] = pd.cut(data['age'], bins) # Create a new column with the income bins splitted on 3 equal parts data['income_group'] = pd.cut(data['income'], 3) # Create a contingency table contingency_table = pd.crosstab(data['age_group'], data['income_group']) print(contingency_table)
copy

Chi-square independence criterion in Python

Finally, let's use the chi-square independence criterion to check independence on a real dataset.

12345678910111213141516171819202122232425
import pandas as pd from scipy.stats import chi2_contingency data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/Advanced+Probability+course+media/heart.csv') # Calculate correlation between target and other features for i in data.columns: print('Covariance between ', i, ' and target is:', data.corr()['target'].loc[i]) # If the correlation is not close to zero than features have some linear relationships # fbs and target have very small correlation so there are no linear dependencies between them # Let's check hypothesis that fbs and hear disease occurrence are independent # Choose significance level alpha = 0.05 # Contingency table for discrete target and fbs cont_table = pd.crosstab(data['fbs'], data['target']) # Provide chi2 independence test chi2_stat, p_val, dof, expected = chi2_contingency(cont_table, correction=True) if p_val < alpha: print('\n Fbs and heart disease occurrence are dependant') else: print('\n Fbs and heart disease occurrence are independant')
copy
Can we create contingency table for continuous variables?

Can we create contingency table for continuous variables?

Select the correct answer

Everything was clear?

How can we improve it?

Thanks for your feedback!

Section 4. Chapter 6
some-alt