Advanced Probability Theory

Testing the Hypothesis of Independence of Two Random Variables

In real-world tasks, we often need to analyze the dependence between different features. For example:

Gender and political party affiliation: We can test whether there is a relationship between gender and political party affiliation;
Education level and job satisfaction: We can test whether there is a relationship between education level and job satisfaction;
Age and voting behavior: We can test whether there is a relationship between age and voting behavior;
Income level and preferred mode of transportation: We can test whether there is a relationship between income level and preferred mode of transportation.

But how can we test whether the variables are independent when we have only small samples of them rather than the entire population? For this, we can use the chi-square independence test.

Hypothesis formulation

We can use this test to check the following hypotheses:
Null (main) hypothesis: the random variables are independent of each other.
Alternative hypothesis: there is some relationship between the random variables.
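
For categorical variables, the two hypotheses can be written more formally. The block below is a sketch that assumes notation not used in the lesson: X and Y denote the two variables, and x_i and y_j denote their categories.

```latex
\begin{align*}
H_0 &:\; P(X = x_i,\, Y = y_j) = P(X = x_i)\,P(Y = y_j) \quad \text{for all } i, j \\
H_1 &:\; P(X = x_i,\, Y = y_j) \neq P(X = x_i)\,P(Y = y_j) \quad \text{for at least one pair } (i, j)
\end{align*}
```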

Contingency table

To use the chi-square independence test, we first need to preprocess the data by creating a contingency table. A contingency table, also known as a cross-tabulation table, summarizes categorical data from two or more variables. It presents the joint distribution of the variables as the frequency (count) of each combination of categories. Let's look at an example:


import pandas as pd

# Example data
data = {'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
        'Smoker': ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']}
df = pd.DataFrame(data)

# Create contingency table using pandas crosstab
cont_table = pd.crosstab(df['Gender'], df['Smoker'])

print(cont_table)
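
The test compares the observed counts in the contingency table with the counts we would expect if the variables were independent. As a minimal sketch, the expected count of each cell can be computed as (row total × column total) / n. The sketch reuses the toy data from the example above; the names `observed`, `row_totals`, `col_totals` and `expected` are illustrative.

```python
import numpy as np
import pandas as pd

# Same toy data as in the example above
df = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
    'Smoker': ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes'],
})

# Observed counts for every (Gender, Smoker) combination
observed = pd.crosstab(df['Gender'], df['Smoker'])

# Row totals, column totals and the overall sample size
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
n = observed.to_numpy().sum()

# Expected counts under independence: row total * column total / n
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / n,
    index=observed.index,
    columns=observed.columns,
)

print(observed)
print(expected)
```

For this toy sample, the observed and expected tables coincide (3 non-smokers and 2 smokers in each gender group), so the data show no sign of dependence.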

For continuous random variables, the contingency table is built a little differently: we first split the values into several discrete bins and only then build the table. For example:


import pandas as pd

# Create a sample dataset
data = pd.DataFrame({'age': [22, 45, 32, 19, 28, 57, 39, 41, 36, 24],
                     'income': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000]})

# Define the bins for age
bins = [18, 25, 35, 45, 55, 65, 75]

# Create a new column with the age bins
data['age_group'] = pd.cut(data['age'], bins)
# Create a new column with income split into 3 equal-width bins
data['income_group'] = pd.cut(data['income'], 3)
# Create a contingency table
contingency_table = pd.crosstab(data['age_group'], data['income_group'])

print(contingency_table)
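
Equal-width bins can leave some groups almost empty when the values are unevenly spread. One possible alternative, which the lesson itself does not use, is quantile-based binning with `pd.qcut`, which aims for groups of roughly equal size. The sketch below assumes the same income values as above; the labels `'low'`, `'medium'` and `'high'` are made up for illustration.

```python
import pandas as pd

# Same income values as in the example above
data = pd.DataFrame({'income': [50000, 60000, 70000, 80000, 90000,
                                100000, 110000, 120000, 130000, 140000]})

# Split income into 3 quantile-based bins (roughly equal number of rows in each)
data['income_group'] = pd.qcut(data['income'], 3, labels=['low', 'medium', 'high'])

# Count how many observations fall into each bin
group_sizes = data['income_group'].value_counts().sort_index()

print(group_sizes)
```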

Chi-square independence test in Python

Finally, let's use the chi-square independence test to check independence on a real dataset.


import pandas as pd
from scipy.stats import chi2_contingency

data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/Advanced+Probability+course+media/heart.csv')
# Calculate the correlation between the target and each feature
correlations = data.corr()['target']
for col in data.columns:
  print('Correlation between', col, 'and target is:', correlations.loc[col])

# If a correlation is not close to zero, then the features have some linear relationship
# fbs and target have a very small correlation, so there is no sign of a linear dependence between them
# Let's test the hypothesis that fbs and heart disease occurrence are independent

# Choose significance level
alpha = 0.05

# Contingency table for discrete target and fbs
cont_table = pd.crosstab(data['fbs'], data['target'])

# Perform the chi-square independence test
chi2_stat, p_val, dof, expected = chi2_contingency(cont_table, correction=True)

if p_val < alpha:
  # Reject the null hypothesis of independence
  print('\n Fbs and heart disease occurrence are dependent')
else:
  # Fail to reject the null hypothesis
  print('\n No evidence that fbs and heart disease occurrence are dependent')
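
To see what `chi2_contingency` computes, the statistic can be reproduced by hand. The following is a minimal sketch on a hypothetical 2x2 table (the counts are illustrative and are not taken from the heart dataset). It sums (observed − expected)² / expected over all cells and reads the p-value from a chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom. The comparison uses `correction=False`, because with `correction=True` SciPy applies Yates' continuity correction to tables with one degree of freedom, which gives a slightly smaller statistic.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x2 table of observed counts (illustrative numbers only)
observed = np.array([[30, 10],
                     [20, 40]])

# Row totals (column vector), column totals (row vector) and sample size
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts under independence: row total * column total / n
expected = row_totals @ col_totals / n

# Chi-square statistic, degrees of freedom and p-value computed by hand
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_manual, dof)

# The same test via SciPy, without the continuity correction
chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed, correction=False)

print('Manual:', chi2_manual, p_manual, dof)
print('SciPy: ', chi2_scipy, p_scipy, dof_scipy)
```

Both approaches should report the same statistic and p-value, which shows that the test only needs the contingency table, not the raw data.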

Section 4. Chapter 6