Probability Theory Mastering
What is a P-value?
The p-value is a probability used in statistical hypothesis testing. It is the probability of obtaining a test statistic at least as extreme as the one calculated from the sample data, assuming the null hypothesis is true. Thus, the p-value tells us whether the value of our criterion fell into the critical region: it did exactly when the p-value is smaller than the chosen significance level.
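To make the definition concrete, here is a minimal sketch of a two-sided p-value. It assumes that the test statistic follows the standard normal distribution under the null hypothesis; the helper name two_sided_p_value is hypothetical and used only for illustration.

```python
from scipy.stats import norm


def two_sided_p_value(z):
    """Probability of a statistic at least as extreme as z under N(0, 1).

    Assumption: the null distribution of the statistic is standard normal.
    """
    # norm.sf is the survival function, 1 - CDF; doubling covers both tails
    return 2 * norm.sf(abs(z))


# A statistic of 1.96 gives a p-value of roughly 0.05
print(two_sided_p_value(1.96))
```

The larger the absolute value of the statistic, the smaller the p-value, so extreme statistics count as stronger evidence against the null hypothesis.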
Hypothesis testing guideline
Step 1. We have samples and formulations of the main (null) and alternative hypotheses. First, we choose a significance level (the probability of a type I error) that we are willing to accept;
Step 2. We choose the criterion by which we will test the hypothesis. Knowing the distribution of our initial data, we determine how the values of this criterion will be distributed;
Step 3. We compute the value of the criterion (also called the test statistic) for our particular samples, after which we determine the p-value;
Note
If we cannot determine the real distribution of the criterion, we can use the empirical one instead. One of the methods for constructing the empirical distribution will be discussed in the penultimate chapter of this section.
Step 4. We reject the main hypothesis if the obtained p-value is less than the significance level. If the p-value is greater than the significance level, we fail to reject the main hypothesis: the data do not provide enough evidence against it, which is not the same as proving it right. If the p-value is very close to the significance level, the result is borderline and should be interpreted with caution.
In practice, tests for most common hypotheses have already been implemented, so we do not need to carry out all the steps by hand: we just get the p-value and compare it with the chosen significance level.
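The four steps can be sketched by hand and checked against a ready-made implementation. The sketch below is only an illustration under stated assumptions: the sample is synthetic (generated with a fixed seed), the hypothesized mean mu0 is made up, and a one-sample t-test is chosen as the criterion because its null distribution is known.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Step 1: a hypothetical sample and a significance level (both assumed)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=50)
alpha = 0.05
mu0 = 0.0  # main hypothesis: the population mean equals mu0

# Step 2: under the main hypothesis the statistic follows
# Student's t distribution with n - 1 degrees of freedom
n = sample.size
df = n - 1

# Step 3: compute the test statistic and the two-sided p-value
t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
p_manual = 2 * t.sf(abs(t_stat), df)

# Step 4: reject the main hypothesis only if the p-value is below alpha
reject = p_manual < alpha

# The library routine performs steps 2 and 3 for us
result = ttest_1samp(sample, popmean=mu0)
print('manual p-value: ', p_manual)
print('library p-value:', result.pvalue)
print('reject main hypothesis:', reject)
```

Both p-values should agree up to floating-point error, which is why in practice only the last comparison with alpha is written by hand.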
Example
Let's look at an example. In Section 3 Chapter 2, we estimated the parameters of the population based on the samples, making an assumption about the population's distribution. Let's now check whether our data is normally / exponentially distributed with the parameters we found.
from scipy.stats import kstest, norm, expon
import pandas as pd
import numpy as np

gaussian_samples = np.array(pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/Advanced+Probability+course+media/gaussian_samples.csv', names=['Value']))
expon_samples = np.array(pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/Advanced+Probability+course+media/expon_samples.csv', names=['Value']))

# Specify significance level
alpha = 0.05

# Perform Kolmogorov-Smirnov test for normal distribution with estimated params.
# Main hypothesis is that distributions are equal
# By default two-tailed hypothesis is tested: the alternative hypothesis is that distributions are not equal
test_statistic, p_value = kstest(gaussian_samples.flatten(), cdf=norm(loc=-0.042, scale=3.964).cdf)
if p_value > alpha:
    print('Data follows a normal distribution')
else:
    print('Data does not follow a normal distribution')

# Perform Kolmogorov-Smirnov test for exponential distribution with estimated param
test_statistic, p_value = kstest(expon_samples.flatten(), cdf=expon(scale=1 / 0.497).cdf)
if p_value > alpha:
    print('Data follows an exponential distribution')
else:
    print('Data does not follow an exponential distribution')
In the code above we:
- Imported the necessary datasets and specified the significance level alpha;
- Used the Kolmogorov-Smirnov criterion to check the hypothesis about the distribution of our samples;
- Used the kstest function to get the criterion value and the p-value;
- Passed our data as the first argument of kstest and the CDF of the normal/exponential distribution with the specified parameters as the second argument;
- Compared p_value with alpha to decide whether to reject the main hypothesis.
Note
There are many statistical tests for checking the distribution of samples. The most popular are the Shapiro-Wilk test (scipy.stats.shapiro), the Anderson-Darling test (scipy.stats.anderson), and the chi-squared goodness-of-fit test (scipy.stats.chisquare).
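As a brief illustration of two of these tests, the sketch below applies shapiro to a clearly non-normal sample and chisquare to a table of counts. All data here are assumptions made for the example: the skewed sample is synthetic, and the die-roll counts are invented.

```python
import numpy as np
from scipy.stats import shapiro, chisquare

alpha = 0.05

# Shapiro-Wilk: the main hypothesis is that the sample is normally distributed.
# The sample is synthetic and drawn from an exponential distribution.
rng = np.random.default_rng(1)
skewed = rng.exponential(scale=2.0, size=500)
shapiro_stat, shapiro_p = shapiro(skewed)
print('Shapiro-Wilk: reject normality =', shapiro_p < alpha)

# Chi-squared goodness of fit: the main hypothesis is that the die is fair.
# The observed counts are invented; expected counts must sum to the same total.
observed = np.array([18, 22, 16, 25, 19, 20])
expected = np.full(6, observed.sum() / 6)
chi2_stat, chi2_p = chisquare(f_obs=observed, f_exp=expected)
print('Chi-squared: reject fairness =', chi2_p < alpha)
```

In both cases the pattern is the same as with kstest: the function returns the criterion value and the p-value, and the p-value is compared with alpha.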