What is P-value? | Testing of Statistical Hypotheses

What is P-value?

The p-value is a probability used in statistical hypothesis testing. It is the probability of obtaining a test statistic at least as extreme as the one calculated from the sample data, assuming the null (main) hypothesis is true. Comparing the p-value with the significance level tells us whether the value of our criterion fell into the critical region.
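
As a minimal sketch of this definition, assume a test statistic that follows the standard normal distribution when the null hypothesis is true (a z-test). The observed value 2.1 and the names z_observed and alpha are made up for illustration. The two-sided p-value is the probability of a value at least as far from zero as the observed one:

```python
from scipy.stats import norm

# Hypothetical observed value of a test statistic that is N(0, 1) under the null hypothesis
z_observed = 2.1

# Two-sided p-value: P(|Z| >= |z_observed|) when the null hypothesis is true
p_value = 2 * norm.sf(abs(z_observed))

# Equivalent view: the critical region for a two-sided test at level alpha
alpha = 0.05
critical_value = norm.ppf(1 - alpha / 2)

print(f"p-value = {p_value:.4f}")
print(f"critical value = {critical_value:.3f}")
print("In critical region:", abs(z_observed) > critical_value)
```

The two views agree: the statistic lies in the critical region exactly when the p-value is below alpha.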

Hypothesis testing guideline

Step 1. We have samples and formulations of the main and alternative hypotheses. First, we choose an acceptable significance level (the probability of a type I error, that is, of rejecting a main hypothesis that is actually true);

Step 2. We choose the criterion by which we will test the hypothesis. Knowing the distribution of our initial data, we determine how the values of this criterion are distributed when the main hypothesis is true;

Step 3. We compute the value of the criterion (also called the test statistic) for our particular samples and then determine the p-value;

Note

If we cannot determine the real distribution of the criterion, we can use an empirical distribution instead. One of the methods for constructing the empirical distribution will be discussed in the penultimate chapter of this section.

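Step 4. We reject the main hypothesis if the obtained p-value is less than the significance level. If the p-value is greater than the significance level, we fail to reject the main hypothesis: the data do not contradict it, although this does not prove that it is right. By the convention used here, a p-value that differs very little from the given significance level is still treated as grounds for rejection, but such borderline results should be interpreted with caution.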
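
The four steps above can be sketched by hand for one concrete case. The example below is an illustration only: it assumes normally distributed data and uses a one-sample Student's t criterion, and the sample, the hypothesized mean mu_0, and all variable names are synthetic choices rather than part of the course material.

```python
import numpy as np
from scipy.stats import t

# Step 1: hypotheses and significance level.
# Main hypothesis: the population mean equals mu_0; alternative: it does not.
mu_0 = 5.0       # hypothetical value stated by the main hypothesis
alpha = 0.05     # acceptable probability of a type I error

# Synthetic sample for illustration only
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.8, scale=2.0, size=50)

# Step 2: criterion. For normal data and a true main hypothesis,
# T = (mean - mu_0) / (s / sqrt(n)) follows Student's t with n - 1 degrees of freedom.
n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)

# Step 3: value of the criterion for this sample and its two-sided p-value.
t_stat = (sample.mean() - mu_0) / standard_error
p_value = 2 * t.sf(abs(t_stat), df=n - 1)

# Step 4: decision.
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject the main hypothesis")
else:
    print("Fail to reject the main hypothesis")
```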

In practice, the corresponding methods are already implemented for most hypotheses, so we do not need to carry out every step by hand: we just obtain the p-value and compare it with the chosen significance level.
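
For instance, a ready-made test can return the p-value in a single call. The sketch below uses scipy.stats.ttest_ind to compare the means of two samples; the samples are synthetic and the variable names are chosen only for this illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

alpha = 0.05

# Two synthetic samples for illustration only
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=100)
group_b = rng.normal(loc=0.0, scale=1.0, size=100)

# Main hypothesis: both populations have the same mean (two-sided alternative by default)
test_statistic, p_value = ttest_ind(group_a, group_b)

print(f"statistic = {test_statistic:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject the main hypothesis")
else:
    print("Fail to reject the main hypothesis")
```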

Example

Let's look at an example. In Section 3 Chapter 2, we estimated the parameters of the population based on the samples, making an assumption about the population's distribution. Let's now check whether our data is normally / exponentially distributed with the parameters we found.

from scipy.stats import kstest, norm, expon
import pandas as pd
import numpy as np

gaussian_samples = np.array(pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/Advanced+Probability+course+media/gaussian_samples.csv', names=['Value']))
expon_samples = np.array(pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/Advanced+Probability+course+media/expon_samples.csv', names=['Value']))

# Specify significance level
alpha = 0.05

# Perform Kolmogorov-Smirnov test for normal distribution with estimated params. Main hypothesis is that distributions are equal
# By default two-tailed hypothesis is tested: the alternative hypothesis is that distributions are not equal
test_statistic, p_value = kstest(gaussian_samples.flatten(), cdf=norm(loc=-0.042, scale=3.964).cdf)
if p_value > alpha:
    print('Data follows a normal distribution')
else:
    print('Data does not follow a normal distribution')

# Perform Kolmogorov-Smirnov test for exponential distribution with estimated param
test_statistic, p_value = kstest(expon_samples.flatten(), cdf=expon(scale=1 / 0.497).cdf)
if p_value > alpha:
    print('Data follows an exponential distribution')
else:
    print('Data does not follow an exponential distribution')

In the code above we:

  1. Imported the necessary libraries, loaded the datasets, and specified the significance level alpha;
  2. Used the Kolmogorov-Smirnov criterion to check the hypothesis about the distribution of our samples;
    • used the kstest function to get the criterion value and the p-value;
    • passed our data as the first argument of kstest and the CDF of the normal/exponential distribution with the specified parameters as the second argument.
  3. Compared p_value with alpha to decide whether to reject the main hypothesis.
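
To see what the criterion value returned by kstest measures, it can be reproduced by hand: the two-sided Kolmogorov-Smirnov statistic is the largest vertical distance between the empirical CDF of the sample and the hypothesized CDF. The sketch below is an illustration on synthetic standard normal data, not on the course datasets, and the variable names are assumptions of this example.

```python
import numpy as np
from scipy.stats import kstest, norm

# Synthetic sample for illustration only, sorted so the empirical CDF is easy to write down
rng = np.random.default_rng(7)
sample = np.sort(rng.normal(loc=0.0, scale=1.0, size=300))
n = sample.size

# Hypothesized CDF evaluated at the sorted sample points
cdf_values = norm(loc=0.0, scale=1.0).cdf(sample)

# The empirical CDF jumps from (i - 1) / n to i / n at the i-th sorted point,
# so the largest gap is checked on both sides of every jump
d_plus = np.max(np.arange(1, n + 1) / n - cdf_values)
d_minus = np.max(cdf_values - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

result = kstest(sample, cdf=norm(loc=0.0, scale=1.0).cdf)
print(f"manual statistic = {d_manual:.6f}")
print(f"kstest statistic = {result.statistic:.6f}, p-value = {result.pvalue:.4f}")
```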

Note

There are many statistical tests for checking the distribution of samples. Among the most popular are the Shapiro-Wilk test (scipy.stats.shapiro), the Anderson-Darling test (scipy.stats.anderson), and the Chi-squared goodness of fit test (scipy.stats.chisquare).
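
As a brief sketch, two of these tests can be called as follows. The data are synthetic: the normal sample and the die-roll counts are invented for illustration and do not come from the course datasets.

```python
import numpy as np
from scipy.stats import shapiro, chisquare

alpha = 0.05
rng = np.random.default_rng(42)

# Shapiro-Wilk: the main hypothesis is that the sample comes from a normal distribution
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)
sw_stat, sw_p = shapiro(normal_sample)
print(f"Shapiro-Wilk: statistic = {sw_stat:.4f}, p-value = {sw_p:.4f}")

# Chi-squared goodness of fit: the main hypothesis is that a die is fair.
# Hypothetical counts of each face in 120 rolls.
observed = np.array([18, 22, 16, 25, 19, 20])
expected = np.full(6, observed.sum() / 6)   # 20 rolls per face for a fair die
chi2_stat, chi2_p = chisquare(f_obs=observed, f_exp=expected)
print(f"Chi-squared: statistic = {chi2_stat:.4f}, p-value = {chi2_p:.4f}")

if chi2_p < alpha:
    print("Reject the main hypothesis that the die is fair")
else:
    print("Fail to reject the main hypothesis that the die is fair")
```

Note that chisquare works on counts per category, so continuous data must be binned before it can be applied.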

Section 4. Chapter 2