Advanced Techniques in Probability and Statistics
The general population is a probability distribution that describes a real-life process or a single characteristic. For example, the heights of adult men in the United States are approximately normally distributed with a mean of 70 inches and a standard deviation of 3 inches. In this example, the normal distribution with a mean of 70 inches and a standard deviation of 3 inches is the general population of men's heights in the USA.
A sample in statistics is a value obtained from a larger set of data to represent the whole general population. If we are talking about men's heights in the USA, the samples are individual measured heights, each drawn from that normal distribution.
Thus each sample is essentially a random variable with a distribution given by the general population.
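As a minimal sketch of how such samples could be generated, the snippet below draws five heights from a normal distribution with a mean of 70 and a standard deviation of 3, using Python's standard `random` module. The seed, the number of samples, and the variable names are illustrative assumptions, not part of the original example.

```python
import random

# Parameters of the general population (taken from the text), in inches
MEAN_HEIGHT = 70.0
STD_HEIGHT = 3.0

# Fixed seed (an arbitrary choice) so that the run is reproducible
rng = random.Random(42)

# Each sample is one draw from the normal general population
samples = [rng.gauss(MEAN_HEIGHT, STD_HEIGHT) for _ in range(5)]

for height in samples:
    print(f"{height:.1f} inches")
```

Changing the seed gives a different set of heights, which is exactly what it means for each sample to be a random variable.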
In the example above, we first set the type and parameters of the general population and then generated the corresponding samples. In real analytics and data science tasks, we usually need to solve the inverse problem: we have samples generated from some general population, and we must determine which population they came from.
To do this, we follow these steps:
Step 1. First, determine whether we are dealing with a discrete or a continuous general population.
|Discrete samples|Continuous samples|
|4, 1, 10, 6, 23, 12|2.5, 33.3, 0.01, 10.002, 12, -1.1|
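One rough way to make the check in Step 1 concrete is to test whether every observed value is a whole number. The function name `looks_discrete` is hypothetical, and the rule is only a heuristic: continuous measurements that were rounded to integers would be misclassified, so knowledge of how the data was collected should take priority.

```python
def looks_discrete(values):
    """Heuristic: treat the data as discrete if every value is a whole number."""
    return all(float(v).is_integer() for v in values)


# The two example rows from the table above
discrete_samples = [4, 1, 10, 6, 23, 12]
continuous_samples = [2.5, 33.3, 0.01, 10.002, 12, -1.1]

print(looks_discrete(discrete_samples))
print(looks_discrete(continuous_samples))
```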
Step 2. Estimate what type of distribution the data follows. This can be done using visualization: for discrete data we build a frequency polygon, and for continuous data a histogram. We can then assume that the data follows the distribution whose PMF (for discrete data) or PDF (for continuous data) most closely resembles the frequency polygon or histogram.
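The numbers behind both plots can be computed without any plotting library. The sketch below assumes simulated data (die rolls for the discrete case, heights for the continuous case), a fixed seed, and ten equal-width bins; all of these are illustrative choices. It counts occurrences per value for the frequency polygon and samples per bin for the histogram, then prints a crude text histogram.

```python
import random
from collections import Counter

rng = random.Random(0)

# Discrete data: count how often each value occurs (heights of the frequency polygon)
discrete_samples = [rng.randint(1, 6) for _ in range(600)]
frequencies = Counter(discrete_samples)
for value in sorted(frequencies):
    print(value, frequencies[value])

# Continuous data: count samples per equal-width bin (heights of the histogram bars)
continuous_samples = [rng.gauss(70, 3) for _ in range(600)]
n_bins = 10
low, high = min(continuous_samples), max(continuous_samples)
bin_width = (high - low) / n_bins
bin_counts = [0] * n_bins
for x in continuous_samples:
    # The maximum lies on the right edge of the range, so clamp it into the last bin
    index = min(int((x - low) / bin_width), n_bins - 1)
    bin_counts[index] += 1

# Text histogram: one '#' per five samples in the bin
for i, count in enumerate(bin_counts):
    left = low + i * bin_width
    print(f"{left:6.1f} to {left + bin_width:6.1f}: {'#' * (count // 5)}")
```

A roughly flat set of counts for the die rolls suggests a uniform PMF, while a bell-shaped row of bars for the heights suggests a normal PDF.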
Step 3. As we mentioned in previous chapters, visualization alone is not enough to determine the type of distribution accurately. Therefore, after visualization, statistical tests (criteria) are usually applied to check more formally whether the data is consistent with one general population or another.
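The text does not name a specific criterion, so the following is only one possible illustration: a hand-written one-sample Kolmogorov-Smirnov check using the standard library. The function name `ks_statistic`, the simulated data, and the two candidate populations are assumptions. The threshold 1.36 / sqrt(n) is the usual large-sample approximation for a 5% significance level, and it applies only when the candidate's parameters are fixed in advance rather than estimated from the same data.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of the samples and a reference CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    gap = 0.0
    for i, x in enumerate(ordered, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x
        gap = max(gap, i / n - f, f - (i - 1) / n)
    return gap


rng = random.Random(1)
samples = [rng.gauss(70, 3) for _ in range(500)]

# Approximate 5% critical value for large n (parameters fixed in advance)
critical = 1.36 / math.sqrt(len(samples))

# Compare the data against the true population and against a shifted one
d_normal = ks_statistic(samples, NormalDist(70, 3).cdf)
d_shifted = ks_statistic(samples, NormalDist(75, 3).cdf)

print(f"critical value: {critical:.3f}")
print(f"N(70, 3): D = {d_normal:.3f}, rejected = {d_normal > critical}")
print(f"N(75, 3): D = {d_shifted:.3f}, rejected = {d_shifted > critical}")
```

A statistic above the critical value is evidence against the candidate population; a statistic below it means only that the data does not contradict the candidate, not that the candidate is proven.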
Step 4. After you have determined the type of distribution, you need to estimate its parameters. For example, if the histogram suggests that the data is normally distributed, you need to estimate the mean and the variance; if it suggests that the data is exponentially distributed, you need to estimate the lambda parameter, and so on. In addition to point estimates, confidence intervals are also constructed for the corresponding parameters.
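As a sketch of Step 4 under stated assumptions, the snippet below computes point estimates for simulated normal and exponential data and an approximate 95% confidence interval for the normal mean. The simulated data, the seed, the sample size of 1000, and the use of the normal approximation for the interval are all illustrative choices.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

rng = random.Random(7)

# Normal case: estimate the mean and the variance
heights = [rng.gauss(70, 3) for _ in range(1000)]
mean_hat = fmean(heights)
std_hat = stdev(heights)  # sample standard deviation (n - 1 in the denominator)
var_hat = std_hat ** 2

# Approximate 95% confidence interval for the mean (normal approximation)
z = NormalDist().inv_cdf(0.975)  # about 1.96
margin = z * std_hat / math.sqrt(len(heights))
ci_mean = (mean_hat - margin, mean_hat + margin)

# Exponential case: a common estimate of lambda is 1 / sample mean
waits = [rng.expovariate(0.5) for _ in range(1000)]
lambda_hat = 1 / fmean(waits)

print(f"mean estimate: {mean_hat:.2f}, variance estimate: {var_hat:.2f}")
print(f"95% CI for the mean: ({ci_mean[0]:.2f}, {ci_mean[1]:.2f})")
print(f"lambda estimate: {lambda_hat:.3f}")
```

The estimates should land close to the values used in the simulation (70, 9, and 0.5), and the width of the interval shrinks as the number of samples grows.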
In this section, we focus on the fourth step in more detail: how to estimate the parameters of the general population and how to assess how good those estimates are.
Why do we need to build a histogram/frequency polygon of our samples?