# General population. Samples. Population parameters.

The **general population** is the complete set of objects we want to study, described by how the quantity of interest is distributed across it. For instance, the heights of adult men in the United States average about `70` inches, with a standard deviation of about `3` inches. So, if we measured a group of men in the USA, their heights would follow this distribution.

A **sample** is a small group drawn from the general population that we use to draw conclusions about the whole. For example, if we want to know the heights of men in the USA, we might measure the heights of a few men from different places. These measured heights are our samples.

Thus each sample value is essentially **a random variable with a distribution given by the general population**.

In the example above, we first set the general population type and parameters, then generated the corresponding samples. In real analytics and data science tasks, we usually need to **solve the inverse problem**: we have samples generated from some general population, and we must determine which particular population produced them.
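The forward direction described above (fix the population, then generate samples) can be sketched with Python's standard library. This is a minimal illustration, assuming the heights are normally distributed with mean `70` and standard deviation `3`; the normality assumption, the constant names, and the helper `draw_sample` are introduced here for illustration only.

```python
import random
import statistics

# Assumed population parameters from the heights example (inches).
POPULATION_MEAN = 70.0
POPULATION_STD = 3.0


def draw_sample(size, seed=None):
    """Draw `size` independent heights from the assumed normal population."""
    rng = random.Random(seed)
    return [rng.gauss(POPULATION_MEAN, POPULATION_STD) for _ in range(size)]


sample = draw_sample(1000, seed=42)

# The sample statistics land close to, but not exactly on, the population values.
print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"sample std:  {statistics.stdev(sample):.2f}")
```

The inverse problem starts from a list like `sample` alone, without knowing `POPULATION_MEAN` or `POPULATION_STD`.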

### To do this, we follow these steps:

**Step 1**. First, determine whether we are dealing with a **discrete** or **continuous** general population;

**Step 2**. Estimate what **type of distribution** the data follows. This can be done with visualization: for discrete data we build a **frequency polygon**, and for continuous data a **histogram**. We can then assume that the data follows the distribution whose PMF/PDF most closely resembles the frequency polygon/histogram;

**Step 3**. As we mentioned in previous chapters, visualization is not enough to accurately determine the type of distribution. Therefore, after visualization, various **statistical criteria** (hypothesis tests) are usually applied to check more formally whether the data is consistent with one or another general population;

**Step 4**. After you have determined the type of distribution, you need to estimate **the parameters of this distribution**. For example, if you assume from the histogram that the data is normally distributed, you need to estimate the mean and the variance; if you assume that the data is exponentially distributed, you need to estimate the lambda (rate) parameter, and so on. In addition to point estimates of the parameters, **confidence intervals** are also constructed for them.
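Steps 2 and 4 can be sketched as follows for continuous data. This is a rough illustration, not a complete workflow: the data is simulated as a stand-in for real observations, the helper names `bin_counts` and `estimate_normal_parameters` are hypothetical, the confidence interval uses a large-sample normal approximation, and the formal test from Step 3 is omitted.

```python
import random
import statistics
from statistics import NormalDist


def bin_counts(data, bins=8):
    """Step 2: count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(data), max(data)
    width = (high - low) / bins
    counts = [0] * bins
    for value in data:
        # Clamp the index so the maximum value lands in the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


def estimate_normal_parameters(data, confidence=0.95):
    """Step 4: point estimates plus a confidence interval for the mean."""
    n = len(data)
    mean = statistics.mean(data)
    variance = statistics.variance(data)  # unbiased sample variance
    # Normal-approximation interval; an assumption that suits large samples.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * (variance / n) ** 0.5
    return mean, variance, (mean - margin, mean + margin)


# Simulated stand-in for observed data (assumed normal, mean 70, std 3).
rng = random.Random(0)
data = [rng.gauss(70, 3) for _ in range(500)]

counts = bin_counts(data)
mean, variance, interval = estimate_normal_parameters(data)
print("histogram bin counts:", counts)
print(f"estimated mean: {mean:.2f}, estimated variance: {variance:.2f}")
print(f"95% confidence interval for the mean: ({interval[0]:.2f}, {interval[1]:.2f})")
```

A roughly bell-shaped sequence of bin counts is what would suggest a normal distribution in Step 2; only then do the normal-specific estimates in Step 4 make sense.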

In this section, we will focus on the fourth step in more detail and consider how to estimate the parameters of the general population and how to determine how good the estimates are.
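One informal way to see what "how good an estimate is" can mean is to repeat the sampling many times and look at how much the estimate varies. The sketch below assumes a normal population with mean `70` and standard deviation `3`; the helper `spread_of_sample_mean` is hypothetical and serves only as an illustration.

```python
import random
import statistics

rng = random.Random(7)


def spread_of_sample_mean(sample_size, repetitions=500):
    """Standard deviation of the sample mean across repeated samples."""
    means = []
    for _ in range(repetitions):
        sample = [rng.gauss(70, 3) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


# Larger samples give estimates that vary less from one sample to the next.
spreads = {n: spread_of_sample_mean(n) for n in (10, 100, 400)}
for n, spread in spreads.items():
    print(f"sample size {n:>3}: spread of the mean estimate = {spread:.3f}")
```

For this population, the spread is expected to be close to `3 / sqrt(n)`, so it shrinks as the sample size grows.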
