General population. Samples. Population parameters.

The general population is the full set of objects we want to study, described by how the quantity of interest is distributed across it. For instance, the heights of adult men in the United States are centered around 70 inches, with a standard deviation of about 3 inches. So, if we measured a group of men in the USA, their heights would follow this distribution.

A sample is a subset of the general population that we actually observe and use to draw conclusions about the whole. For example, if we want to know the heights of men in the USA, we might measure the heights of a few men from different places. These measured heights are our samples.


import numpy as np
# Specify parameters of general population
mean = 70
std = 3
# Specify number of samples to generate
size = 10
# Generate samples
samples = np.random.normal(mean, std, size)
print('Samples are: ', samples)

Thus each sample is essentially a random variable whose distribution is given by the general population.
In the example above, we first set the type and parameters of the general population, then generated the corresponding samples. In real analytics and data science tasks, we usually need to solve the inverse problem: we have samples generated from some general population, and we must determine which particular population produced them.

To do this, we follow these steps:

Step 1. Determine whether we are dealing with a discrete or a continuous general population;

Step 2. Estimate which type of distribution the data follows. This can be done with visualization: for discrete data we build a frequency polygon, and for continuous data a histogram. We can then assume that the data follows the distribution whose PMF/PDF most closely resembles the frequency polygon/histogram;


import numpy as np
import matplotlib.pyplot as plt

# Generating 1000 samples from a continuous normal distribution with mean 70 and standard deviation 3
samples_cont = np.random.normal(70, 3, 1000)
# Generate 500 samples from a discrete distribution
samples_disc = np.random.choice(['Red', 'Blue', 'Green', 'Black', 'White'], size=500, p=[0.3, 0.2, 0.15, 0.15, 0.2]) 

# Creating the figure and subplots
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Plotting the histogram on the first subplot
axes[0].hist(samples_cont, bins=20, alpha=0.5, color='blue', density=True)
axes[0].set_xlabel('Values')
axes[0].set_ylabel('Density')
axes[0].set_title('Histogram of Continuous Variable')

# Plotting the frequency polygon on the second subplot

# Calculate the empirical probabilities
# np.unique returns the categories in sorted order, so the labels are taken
# from its output to keep them aligned with the counts
colors, counts = np.unique(samples_disc, return_counts=True)
probs = counts / len(samples_disc)

# Plot the frequency polygon
axes[1].plot(colors, probs, marker='o', linestyle='--')
axes[1].set_title('Frequency Polygon')
axes[1].set_xlabel('Color')
axes[1].set_ylabel('Estimated Probability')

# Adjusting the layout and displaying the plot
plt.tight_layout()
plt.show()

Step 3. As mentioned in previous chapters, visualization alone is not enough to determine the type of distribution accurately. Therefore, after visualization, statistical tests are usually applied to show more formally that the data belongs to a particular general population;

Step 4. Once the type of distribution is determined, estimate its parameters. For example, if the histogram suggests that the data is normally distributed, estimate the mean and the variance; if the data appears to be exponentially distributed, estimate the lambda parameter, and so on. In addition to point estimates, confidence intervals are also constructed for the corresponding parameters.
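
The formal check described in Step 3 can be sketched as follows. This is only an illustration, not the course's prescribed method: it assumes SciPy is available and uses the Shapiro-Wilk test (`scipy.stats.shapiro`) as one possible normality test. The seed, the sample size, and the 0.05 significance level are arbitrary choices made for the example.

```python
import numpy as np
from scipy import stats

# Generate samples from a normal general population (assumed parameters: mean 70, std 3)
rng = np.random.default_rng(0)
samples = rng.normal(70, 3, 1000)

# Shapiro-Wilk test: the null hypothesis is that the data come from a normal distribution
statistic, p_value = stats.shapiro(samples)
print('Test statistic:', statistic)
print('p-value:', p_value)

# Compare the p-value with a chosen significance level (0.05 is a common convention)
alpha = 0.05
if p_value < alpha:
    print('Reject the hypothesis that the data are normally distributed')
else:
    print('No evidence against normality at the chosen significance level')
```

A large p-value does not prove that the data are normal; it only means the test found no evidence against that assumption.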

In this section, we will focus on the fourth step in more detail and consider how to estimate the parameters of the general population and how to determine how good the estimates are.
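
As a preview of the fourth step, here is a minimal sketch of point estimation and a confidence interval for the mean. It assumes the data are normally distributed and that SciPy is available. The variable names (`mean_hat`, `std_hat`, `sem`) and the 95% confidence level are illustrative choices, not something fixed by the text above.

```python
import numpy as np
from scipy import stats

# Samples from a general population whose parameters we pretend not to know
rng = np.random.default_rng(42)
samples = rng.normal(70, 3, 1000)

# Point estimates of the population mean and standard deviation
mean_hat = samples.mean()
std_hat = samples.std(ddof=1)  # ddof=1 divides by n - 1 (sample standard deviation)

# 95% confidence interval for the mean based on the Student's t-distribution
n = len(samples)
sem = std_hat / np.sqrt(n)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean_hat, scale=sem)

print('Estimated mean:', mean_hat)
print('Estimated standard deviation:', std_hat)
print('95% confidence interval for the mean:', (ci_low, ci_high))
```

The estimates should land close to the values 70 and 3 used to generate the data, and the interval gives a sense of how far the estimated mean may be from the true one.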


Section 3. Chapter 1
