Learn Understanding Sampling | Probability & Statistics
Mathematics for Data Science

Understanding Sampling

Definition

Sampling is the process of selecting a subset of data from a larger population to gain insights and make inferences about the whole. Since it is often impractical or impossible to collect data from an entire population, sampling allows for efficient analysis while maintaining the quality and accuracy of the results.

Simple Random Sampling

Every member of the population has an equal chance of being selected.
This is like drawing names out of a hat.

On a single draw, each individual is equally likely to be picked:

$$P(\text{select any individual}) = \frac{1}{N}$$

Where:

  • $N$ = population size.

Example 1:

You have a class of 30 students. You want to randomly select 5 for a survey.

Solution: Use a random number generator to select 5 unique numbers between 1 and 30. On any single draw each student has a $\tfrac{1}{30}$ chance of being picked, so each student has a $\tfrac{5}{30} = \tfrac{1}{6}$ chance of ending up in the sample of 5.
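A minimal sketch of Example 1 in Python, assuming the students are simply numbered 1 to 30. The function name `draw_simple_random_sample` and the seed value are illustrative choices, not part of the lesson.

```python
import random

def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Return `sample_size` distinct IDs drawn uniformly from 1..population_size."""
    rng = random.Random(seed)  # the seed is optional; it only makes the draw repeatable
    # random.sample draws without replacement, so no ID can appear twice
    return rng.sample(range(1, population_size + 1), sample_size)

selected = draw_simple_random_sample(30, 5, seed=42)
print(sorted(selected))
```

Drawing without replacement is what guarantees 5 unique students; drawing five independent random numbers could repeat an ID.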

Example 2:

You have a class of 30 students and want to select 5 to participate in a survey.

  • Total population: $N = 30$;
  • Sample size: $n = 5$.

What is the probability that Alice and Bob are both selected?

Total number of ways to choose 5 students from 30:

$$\binom{30}{5} = 142506$$

Number of favorable samples containing both Alice and Bob:
Fix Alice and Bob, then choose 3 more from the remaining 28:

$$\binom{28}{3} = 3276$$

So the probability is:

$$P = \frac{\binom{28}{3}}{\binom{30}{5}} = \frac{3276}{142506} = \frac{2}{87} \approx 0.023$$
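The counting argument above can be checked numerically. This is a small sketch using Python's standard `math.comb` and `fractions.Fraction`; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

total_samples = comb(30, 5)      # all ways to choose 5 students from 30
favorable_samples = comb(28, 3)  # Alice and Bob fixed, 3 more from the other 28

# Fraction keeps the result exact instead of rounding to a float
p_both = Fraction(favorable_samples, total_samples)
print(p_both, float(p_both))
```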

Stratified Sampling

The population is divided into meaningful subgroups (strata), and random samples are taken from each.

$$n_h = \frac{N_h}{N} \times n$$

Where:

  • $N_h$ - size of subgroup $h$;
  • $N$ - total population size;
  • $n$ - total sample size;
  • $n_h$ - sample size from subgroup $h$.

Example:

A class has 30 students: 18 males and 12 females. You want to sample 10 students proportionally:

  • From males: $\tfrac{18}{30} \times 10 = 6$;
  • From females: $\tfrac{12}{30} \times 10 = 4$.

Why it's good: Ensures representation of key subgroups.
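The proportional allocation formula can be sketched as a short function. The name `proportional_allocation` and the dictionary layout are assumptions made for illustration, and rounding to the nearest integer is one possible choice: when the shares are not whole numbers, the rounded counts may not add up exactly to $n$.

```python
def proportional_allocation(stratum_sizes, total_sample_size):
    """Return n_h = N_h / N * n for each stratum, rounded to the nearest integer."""
    population_size = sum(stratum_sizes.values())
    return {
        name: round(size * total_sample_size / population_size)
        for name, size in stratum_sizes.items()
    }

allocation = proportional_allocation({"male": 18, "female": 12}, 10)
print(allocation)
```

After computing the allocation, a simple random sample of the given size would be drawn inside each stratum.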

Cluster Sampling

The population is split into groups (clusters), and entire clusters are randomly selected.

$$c = \text{number of clusters to sample}$$

Where:

  • Clusters are pre-existing groups (e.g., classrooms, teams);
  • You randomly pick entire clusters, not individuals.

Example 1:

Your school has 5 classrooms. You want a sample of 25 students, but surveying individuals is too time-consuming.

Solution: Randomly select 1 classroom (since each has ~25 students) and survey all.

Example 2:

A university has 20 dorm buildings, each housing 50 students. You randomly select 4 dorms and survey everyone inside.

  • Number of clusters: $N = 20$;
  • Selected clusters: $n = 4$;
  • Students per dorm: $M = 50$;
  • Total students sampled: $n \times M = 200$.

What's the probability that a specific student (e.g., Sarah) is included?
It equals the probability that her dorm is selected:

$$P(\text{Sarah selected}) = \frac{4}{20} = 0.2$$
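A short sketch of this dorm example, assuming the dorms are labelled 1 to 20 (the labels and the fixed seed are illustrative). It draws 4 whole clusters and confirms Sarah's inclusion probability by counting the samples that contain her dorm.

```python
import random
from fractions import Fraction
from math import comb

dorm_ids = list(range(1, 21))             # 20 dorms, labelled 1..20 (assumed labels)
rng = random.Random(0)                    # fixed seed only for repeatability
selected_dorms = rng.sample(dorm_ids, 4)  # pick 4 whole clusters at random
students_sampled = len(selected_dorms) * 50  # everyone in each chosen dorm is surveyed

# Samples containing Sarah's dorm: fix that dorm, choose 3 more from the other 19
p_sarah = Fraction(comb(19, 3), comb(20, 4))
print(selected_dorms, students_sampled, p_sarah)
```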

Complex case:
If 10 dorms have 30 students and 10 have 70 students, and you select 4 dorms randomly, what's the expected sample size?

Let:

  • $D_{30} = 10$ dorms with 30 students;
  • $D_{70} = 10$ dorms with 70 students.

Expected sample size:

$$E = \frac{10}{20} \cdot (4 \times 30) + \frac{10}{20} \cdot (4 \times 70) = 200$$

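So even though the dorms differ in size, the expected sample size is still 200, because the average dorm size is still 50. The actual sample size is no longer fixed: it ranges from 120 to 280, depending on which dorms are drawn.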
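The expected value can also be verified by brute force. This sketch enumerates every possible choice of 4 dorms out of 20 (4845 combinations, all equally likely) and averages the resulting sample sizes; the variable names are illustrative.

```python
from itertools import combinations
from statistics import mean

dorm_sizes = [30] * 10 + [70] * 10  # 10 small dorms and 10 large dorms

# Every choice of 4 dorms is equally likely, so average over all of them
sample_sizes = [sum(chosen) for chosen in combinations(dorm_sizes, 4)]

expected_size = mean(sample_sizes)
print(expected_size, min(sample_sizes), max(sample_sizes))
```

The spread between the smallest and largest possible sample shows why unequal clusters make the realized sample size less predictable, even when its expectation is unchanged.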

Systematic Sampling

Select every $k$-th item from a list.

$$k = \frac{N}{n}$$

Where:

  • $N$ - total population;
  • $n$ - sample size desired;
  • $k$ - sampling interval.

Example:

A list of 1000 customers. You want a sample of 100. So:

$$k = \frac{1000}{100} = 10$$

Pick a random start point between 1 and 10 (e.g., 7), then select every 10th customer: 7, 17, 27, and so on up to 997.

Why it's good: Easy to implement, and it spreads the sample evenly across the list.
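A minimal sketch of systematic sampling, assuming the list positions are numbered from 1 and that $N$ is divisible by $n$. The function name `systematic_sample` and the seed are illustrative.

```python
import random

def systematic_sample(population_size, sample_size, seed=None):
    """Return 1-based positions: a random start in 1..k, then every k-th item."""
    k = population_size // sample_size          # sampling interval (assumes N divisible by n)
    start = random.Random(seed).randint(1, k)   # random start within the first interval
    return list(range(start, population_size + 1, k))

positions = systematic_sample(1000, 100, seed=1)
print(positions[:3], len(positions))
```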

All Methods Applied to One Problem

Problem Setup:
You're studying cafeteria satisfaction at a school with 300 students across 10 classrooms (30 per room). You want a sample of 30 students.

  • Simple random: randomly pick 30 names from the full list;
  • Stratified: if 60% are boys and 40% girls, sample 18 boys and 12 girls;
  • Cluster: randomly select 1 class (30 students) and survey all;
  • Systematic: pick a random start between 1 and 10, then take every 10th student from an ordered list.
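The four approaches can be sketched side by side on this setup. The roster below is hypothetical: student IDs run from 1 to 300, rooms are assigned 30 per classroom in ID order, and the first 180 students are labelled "boy" purely to produce the 60/40 split.

```python
import random

rng = random.Random(7)  # fixed seed only so the sketch is repeatable

# Hypothetical roster: IDs 1..300, 30 per classroom, first 180 labelled "boy"
students = [
    {"id": i, "room": (i - 1) // 30 + 1, "sex": "boy" if i <= 180 else "girl"}
    for i in range(1, 301)
]

# Simple random: 30 students from the full list
simple_random = rng.sample(students, 30)

# Stratified: 18 boys and 12 girls, each drawn at random within its stratum
boys = [s for s in students if s["sex"] == "boy"]
girls = [s for s in students if s["sex"] == "girl"]
stratified = rng.sample(boys, 18) + rng.sample(girls, 12)

# Cluster: one whole classroom
chosen_room = rng.randint(1, 10)
cluster = [s for s in students if s["room"] == chosen_room]

# Systematic: random start in 1..10, then every 10th student
start = rng.randint(1, 10)
systematic = students[start - 1::10]
```

All four lists contain 30 students, but only the stratified one is guaranteed to match the 60/40 split, and only the cluster one comes from a single classroom.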

Summary

  • Sampling reduces data collection effort while allowing generalization;
  • Simple random and stratified sampling generally give the most representative samples;
  • Cluster sampling is efficient but works best when clusters are similar;
  • Systematic sampling is simple and practical;
  • Convenience sampling is risky and should be avoided when possible;
  • Always document your sampling method in real-world analysis.

Which method ensures every individual has an equal chance of selection?
