
Sampling methods are essential in data analysis because they allow you to draw meaningful conclusions from large or complex datasets without analyzing every single data point. By carefully selecting a representative subset, you can efficiently estimate trends, test hypotheses, and make predictions while saving time and computational resources. Proper sampling reduces bias, improves accuracy, and ensures that your analysis reflects the true characteristics of the entire population. Understanding and applying the right sampling strategy is a foundational skill for any data analyst working with Python.

Simple Random Sampling

Simple random sampling is a fundamental technique in data analysis that selects a subset of data from a larger population so that every item has an equal chance of being chosen. Because selection does not depend on any property of the items, the procedure is free of selection bias, and estimates such as the sample mean are unbiased for the population value. Any single sample can still differ from the population by chance, which is why sample size matters for valid statistical inference.

Definition

Simple random sampling is a process where each member of the population has an equal probability of being selected for the sample. The selection is entirely by chance, often using random number generators or shuffling methods.

Purpose

Ensure each data point has an equal chance of selection;
Reduce bias in the sampling process;
Provide a reliable basis for statistical inference and hypothesis testing;
Simplify the process of analyzing large datasets by working with manageable, representative samples.

Example: Simple Random Sampling in Python

Suppose you have a dataset containing sales records for a retail store, and you want to analyze a random subset of 100 transactions to estimate average sales. You can use the pandas library to perform simple random sampling.


import numpy as np
import pandas as pd

# Simulate a DataFrame with 1,000 sales records
rng = np.random.default_rng(42)  # seeded so the simulated amounts are reproducible
sales_data = pd.DataFrame({
    'transaction_id': range(1, 1001),
    'amount': rng.uniform(10, 500, 1000)  # Random sales amounts
})

# Draw a simple random sample of 100 transactions
sample = sales_data.sample(n=100, random_state=42)

# Compare the size of the full dataset with the size of the sample
print(f'Dataset shape: {sales_data.shape}')
print(f'Sample shape: {sample.shape}')

In this example:

The sample method randomly selects 100 rows from the sales_data DataFrame;
The random_state parameter ensures reproducibility by setting a seed for the random number generator;
The resulting sample can be used to estimate average sales or perform further analysis.
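
As a rough illustration of that last point, the sketch below compares a sample mean with the mean of the full dataset. The data is simulated, and the names population, srs, and standard_error are chosen here for illustration rather than taken from the lesson.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator so the run is repeatable

# Hypothetical population: 1,000 transaction amounts between 10 and 500
population = pd.DataFrame({
    "transaction_id": range(1, 1001),
    "amount": rng.uniform(10, 500, 1000),
})

# Simple random sample of 100 rows, drawn without replacement
srs = population.sample(n=100, random_state=0)

population_mean = population["amount"].mean()
sample_mean = srs["amount"].mean()

# Standard error of the sample mean (ignores the finite-population correction)
standard_error = srs["amount"].std(ddof=1) / np.sqrt(len(srs))

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f} (standard error ~ {standard_error:.2f})")
```

The sample mean will usually land within a couple of standard errors of the population mean; drawing a larger sample shrinks the standard error.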

Simple random sampling is a reliable first step in many data analysis workflows, helping you draw conclusions about a population based on a manageable and unbiased subset of data.

Stratified Sampling

Stratified sampling is a data analysis strategy that divides a population into distinct subgroups, called strata, based on shared characteristics. You then randomly sample from each stratum in proportion to its size in the overall population. This ensures that every subgroup is adequately represented in your sample, which leads to more accurate and reliable analysis results.

When to Use Stratified Sampling

Use stratified sampling when:

Your population contains distinct subgroups that may differ in important ways;
You want to ensure that each subgroup is represented proportionally in your sample;
The characteristic used for stratification is relevant to your analysis and could influence the outcome;
Simple random sampling might miss or underrepresent smaller but important subgroups.

Common scenarios:

Surveying customers from different regions, age groups, or income levels;
Ensuring representation of minority groups in research studies;
Analyzing product feedback across multiple product lines.

Python Example: Stratified Sampling with scikit-learn

Suppose you have a dataset of students with their gender and test scores, and you want to sample 40% of the data while preserving the gender ratio.


import pandas as pd
from sklearn.model_selection import train_test_split

# Create a sample DataFrame
data = {
    'student_id': range(1, 11),
    'gender': ['male', 'female', 'female', 'male', 'female', 'male', 'female', 'male', 'female', 'male'],
    'score': [85, 90, 78, 88, 92, 75, 80, 89, 95, 84]
}
df = pd.DataFrame(data)

# Perform stratified sampling based on 'gender'
stratified_sample, _ = train_test_split(
    df,
    test_size=0.6,  # Keep 40%
    stratify=df['gender'],
    random_state=42
)

print(stratified_sample)

Key points:

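The stratify parameter makes train_test_split keep the gender proportions of df in both parts of the split, as closely as the row counts allow; here the 4 sampled rows contain 2 male and 2 female students;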
Use stratified sampling whenever subgroup representation is critical for your analysis validity.
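
If scikit-learn is not available, proportional stratified sampling can also be sketched with pandas alone. The example below assumes pandas 1.1 or later, where groupby(...).sample is available, and uses a made-up customers table with an imbalanced region column.

```python
import pandas as pd

# Hypothetical customer table with an imbalanced 'region' column
customers = pd.DataFrame({
    "customer_id": range(1, 101),
    "region": ["north"] * 60 + ["south"] * 30 + ["west"] * 10,
})

# Draw 20% of the rows from every region (proportional allocation)
stratified = customers.groupby("region").sample(frac=0.2, random_state=42)

# The region shares in the sample match the shares in the full table
print(customers["region"].value_counts(normalize=True).sort_index())
print(stratified["region"].value_counts(normalize=True).sort_index())
```

Each region contributes the same fraction of its rows, so the small west group is guaranteed to appear in the sample, which a simple random sample of 20 rows does not guarantee.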

Cluster Sampling

Cluster sampling is a probability sampling technique used when it is difficult or costly to collect data from an entire population. In cluster sampling, you divide the population into separate groups called clusters. You then randomly select some clusters and collect data from every member within those clusters.

Definition

Cluster sampling involves:

Dividing the population into distinct, non-overlapping clusters;
Randomly selecting a subset of these clusters;
Including all individuals from the chosen clusters in the sample.

This method is especially useful when the population is geographically dispersed or when a complete list of all members is unavailable.

Advantages

Reduces travel and administrative costs when the population is spread out geographically;
Simplifies data collection by focusing on groups rather than individuals;
Useful when a complete list of all population members is difficult to obtain;
Can be more practical and efficient than simple random sampling in certain scenarios.

Cluster Sampling Example in Python

Suppose you want to survey students in a school district consisting of several schools. Each school represents a cluster. You randomly select a few schools and survey all students in those schools.

Here is a Python example using pandas and numpy:


import pandas as pd
import numpy as np

# Create a mock dataset: 5 schools, each with 10 students
np.random.seed(42)
schools = ['School_A', 'School_B', 'School_C', 'School_D', 'School_E']
data = {
    'student_id': range(1, 51),
    'school': np.repeat(schools, 10),
    'score': np.random.randint(60, 100, 50)
}
df = pd.DataFrame(data)

# Step 1: List all unique clusters (schools)
unique_schools = df['school'].unique()

# Step 2: Randomly select 2 clusters (schools)
selected_schools = np.random.choice(unique_schools, size=2, replace=False)
print(f"Selected clusters: {selected_schools}")

# Step 3: Select all students from the chosen clusters
cluster_sample = df[df['school'].isin(selected_schools)]
print(cluster_sample)

This code:

Creates a sample dataset of students with their school and scores;
Randomly selects two schools as clusters;
Includes all students from those selected schools in the sample.

Cluster sampling is a practical approach when direct sampling is not feasible, and it helps you efficiently gather representative data from large or dispersed populations.
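
The same steps can be wrapped in a small reusable helper. This is a minimal sketch: the function name cluster_sample and its parameters are hypothetical, and it covers only one-stage cluster sampling, where every member of a chosen cluster is kept.

```python
import numpy as np
import pandas as pd

def cluster_sample(df, cluster_col, n_clusters, seed=None):
    """Return all rows that belong to n_clusters randomly chosen clusters."""
    rng = np.random.default_rng(seed)
    clusters = list(df[cluster_col].unique())
    if n_clusters > len(clusters):
        raise ValueError("n_clusters exceeds the number of available clusters")
    # Choose whole clusters without replacement
    chosen = rng.choice(clusters, size=n_clusters, replace=False)
    # Keep every row whose cluster label was chosen
    return df[df[cluster_col].isin(chosen)]

# Hypothetical data: 5 schools with 10 students each
students = pd.DataFrame({
    "student_id": range(1, 51),
    "school": np.repeat(["A", "B", "C", "D", "E"], 10),
})

sample = cluster_sample(students, "school", n_clusters=2, seed=0)
print(sample["school"].value_counts())
```

Passing a seed makes the choice of clusters repeatable, which helps when the analysis has to be rerun.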

Systematic Sampling

Systematic sampling is a probability sampling method where you select items from an ordered population at regular intervals. It is simple and repeatable: once a random starting point is chosen, no further random draws are needed.

Definition

Systematic sampling involves choosing every kth element from a list or sequence, where k is a fixed interval calculated based on the desired sample size. You start from a randomly chosen starting point within the first k elements, then continue selecting every kth item.

Process

Arrange the population in a list or sequence;
Decide the sample size you want;
Calculate the sampling interval k by dividing the population size by the desired sample size;
Randomly select a starting point between 0 and k-1;
Select every kth item from the starting point until you reach your sample size.

This method spreads the sample evenly across the ordered population. It works well when the ordering is unrelated to the quantity being measured; if the list follows a repeating pattern whose length matches the interval k, the sample can be biased.
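
One thing worth checking before using a fixed interval is whether the ordering itself repeats. The sketch below uses made-up daily sales with a weekly cycle and deliberately sets the interval to 7 (both are assumptions for illustration) to show how the result then depends entirely on the starting point.

```python
import numpy as np

# Hypothetical daily sales for 10 weeks: days 5 and 6 of each week sell more
days = np.arange(70)
sales = np.where(days % 7 >= 5, 200.0, 100.0)

k = 7  # sampling interval equal to the length of the weekly cycle
for start in (0, 5):
    sample = sales[start::k]
    print(f"start={start}: sample mean = {sample.mean():.1f}")
# start=0: sample mean = 100.0
# start=5: sample mean = 200.0

print(f"population mean = {sales.mean():.1f}")
# population mean = 128.6
```

Neither starting point recovers the population mean, because every selected day falls on the same position in the week. Choosing an interval that does not line up with the cycle, or shuffling the list first, avoids this.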

Python Example: Systematic Sampling

Here is how you can perform systematic sampling in Python with numpy array slicing.


import numpy as np

# Example population: 100 integers
population = np.arange(1, 101)

sample_size = 10
population_size = len(population)

# Calculate sampling interval
k = population_size // sample_size

# Randomly select a starting point
start = np.random.randint(0, k)

# Select every k-th item
systematic_sample = population[start::k][:sample_size]

print("Systematic sample:", systematic_sample)

This code generates a sample of 10 numbers from a population of 100 by selecting every 10th item, starting from a randomly chosen index within the first interval. The result is a sample that is evenly distributed across the entire population.
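
The same idea carries over to a pandas DataFrame. The helper below is a sketch under stated assumptions: the name systematic_sample is hypothetical, the rows are taken in their existing order, and the interval is the integer part of the population size divided by the sample size.

```python
import numpy as np
import pandas as pd

def systematic_sample(df, sample_size, seed=None):
    """Select every k-th row of df after a random start in the first interval."""
    if not 0 < sample_size <= len(df):
        raise ValueError("sample_size must be between 1 and len(df)")
    rng = np.random.default_rng(seed)
    k = len(df) // sample_size        # sampling interval
    start = int(rng.integers(0, k))   # random start in [0, k)
    # Slice by position, then trim in case the division left extra rows
    return df.iloc[start::k].head(sample_size)

# Hypothetical table of 100 orders
orders = pd.DataFrame({"order_id": range(1, 101)})
picked = systematic_sample(orders, sample_size=10, seed=1)
print(picked["order_id"].tolist())
```

Because the slice is positional, sort or otherwise fix the row order before calling the helper so that the interval has a well-defined meaning.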

Multi-stage Sampling

Multi-stage sampling is a complex sampling technique that involves selecting samples in multiple steps or stages, rather than drawing all samples at once. This approach is useful when dealing with large, geographically dispersed populations or when a complete list of all population members is unavailable.

Concept

Multi-stage sampling combines several sampling methods, such as cluster sampling and simple random sampling, in successive steps;
At each stage, you select groups (clusters) or individuals from the previous stage's selection;
This method reduces cost and effort, especially for large-scale surveys.

Application

Suppose you want to survey households in a country:

Select a random sample of regions (first stage);
Within each selected region, randomly choose cities (second stage);
Within each city, randomly select neighborhoods (third stage);
Finally, randomly select households within each neighborhood (fourth stage).

This approach lets you efficiently manage resources while ensuring the sample represents the target population.

Python Example: Simulating Multi-stage Sampling

Suppose you have data for three regions, each with five cities, and each city contains 100 households. You want to sample:

2 regions;
2 cities from each selected region;
5 households from each selected city.


import numpy as np
import pandas as pd

# Create synthetic population data
data = []
for region in range(1, 4):
    for city in range(1, 6):
        for household in range(1, 101):
            data.append({
                "region": f"Region_{region}",
                "city": f"City_{city}",
                "household_id": f"H{region}{city}{household}"
            })
population = pd.DataFrame(data)

# Use one seeded generator for every stage so the whole run is reproducible
rng = np.random.default_rng(42)

# First stage: sample 2 regions
sampled_regions = rng.choice(population['region'].unique(), 2, replace=False)

# Second stage: sample 2 cities from each selected region
sampled_cities = []
for region in sampled_regions:
    cities = population[population['region'] == region]['city'].unique()
    sampled_cities.extend([(region, city) for city in rng.choice(cities, 2, replace=False)])

# Third stage: sample 5 households from each selected city
sampled_households = []
for region, city in sampled_cities:
    households = population[(population['region'] == region) & (population['city'] == city)]
    # Passing the shared generator gives each city an independent draw;
    # a fixed integer seed would pick the same row positions in every city
    sampled_households.append(households.sample(5, random_state=rng))

# Concatenate all sampled households
df_sample = pd.concat(sampled_households)
print(df_sample[['region', 'city', 'household_id']])

This code demonstrates how to implement multi-stage sampling using numpy and pandas. You first select regions, then cities within those regions, and finally households within those cities, resulting in a manageable and representative sample.
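
As a quick check of this design, the expected sample size and each household's chance of selection can be worked out from the stage fractions. The variable names below are illustrative; the counts come from the example above.

```python
from fractions import Fraction

# Design from the example: 3 regions, 5 cities per region, 100 households per city
regions, cities_per_region, households_per_city = 3, 5, 100
n_regions, n_cities, n_households = 2, 2, 5

sample_size = n_regions * n_cities * n_households
population_size = regions * cities_per_region * households_per_city

# Probability that a given household is selected at all three stages
p_select = (Fraction(n_regions, regions)
            * Fraction(n_cities, cities_per_region)
            * Fraction(n_households, households_per_city))

print(sample_size)                                          # 20
print(p_select)                                             # 1/75
print(p_select == Fraction(sample_size, population_size))   # True
```

Every household has the same selection probability here because all regions and cities are the same size. With unequal cluster sizes the probabilities differ, and estimates generally need weights to compensate.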
