Summary  
This chapter covers computing descriptive statistics such as mean, median, mode, and standard deviation using code, as well as detecting outliers by comparing data points to statistical thresholds.

General domain of usage  
Environmental data analysis

Descriptive statistics provide a foundation for understanding the main characteristics of environmental datasets. In environmental science, you often work with variables such as pollutant concentrations, temperature, or rainfall, which are measured repeatedly over time or across different locations. Key descriptive statistics include the **mean** (average value), **median** (middle value when sorted), **mode** (most frequently occurring value), and **standard deviation** (a measure of how spread out the values are). These statistics help you quickly summarize the central tendency and variability of data, which is essential for monitoring environmental quality, detecting unusual events, and informing policy decisions.
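For reference, the mean and standard deviation described above can be written out for a set of readings $x_1, \dots, x_n$. The symbols $\bar{x}$ and $s$ are notation introduced here, not taken from the course material. The $n - 1$ divisor shown is the sample form, which is what pandas' `.std()` computes by default (`ddof=1`).

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
$$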

```python
import pandas as pd

# Example pollutant concentration data (in micrograms per cubic meter)
data = {
    "PM2.5": [12, 15, 14, 16, 18, 120, 13, 15, 14, 13],
    "NO2": [22, 21, 19, 24, 23, 22, 20, 100, 21, 22],
}

df = pd.DataFrame(data)

# Calculate descriptive statistics for the PM2.5 column
mean_pm25 = df["PM2.5"].mean()
median_pm25 = df["PM2.5"].median()
# mode() returns all tied values in ascending order; [0] keeps the smallest
mode_pm25 = df["PM2.5"].mode()[0]
# std() computes the sample standard deviation (ddof=1) by default
std_pm25 = df["PM2.5"].std()

print("PM2.5 Mean:", mean_pm25)
print("PM2.5 Median:", median_pm25)
print("PM2.5 Mode:", mode_pm25)
print("PM2.5 Standard Deviation:", std_pm25)
```

Looking at the calculated statistics for the `PM2.5` pollutant, you can see how each value describes a different aspect of the data. The **mean** (25.0) gives the average concentration, but the single reading of 120 pulls it well above most of the measurements. The **median** (14.5) is much less affected by extreme values, so it represents the "typical" value more accurately when outliers are present. The **mode** highlights the most common pollution level; here 13, 14, and 15 each occur twice, and `mode()[0]` reports only the smallest of these tied values, 13. The **standard deviation** (about 33.4) indicates how much the pollution levels vary around the mean; a value this large relative to the mean suggests large fluctuations or outliers in the dataset, which could signal occasional pollution spikes or measurement errors.
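Rather than computing each statistic separately, you can get most of them in one call. The sketch below rebuilds the same illustrative readings and uses `DataFrame.describe()`, which summarizes every numeric column; the variable names `summary` and `pm25_modes` are chosen here for illustration. It also prints the full result of `mode()`, since several values can tie for most frequent.

```python
import pandas as pd

# Same illustrative readings as in the example above (micrograms per cubic meter)
df = pd.DataFrame({
    "PM2.5": [12, 15, 14, 16, 18, 120, 13, 15, 14, 13],
    "NO2": [22, 21, 19, 24, 23, 22, 20, 100, 21, 22],
})

# describe() reports count, mean, std, min, quartiles (25%, 50%, 75%), and max
# for every numeric column; the "50%" row is the median.
summary = df.describe()
print(summary)

# mode() returns every value tied for most frequent, sorted ascending,
# so inspect the full result before keeping only the first entry.
pm25_modes = df["PM2.5"].mode().tolist()
print("PM2.5 modes:", pm25_modes)
```

Note that `describe()` does not report the mode, so that statistic still needs its own call.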

```python
# Identifying outliers in PM2.5 using standard deviation
# (reuses the DataFrame `df` created in the previous example)
mean = df["PM2.5"].mean()
std = df["PM2.5"].std()

# Outliers are values more than 2 standard deviations from the mean
outliers = df[(df["PM2.5"] > mean + 2*std) | (df["PM2.5"] < mean - 2*std)]
print("Outliers in PM2.5:")
print(outliers)
```
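One caveat with the standard deviation rule above is that the mean and standard deviation are themselves inflated by the extreme readings they are meant to detect. A commonly used alternative, shown here as a sketch rather than as part of the original lesson, is the interquartile range (IQR) rule, which flags values lying more than 1.5 times the IQR beyond the quartiles. The helper name `iqr_outliers` and its parameter `k` are hypothetical, introduced only for this illustration.

```python
import pandas as pd

# Same illustrative readings as in the examples above (micrograms per cubic meter)
df = pd.DataFrame({
    "PM2.5": [12, 15, 14, 16, 18, 120, 13, 15, 14, 13],
    "NO2": [22, 21, 19, 24, 23, 22, 20, 100, 21, 22],
})


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Apply the rule to every pollutant column
for column in df.columns:
    flagged = iqr_outliers(df[column])
    print(f"{column} outliers (IQR rule):", flagged.tolist())
```

Because quartiles depend on the ordering of the data and not on the size of the extremes, the thresholds stay close to the bulk of the readings even when a spike is present.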

Descriptive Statistics for Environmental Data

1. What does the standard deviation tell you about an environmental dataset?

2. Which pandas method provides a summary of descriptive statistics for a DataFrame?

3. Fill in the blank: To find the median of a column `PM2.5` in `df`, use `df['PM2.5'].____()`.