Clustering Environmental Data | Modeling and Predicting Environmental Phenomena

Clustering is a powerful technique in environmental science for discovering hidden patterns and natural groupings in complex datasets. By grouping similar data points together, clustering helps you make sense of large volumes of environmental data, such as identifying areas with similar pollution levels or grouping monitoring stations with comparable air quality readings. This can lead to better decision-making, targeted interventions, and a deeper understanding of environmental phenomena. For instance, you might use clustering to group river sampling locations based on measured levels of contaminants, or to segment regions by climate characteristics.


import pandas as pd
from sklearn.cluster import KMeans

# Sample environmental data: monitoring stations and their average PM2.5 and NO2 levels
data = {
    "Station": ["A", "B", "C", "D", "E", "F"],
    "PM2.5": [12, 35, 14, 40, 13, 38],
    "NO2": [22, 55, 25, 60, 20, 58]
}
df = pd.DataFrame(data)

# Prepare features for clustering (excluding the Station name)
X = df[["PM2.5", "NO2"]]

# Create and fit a k-means model with 2 clusters
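# n_init is set explicitly so results do not depend on the installed scikit-learn version's default
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)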
df["Cluster"] = kmeans.fit_predict(X)

print(df)

After fitting the k-means model, each monitoring station is assigned to a cluster based on its pollution measurements. To interpret the assignments, look at which stations are grouped together and what their pollution levels have in common. In this sample, stations B, D, and F (PM2.5 of 35 to 40, NO2 of 55 to 60) fall into one cluster, while stations A, C, and E (PM2.5 of 12 to 14, NO2 of 20 to 25) fall into the other. The cluster numbers themselves (0 and 1) are arbitrary labels, so neither number inherently means "high" or "low" pollution. To make the groupings more intuitive, you can plot the stations on a scatter plot and color each point by its assigned cluster. This makes the separation between groups, and any outliers, easy to see.


import matplotlib.pyplot as plt

# Scatter plot of PM2.5 vs NO2, colored by cluster
plt.figure(figsize=(6, 4))
# Iterate in sorted order so the legend entries appear in a consistent order
for cluster in sorted(df["Cluster"].unique()):
    cluster_data = df[df["Cluster"] == cluster]
    plt.scatter(cluster_data["PM2.5"], cluster_data["NO2"], label=f"Cluster {cluster}")

plt.xlabel("PM2.5")
plt.ylabel("NO2")
plt.title("Monitoring Stations Clustered by Pollution Levels")
plt.legend()
plt.show()
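
Beyond the plot, the cluster assignments can also be summarized numerically. The following is a minimal sketch, not part of the original lesson, that rebuilds the same sample data so it runs on its own, then computes the mean pollutant levels per cluster and lists the stations in each. The variable names `features`, `cluster_means`, `stations_by_cluster`, and `high_cluster` are illustrative choices, and `n_init=10` is an assumption made to keep behavior the same across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Same sample data as in the lesson's first example
df = pd.DataFrame({
    "Station": ["A", "B", "C", "D", "E", "F"],
    "PM2.5": [12, 35, 14, 40, 13, 38],
    "NO2": [22, 55, 25, 60, 20, 58],
})
features = ["PM2.5", "NO2"]

# Fit k-means with 2 clusters and store each station's cluster label
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["Cluster"] = kmeans.fit_predict(df[features])

# Mean pollutant levels within each cluster
cluster_means = df.groupby("Cluster")[features].mean()

# Station names belonging to each cluster
stations_by_cluster = df.groupby("Cluster")["Station"].apply(list)

# Cluster numbers are arbitrary, so identify the more polluted
# cluster by comparing mean PM2.5 rather than by its label
high_cluster = cluster_means["PM2.5"].idxmax()

print(cluster_means)
print(stations_by_cluster)
print(f"Higher-pollution cluster label: {high_cluster}")
```

Comparing cluster means instead of relying on the label value keeps the interpretation stable even if a different `random_state` swaps which group is numbered 0 and which is numbered 1.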

1. What is the goal of clustering in environmental data analysis?

2. Which scikit-learn class is used for k-means clustering?

3. Fill in the blank: To fit a k-means model, use kmeans.____(X).

Section 3. Chapter 5
