Dimensionality Reduction with PCA

Performing PCA on a Real Dataset

Perform PCA on a real dataset using scikit-learn. Use the Iris dataset, a classic in machine learning, and follow these steps:

  • Load the data;
  • Prepare it for analysis;
  • Standardize features;
  • Apply PCA to reduce its dimensionality.

This process demonstrates how to implement dimensionality reduction in practical scenarios.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Iris dataset
data = load_iris()
X = data.data
feature_names = data.feature_names

# Standardize features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA to reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Original shape:", X.shape)
print("Transformed shape:", X_pca.shape)
# Each row in X_pca is a sample projected onto the first two principal components

The code above performs PCA on the Iris dataset by following several key steps:

1. Loading the Data

The Iris dataset is loaded using load_iris() from scikit-learn. It contains 150 samples of iris flowers, each described by four features: sepal length, sepal width, petal length, and petal width.
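
For a quick look at what you just loaded, here is a minimal sketch (an addition, not part of the lesson code, though it uses the pandas import already shown above) that wraps the feature matrix in a DataFrame with named columns:

import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()

# Wrap the feature matrix in a DataFrame for easy inspection
df = pd.DataFrame(data.data, columns=data.feature_names)
print(df.shape)   # (150, 4): 150 samples, 4 features
print(df.head())  # first five rows with feature names as headers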

2. Standardizing Features

Standardization ensures each feature has mean 0 and variance 1:

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

This step is essential because PCA is sensitive to the variance of each feature. Without standardization, features with larger scales would dominate the principal components, leading to misleading results.
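
To see this effect concretely, here is an illustrative sketch (an addition to the lesson code) that runs PCA on raw versus standardized Iris features. On the raw data, the first ratio comes out much higher, because petal length has by far the largest raw variance and dominates the first component:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data

# Without scaling: the feature with the largest raw variance dominates
pca_raw = PCA(n_components=2).fit(X)
print("Raw:   ", pca_raw.explained_variance_ratio_)

# With scaling: every feature contributes on an equal footing
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)
print("Scaled:", pca_scaled.explained_variance_ratio_)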

3. Applying PCA

PCA(n_components=2) reduces the dataset from four dimensions to two:

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

Principal components are new axes that capture the directions of maximum variance in the data. Each sample is projected onto these axes, resulting in a compact representation that retains as much information as possible.
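
As a sanity check, a short sketch (not from the lesson) shows that pca.components_ stores these axes as rows, and that the projection is just a dot product with them, matching what fit_transform returned:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Two principal components, each a direction in the original 4-D space
print(pca.components_.shape)  # (2, 4)

# transform() centers the data, then projects it onto the component axes
manual = (X_scaled - pca.mean_) @ pca.components_.T
print(np.allclose(manual, X_pca))  # True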

4. Interpreting PCA Output

You can check how much variance each principal component explains:

print(pca.explained_variance_ratio_)

This outputs an array such as [0.7296, 0.2285], meaning the first component explains about 73% of the variance and the second about 23%. Together they retain roughly 96% of the information from the original data.
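
Beyond inspecting the ratios, scikit-learn also accepts a float between 0 and 1 for n_components, keeping however many components are needed to reach that variance threshold. A sketch (an addition to the lesson code) using the same standardized data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Cumulative explained variance across all four components
pca_full = PCA().fit(X_scaled)
print(np.cumsum(pca_full.explained_variance_ratio_))

# A float in (0, 1) keeps the fewest components reaching that threshold;
# for the standardized Iris data, 0.95 is reached with two components
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)
print(X_reduced.shape)  # (150, 2)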



Section 3. Chapter 1

