Dimensionality Reduction Evaluation
Dimensionality reduction is a fundamental technique in unsupervised learning, often used to simplify datasets, visualize high-dimensional data, or improve the efficiency of downstream models. However, reducing the number of features can lead to the loss of important information. This makes it essential to evaluate how well a dimensionality reduction method, such as Principal Component Analysis (PCA), preserves the original data structure. Without proper evaluation, you risk discarding valuable patterns or introducing distortions that could impact subsequent analyses or model performance.
To assess the effectiveness of dimensionality reduction, two commonly used metrics are reconstruction error and explained variance ratio. These metrics quantify, respectively, how much information is lost and how much of the original data’s variability is retained. Their mathematical definitions are:
Reconstruction Error (for PCA):
$$\text{Reconstruction Error} = \frac{1}{n}\sum_{i=1}^{n}\lVert x_i - \hat{x}_i\rVert^2$$

where $x_i$ is the original data point, $\hat{x}_i$ is the reconstructed data point after dimensionality reduction and inverse transformation, and $n$ is the number of samples.
Explained Variance Ratio (for PCA component k):
$$\text{Explained Variance Ratio}_k = \frac{\lambda_k}{\sum_{j=1}^{p}\lambda_j}$$

where $\lambda_k$ is the variance explained by the $k$-th principal component, and $p$ is the total number of components in the original data.
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load a sample dataset
data = load_iris()
X = data.data

# Fit PCA with 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Reconstruct the data from the reduced representation
X_reconstructed = pca.inverse_transform(X_reduced)

# Compute explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
total_explained = np.sum(explained_variance_ratio)
print(f"Explained variance ratio per component: {explained_variance_ratio}")
print(f"Total explained variance (2 components): {total_explained:.4f}")

# Compute the mean squared reconstruction error per sample,
# matching the formula above: (1/n) * sum_i ||x_i - x_hat_i||^2
reconstruction_error = np.mean(np.sum(np.square(X - X_reconstructed), axis=1))
print(f"Mean squared reconstruction error: {reconstruction_error:.4f}")
```
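As a sanity check tying the code back to the formula, `explained_variance_ratio_` is simply each component's variance $\lambda_k$ divided by the total variance of the data. A minimal sketch, assuming the same Iris setup as above (the manual computation here is for illustration, not something you need in practice):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

X = load_iris().data
pca = PCA(n_components=2).fit(X)

# Total variance of the original data; with ddof=1 this matches sklearn's
# n-1 normalization and equals the denominator sum_j lambda_j in the formula
total_variance = X.var(axis=0, ddof=1).sum()

# Each component's variance lambda_k over the total variance should
# reproduce pca.explained_variance_ratio_ up to floating-point error
manual_ratio = pca.explained_variance_ / total_variance
print(manual_ratio)
print(pca.explained_variance_ratio_)
```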
When interpreting these metrics, a higher explained variance ratio indicates that the chosen components capture more of the original data's variability, meaning less information is lost. For instance, if two principal components explain 95% of the variance, you retain most of the data's structure in a lower-dimensional space. Similarly, a lower reconstruction error means that the reduced data, when mapped back to the original space, closely matches the original data, reflecting minimal information loss. Both metrics are crucial: `explained_variance_ratio_` helps you decide how many components to keep, while the reconstruction error quantifies the fidelity of the dimensionality reduction. In practical terms, you should aim for a balance: retain as much variance as your task requires while keeping the reconstruction error acceptably low.
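A convenient shortcut for this balancing act: scikit-learn's `PCA` accepts a float between 0 and 1 as `n_components`, in which case it keeps the smallest number of components whose cumulative explained variance exceeds that fraction. A short sketch (the 0.95 threshold is an arbitrary example, not a universal rule):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

X = load_iris().data

# Keep the fewest components explaining at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(f"Components kept: {pca.n_components_}")

# Equivalent manual check via the cumulative explained variance curve
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)
print(f"Cumulative explained variance: {cumulative}")
```

Plotting this cumulative curve is also a common way to spot an "elbow" when deciding how many components to keep.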