Evaluation Metrics in Machine Learning

Dimensionality Reduction Evaluation

Dimensionality reduction is a fundamental technique in unsupervised learning, often used to simplify datasets, visualize high-dimensional data, or improve the efficiency of downstream models. However, reducing the number of features can lead to the loss of important information. This makes it essential to evaluate how well a dimensionality reduction method, such as Principal Component Analysis (PCA), preserves the original data structure. Without proper evaluation, you risk discarding valuable patterns or introducing distortions that could impact subsequent analyses or model performance.

To assess the effectiveness of dimensionality reduction, two commonly used metrics are reconstruction error and explained variance ratio. These metrics quantify, respectively, how much information is lost and how much of the original data’s variability is retained. Their mathematical definitions are:

Reconstruction Error (for PCA):

\text{Reconstruction Error} = \frac{1}{n} \sum_{i=1}^{n} \| \mathbf{x}_i - \hat{\mathbf{x}}_i \|^2

where $\mathbf{x}_i$ is the original data point, $\hat{\mathbf{x}}_i$ is the reconstructed data point after dimensionality reduction and inverse transformation, and $n$ is the number of samples.
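
As a quick sanity check, here is a minimal sketch that computes this formula directly, using scikit-learn's PCA and the Iris dataset (the same setup as the full example further below):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

X = load_iris().data

# Reduce to 2 components and map back to the original 4-dimensional space
pca = PCA(n_components=2)
X_hat = pca.inverse_transform(pca.fit_transform(X))

# Mean of the squared Euclidean distance between each point and its
# reconstruction, i.e. the formula above computed term by term
reconstruction_error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(f"Reconstruction error: {reconstruction_error:.4f}")
```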

Explained Variance Ratio (for PCA component $k$):

\text{Explained Variance Ratio}_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}

where $\lambda_k$ is the variance explained by the $k$-th principal component, and $p$ is the total number of components in the original data.
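
The $\lambda_k$ values are the eigenvalues of the data's covariance matrix, so you can verify scikit-learn's explained_variance_ratio_ by hand. A minimal sketch, again on the Iris dataset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

X = load_iris().data

# Eigenvalues of the sample covariance matrix are the lambda_k values;
# np.cov uses ddof=1, matching scikit-learn's convention
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending
manual_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X)
print(manual_ratio)
print(pca.explained_variance_ratio_)  # agrees up to floating-point error
```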

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load a sample dataset
data = load_iris()
X = data.data

# Fit PCA with 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Reconstruct the data from the reduced version
X_reconstructed = pca.inverse_transform(X_reduced)

# Compute explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
total_explained = np.sum(explained_variance_ratio)
print(f"Explained variance ratio per component: {explained_variance_ratio}")
print(f"Total explained variance (2 components): {total_explained:.4f}")

# Compute mean squared reconstruction error
# Note: this averages over every matrix entry; multiply by the number of
# features (X.shape[1]) to recover the per-sample error from the formula above
reconstruction_error = np.mean(np.square(X - X_reconstructed))
print(f"Mean squared reconstruction error: {reconstruction_error:.4f}")
```

When interpreting these metrics, a higher explained variance ratio indicates that the chosen components capture more of the original data's variability, meaning less information is lost. For instance, if two principal components explain 95% of the variance, you retain most of the data's structure in a lower-dimensional space. Likewise, a lower reconstruction error means that the reduced data, when mapped back to the original space, closely matches the original data, reflecting minimal information loss. Both metrics are useful: explained_variance_ratio helps you decide how many components to keep, while reconstruction_error quantifies the fidelity of the dimensionality reduction. In practice, aim for a balance: retain as much variance as your task requires while keeping the reconstruction error acceptably low, as in the sketch below.
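
A common way to strike that balance is to inspect the cumulative explained variance and pick the smallest number of components that meets your target. As a sketch of that workflow: scikit-learn's PCA also accepts a float n_components, which keeps just enough components to explain at least that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

X = load_iris().data

# Cumulative explained variance across all components
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)  # pick the smallest k whose cumulative value meets your target

# A float in (0, 1) tells PCA to keep enough components to
# explain at least that fraction of the variance
pca_95 = PCA(n_components=0.95).fit(X)
print(f"Components needed for 95% variance: {pca_95.n_components_}")
```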
