Reducing Dimensions by Maximizing Variance
PCA ranks principal components by the variance they capture, measured by their eigenvalues. Keeping the top k components preserves the most variance, since each successive component captures no more variance than the previous one and is orthogonal to all earlier components. This reduces dimensions while retaining the most informative directions in your data.
The explained variance ratio for each principal component is:
$$\text{Explained Variance Ratio}_i = \frac{\lambda_i}{\sum_j \lambda_j}$$
where $\lambda_i$ is the i-th largest eigenvalue. This ratio shows how much of the total variance in your data is captured by each principal component. The sum of all explained variance ratios is always 1, since all eigenvalues together account for the total variance in the dataset.
import numpy as np

# Using eigenvalues from previous code
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9]])
X_centered = X - np.mean(X, axis=0)

# Covariance matrix of the centered data
cov_matrix = (X_centered.T @ X_centered) / X_centered.shape[0]
values, vectors = np.linalg.eig(cov_matrix)

# Each eigenvalue divided by the total gives that component's explained variance ratio
explained_variance_ratio = values / np.sum(values)
print("Explained variance ratio:", explained_variance_ratio)
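Note that np.linalg.eig does not return eigenvalues in any guaranteed order, so it is worth sorting them in descending order before ranking components. A minimal sketch, continuing from the values and vectors computed above:

# Sort eigenvalues (and matching eigenvector columns) from largest to smallest
order = np.argsort(values)[::-1]
values_sorted = values[order]
vectors_sorted = vectors[:, order]

# Explained variance ratios now correspond to the 1st, 2nd, ... principal components
explained_variance_ratio = values_sorted / np.sum(values_sorted)
print("Sorted explained variance ratio:", explained_variance_ratio)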
Selecting the top principal components whose explained variance ratios sum to a chosen threshold, such as 95%, lets you reduce the number of dimensions while keeping most of the data's information. You keep only the directions along which the data spreads the most, which are the most informative for analysis or modeling, and discard the rest. This trade-off between dimensionality and retained information is a key advantage of PCA.
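As a rough sketch of how this threshold-based selection might look in NumPy, assuming explained_variance_ratio holds the ratios sorted in descending order as in the snippet above and 0.95 is the chosen threshold:

# Cumulative explained variance across the ranked components
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative explained variance reaches 95%
k = int(np.argmax(cumulative >= 0.95)) + 1
print("Cumulative explained variance:", cumulative)
print("Components needed for 95% variance:", k)

Projecting the centered data onto the first k eigenvectors then gives the reduced-dimension representation.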