Visualizing Centered vs. Uncentered Data
When you work with real-world datasets, the raw values of features can vary widely in scale and location. This can make it difficult to compare features directly or to use algorithms that assume features are centered or share a comparable scale. Two common preprocessing steps are centering and standardization. Centering involves subtracting the mean from each feature, shifting the distribution so its average is zero. Standardization, also known as z-score normalization, goes a step further by also dividing by the standard deviation, resulting in features with a mean of zero and a standard deviation of one.
Centering changes the location of the distribution but does not affect its spread or shape. Standardization, on the other hand, not only centers the data but also rescales it, so that every feature ends up with a standard deviation of one. This is especially useful when features have different units or scales, as it makes them directly comparable and can improve the performance of many machine learning algorithms.
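In practice you rarely need to code these transformations by hand. As a minimal sketch, assuming scikit-learn is installed, its StandardScaler performs the same mean subtraction and scaling; setting with_std=False gives centering only (the small array X below is made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

# A hypothetical feature column (scikit-learn transformers expect 2D input)
X = np.array([[12.0], [9.5], [11.2], [8.8], [10.5]])

# Centering only: subtract the mean, keep the original spread
centered = StandardScaler(with_mean=True, with_std=False).fit_transform(X)

# Full standardization: subtract the mean and divide by the standard deviation
standardized = StandardScaler().fit_transform(X)

print(centered.mean())      # approximately 0
print(standardized.mean())  # approximately 0
print(standardized.std())   # approximately 1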
Visualizing these transformations helps you understand their impact on a dataset. By plotting histograms of the original, centered, and standardized data, you can see how the distribution changes with each preprocessing step.
import numpy as np
import matplotlib.pyplot as plt

# Generate a sample feature with nonzero mean and arbitrary scale
np.random.seed(42)
data = 3 * np.random.randn(1000) + 10  # Mean=10, Std=3

# Center the data (subtract mean)
centered_data = data - np.mean(data)

# Standardize the data (z-score)
standardized_data = (data - np.mean(data)) / np.std(data)

# Plot histograms
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)

axes[0].hist(data, bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('Original Data')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(centered_data, bins=30, color='orange', edgecolor='black')
axes[1].set_title('Centered Data')
axes[1].set_xlabel('Value')

axes[2].hist(standardized_data, bins=30, color='green', edgecolor='black')
axes[2].set_title('Standardized Data')
axes[2].set_xlabel('Value')

plt.tight_layout()
plt.show()
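To confirm the claims numerically, a quick check reusing the data, centered_data, and standardized_data arrays from the code above prints each version's mean and standard deviation: centering moves the mean to roughly zero without changing the spread, while standardization also brings the standard deviation to one.

# Quick numerical check using the arrays defined above
for name, values in [('original', data),
                     ('centered', centered_data),
                     ('standardized', standardized_data)]:
    print(f"{name:>12}: mean = {values.mean():.3f}, std = {values.std():.3f}")

# Expected, approximately:
#     original: mean near 10, std near 3
#     centered: mean near 0,  std near 3 (unchanged)
# standardized: mean near 0,  std = 1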
The standard score (or z-score) expresses how many standard deviations a value is from the mean of its distribution. For a value x, its z-score is calculated as (x−μ)/σ. This allows you to compare values from different distributions or scales on a common basis.
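As a short illustration of why z-scores make values comparable, consider two exam scores from tests graded on different scales; the numbers below are hypothetical and chosen only for the example:

# Hypothetical scores on two differently scaled exams
math_score, math_mean, math_std = 82, 70, 8            # exam out of 100
reading_score, reading_mean, reading_std = 45, 40, 4   # exam out of 60

# z = (x - mu) / sigma for each exam
z_math = (math_score - math_mean) / math_std               # 1.5
z_reading = (reading_score - reading_mean) / reading_std   # 1.25

# The math result lies further above its own mean, even though the raw
# scores 82 and 45 cannot be compared directly.
print(z_math, z_reading)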