Visualizing Centered vs. Uncentered Data | Foundations of Feature Scaling
Feature Scaling and Normalization in Python

Visualizing Centered vs. Uncentered Data


When you work with real-world datasets, the raw values of features can vary widely in scale and location. This can make it difficult to compare features directly or to use algorithms that assume certain data properties. Two common preprocessing steps are centering and standardization. Centering involves subtracting the mean from each feature, shifting the distribution so its average is zero. Standardization, also known as z-score normalization, goes a step further by also dividing by the standard deviation, resulting in features with a mean of zero and a standard deviation of one.
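As a small worked example of both steps, the sketch below applies them to a four-value array chosen so the arithmetic is easy to follow by hand. The values and variable names are illustrative and not part of the lesson. It assumes the population standard deviation, which is NumPy's default (`ddof=0`).

```python
import numpy as np

# Hypothetical toy feature, chosen so the arithmetic is easy to check
x = np.array([2.0, 4.0, 6.0, 8.0])

mean = x.mean()  # (2 + 4 + 6 + 8) / 4 = 5.0
std = x.std()    # population std (ddof=0): sqrt((9 + 1 + 1 + 9) / 4) = sqrt(5)

# Centering: subtract the mean, giving [-3, -1, 1, 3]
centered = x - mean

# Standardization: centered values divided by the standard deviation
standardized = (x - mean) / std

print("centered:", centered)
print("standardized:", np.round(standardized, 3))
print("mean after standardizing:", standardized.mean())
print("std after standardizing:", standardized.std())
```

The centered values keep their original spacing, while the standardized values are the same numbers expressed in units of the standard deviation.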

Centering changes the location of the distribution but does not affect its spread or shape. Standardization also centers the data, then rescales it so that every feature has a standard deviation of one; the shape of the distribution is still unchanged, because both steps are linear transformations. This is especially useful when features have different units or scales: it makes them comparable and can improve algorithms that are sensitive to feature scale, such as distance-based methods (k-nearest neighbors, k-means) and models trained with gradient descent.
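To illustrate the comparability point, here is a minimal sketch with two synthetic features on very different scales. The feature names (`height_cm`, `income_usd`), the helper functions, and the distribution parameters are hypothetical, picked only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features with different units and scales
height_cm = rng.normal(loc=170, scale=10, size=1000)
income_usd = rng.normal(loc=50_000, scale=15_000, size=1000)


def center(x):
    """Subtract the mean; the spread is left unchanged."""
    return x - x.mean()


def standardize(x):
    """Subtract the mean and divide by the population standard deviation."""
    return (x - x.mean()) / x.std()


for name, feature in [("height_cm", height_cm), ("income_usd", income_usd)]:
    c = center(feature)
    z = standardize(feature)
    print(
        f"{name:>10}: raw std={feature.std():.2f}, "
        f"centered std={c.std():.2f}, standardized std={z.std():.2f}"
    )
```

After centering, the two standard deviations still differ by orders of magnitude. After standardization both equal one, so the features can be compared on a common axis.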

Visualizing these transformations helps you understand their impact on a dataset. By plotting histograms of the original, centered, and standardized data, you can see how the distribution changes with each preprocessing step.

```python
import numpy as np
import matplotlib.pyplot as plt

# Generate a sample feature with nonzero mean and arbitrary scale
np.random.seed(42)
data = 3 * np.random.randn(1000) + 10  # Mean ≈ 10, Std ≈ 3

# Center the data (subtract mean)
centered_data = data - np.mean(data)

# Standardize the data (z-score)
standardized_data = (data - np.mean(data)) / np.std(data)

# Plot histograms
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)

axes[0].hist(data, bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('Original Data')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(centered_data, bins=30, color='orange', edgecolor='black')
axes[1].set_title('Centered Data')
axes[1].set_xlabel('Value')

axes[2].hist(standardized_data, bins=30, color='green', edgecolor='black')
axes[2].set_title('Standardized Data')
axes[2].set_xlabel('Value')

plt.tight_layout()
plt.show()
```
Definition

The standard score (or z-score) expresses how many standard deviations a value is from the mean of its distribution. For a value $x$, the z-score is $z = (x - \mu) / \sigma$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution. This lets you compare values from different distributions or scales on a common basis.
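A short sketch of such a comparison, using made-up exam statistics. The `z_score` helper, the means, the standard deviations, and the scores are all assumptions for illustration only.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    return (x - mu) / sigma


# Hypothetical exams graded on different scales
math_z = z_score(x=80, mu=70, sigma=5)       # (80 - 70) / 5 = 2.0
essay_z = z_score(x=650, mu=600, sigma=100)  # (650 - 600) / 100 = 0.5

print(f"math z-score:  {math_z}")
print(f"essay z-score: {essay_z}")
```

Although 650 is the larger raw number, the math score sits further above its own mean when both are measured in standard deviations.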


Which of the following statements is true about a histogram of standardized (z-scored) data?

Select all correct answers.
