Learn Visualizing Centered vs. Uncentered Data | Foundations of Feature Scaling
Feature Scaling and Normalization Deep Dive

Visualizing Centered vs. Uncentered Data

Raw feature values in real-world datasets can vary widely in scale and location, which makes features hard to compare directly and can hurt algorithms that are sensitive to scale or that assume zero-centered inputs. Two common preprocessing steps address this: centering and standardization. Centering subtracts the feature's mean from each value, shifting the distribution so that its average is zero. Standardization, also known as z-score normalization, goes a step further and also divides by the standard deviation, so the feature ends up with a mean of zero and a standard deviation of one.
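As a minimal sketch of the two steps described above, the following applies them to a small hand-made array. The helper names `center` and `standardize` and the sample values are illustrative assumptions, not part of the lesson. `np.std` is used with its default `ddof=0`, that is, the population standard deviation.

```python
import numpy as np

def center(x):
    """Subtract the mean so the result averages to zero."""
    return x - np.mean(x)

def standardize(x):
    """Center, then divide by the (population) standard deviation."""
    return (x - np.mean(x)) / np.std(x)

# Hypothetical feature values, chosen only for illustration (mean 5, std 2)
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

centered = center(values)
standardized = standardize(values)

# Centering moves the mean to 0 and keeps the standard deviation at 2
print(np.mean(centered), np.std(centered))

# Standardizing moves the mean to 0 and rescales the standard deviation to 1
print(np.mean(standardized), np.std(standardized))
```

With these values, the centered array is `[-3, -1, -1, -1, 0, 0, 2, 4]` and the standardized array is the same array divided by 2.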

Centering changes the location of the distribution but not its spread or shape. Standardization centers the data and also rescales it to unit standard deviation; like centering, it leaves the shape of the distribution unchanged. This is especially useful when features have different units or scales, because it makes them comparable and can improve the performance of many machine learning algorithms.
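To illustrate the point about features with different units, here is a small sketch with two synthetic features. The feature names (`height_cm`, `income`), their distributions, and the seed are assumptions made up for this example. Statistics are computed per column with `axis=0`, so each feature is scaled by its own mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales (assumed for illustration)
height_cm = rng.normal(loc=170.0, scale=10.0, size=500)
income = rng.normal(loc=50_000.0, scale=15_000.0, size=500)
X = np.column_stack([height_cm, income])  # shape (500, 2)

# Column-wise statistics: one mean and one standard deviation per feature
mu = X.mean(axis=0)
sigma = X.std(axis=0)

X_centered = X - mu
X_standardized = (X - mu) / sigma

# Centering moves each mean to (approximately) zero but leaves each spread untouched
print(X_centered.mean(axis=0), X_centered.std(axis=0))

# Standardizing puts both features on the same scale (standard deviation of 1)
print(X_standardized.mean(axis=0), X_standardized.std(axis=0))
```

After centering, the income column still has a spread roughly a thousand times larger than the height column; only after standardization do the two columns have the same spread.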

Visualizing these transformations helps you understand their impact on a dataset. By plotting histograms of the original, centered, and standardized data, you can see how the distribution changes with each preprocessing step.

```python
import numpy as np
import matplotlib.pyplot as plt

# Generate a sample feature with nonzero mean and arbitrary scale
np.random.seed(42)
data = 3 * np.random.randn(1000) + 10  # Mean=10, Std=3

# Center the data (subtract mean)
centered_data = data - np.mean(data)

# Standardize the data (z-score)
standardized_data = (data - np.mean(data)) / np.std(data)

# Plot histograms
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)

axes[0].hist(data, bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('Original Data')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(centered_data, bins=30, color='orange', edgecolor='black')
axes[1].set_title('Centered Data')
axes[1].set_xlabel('Value')

axes[2].hist(standardized_data, bins=30, color='green', edgecolor='black')
axes[2].set_title('Standardized Data')
axes[2].set_xlabel('Value')

plt.tight_layout()
plt.show()
```
Definition

The standard score (or z-score) expresses how many standard deviations a value is from the mean of its distribution. For a value $x$, its z-score is calculated as $z = (x - \mu) / \sigma$. This allows you to compare values from different distributions or scales on a common basis.
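As a small worked example of comparing values from different distributions, the sketch below computes z-scores for two exam results on different grading scales. The function name `z_score` and all the numbers are hypothetical and serve only to illustrate the formula.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical exam results on two different grading scales
z_math = z_score(82.0, mu=70.0, sigma=8.0)         # (82 - 70) / 8 = 1.5
z_physics = z_score(650.0, mu=500.0, sigma=120.0)  # (650 - 500) / 120 = 1.25

print(z_math, z_physics)
```

Although 650 is the larger raw number, the math result sits further above its own mean when measured in standard deviations (1.5 versus 1.25).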

Which of the following statements is true about a histogram of standardized (z-scored) data?
