Visualizing Centered vs. Uncentered Data

When you work with real-world datasets, the raw values of features can vary widely in scale and location. This can make it difficult to compare features directly or to use algorithms that assume certain data properties. Two common preprocessing steps are centering and standardization. Centering involves subtracting the mean from each feature, shifting the distribution so its average is zero. Standardization, also known as z-score normalization, goes a step further by also dividing by the standard deviation, resulting in features with a mean of zero and a standard deviation of one.
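To make the two operations concrete, here is a minimal sketch on a toy feature. The array values are hypothetical, chosen only so the arithmetic is easy to follow, and the standard deviation is the population version (NumPy's default, `ddof=0`).

```python
import numpy as np

# Hypothetical toy feature, chosen so the arithmetic is easy to check by hand
x = np.array([2.0, 4.0, 6.0, 8.0])

mean = x.mean()  # 5.0
std = x.std()    # population standard deviation (ddof=0), sqrt(5) here

# Centering: subtract the mean, so the values now average to zero
centered = x - mean

# Standardization: center, then divide by the standard deviation
standardized = (x - mean) / std

print("centered:    ", centered)
print("standardized:", standardized)
print("mean and std after standardization:",
      standardized.mean(), standardized.std())
```

The centered values keep their original spacing of 2 units, while the standardized values are expressed in units of standard deviations.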

Centering changes the location of the distribution but does not affect its spread or shape. Standardization not only centers the data but also rescales it, so that every feature ends up with the same spread. This is especially useful when features have different units or scales: it makes them directly comparable and tends to help algorithms that are sensitive to feature scale, such as distance-based methods and models trained with gradient descent.
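The effect on features with different units can be sketched with a small two-column array. The dataset below is hypothetical (heights in centimeters, incomes in dollars); the only assumption is that rows are samples and columns are features, so statistics are taken along `axis=0`.

```python
import numpy as np

# Hypothetical dataset: column 0 is height in cm, column 1 is income in dollars
X = np.array([
    [160.0, 30000.0],
    [170.0, 45000.0],
    [180.0, 60000.0],
    [190.0, 75000.0],
])

col_mean = X.mean(axis=0)  # one mean per feature
col_std = X.std(axis=0)    # one standard deviation per feature

X_centered = X - col_mean                  # broadcasting subtracts per column
X_standardized = (X - col_mean) / col_std  # per-column z-scores

# Centering moves each column to mean zero but leaves its spread unchanged
print("spread unchanged by centering:",
      np.allclose(X_centered.std(axis=0), col_std))

# After standardization both columns have the same spread
print("std per column after standardization:", X_standardized.std(axis=0))
```

After centering, income still dominates height by several orders of magnitude. Only standardization puts the two columns on a common scale.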

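Visualizing these transformations helps you understand their impact on a dataset. By plotting histograms of the original, centered, and standardized data, you can see how the distribution changes with each preprocessing step. Because both transformations are linear, the histogram keeps its shape; what changes is the position and scale of the x-axis.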


import numpy as np
import matplotlib.pyplot as plt

# Generate a sample feature with nonzero mean and arbitrary scale
np.random.seed(42)
data = 3 * np.random.randn(1000) + 10  # Drawn from a normal distribution with mean 10, std 3; sample statistics differ slightly

# Center the data (subtract mean)
centered_data = data - np.mean(data)

# Standardize the data (z-score)
standardized_data = (data - np.mean(data)) / np.std(data)

# Plot histograms
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)

axes[0].hist(data, bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('Original Data')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(centered_data, bins=30, color='orange', edgecolor='black')
axes[1].set_title('Centered Data')
axes[1].set_xlabel('Value')

axes[2].hist(standardized_data, bins=30, color='green', edgecolor='black')
axes[2].set_title('Standardized Data')
axes[2].set_xlabel('Value')

plt.tight_layout()
plt.show()
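The histograms can be complemented with a quick numeric check. The sketch below regenerates the same arrays as the plotting code (same seed and parameters) and prints the mean and standard deviation of each, so the claims about location and spread can be read off directly.

```python
import numpy as np

# Recreate the arrays from the plotting example
np.random.seed(42)
data = 3 * np.random.randn(1000) + 10
centered_data = data - np.mean(data)
standardized_data = (data - np.mean(data)) / np.std(data)

# Print summary statistics for each version of the feature
versions = [
    ("original", data),
    ("centered", centered_data),
    ("standardized", standardized_data),
]
for name, arr in versions:
    print(f"{name:>12}: mean={arr.mean():.3f}, std={arr.std():.3f}")
```

The centered data should report a mean of zero (up to floating-point error) with the same standard deviation as the original, and the standardized data a mean of zero with a standard deviation of one.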

Definition

The standard score (or z-score) expresses how many standard deviations a value is from the mean of its distribution. For a value $x$ drawn from a distribution with mean $\mu$ and standard deviation $\sigma$, the z-score is $(x - \mu) / \sigma$. This allows you to compare values from different distributions or scales on a common basis.
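As a small illustration of comparing values across distributions, the sketch below applies the formula to two exam scores. The helper `z_score` and the exam numbers are hypothetical, introduced only for this example.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Hypothetical exam results graded on two different scales
z_math = z_score(85, mu=70, sigma=10)    # (85 - 70) / 10
z_physics = z_score(65, mu=50, sigma=5)  # (65 - 50) / 5

print("math z-score:   ", z_math)
print("physics z-score:", z_physics)
```

Both raw scores sit 15 points above their class mean, but the physics score is 3 standard deviations above its mean while the math score is only 1.5, so the physics result is the more unusual one relative to its own distribution.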
