Visualizing Centered vs. Uncentered Data
When you work with real-world datasets, the raw values of features can vary widely in scale and location. This can make it difficult to compare features directly or to use algorithms that assume features are centered or share a comparable scale. Two common preprocessing steps are centering and standardization. Centering involves subtracting the mean from each feature, shifting the distribution so its average is zero. Standardization, also known as z-score normalization, goes a step further by also dividing by the standard deviation, resulting in features with a mean of zero and a standard deviation of one.
Centering changes the location of the distribution but does not affect its spread or shape. Standardization, on the other hand, not only centers the data but also rescales it, so that every feature ends up with a standard deviation of one. This is especially useful when features have different units or scales, as it makes them directly comparable and can improve the performance of many machine learning algorithms.
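In practice you rarely need to code these transformations by hand. As a minimal sketch, assuming scikit-learn is installed, its StandardScaler performs the same mean subtraction and scaling; setting with_std=False gives centering only (the small array X below is made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

# A hypothetical feature column (scikit-learn transformers expect 2D input)
X = np.array([[12.0], [9.5], [11.2], [8.8], [10.5]])

# Centering only: subtract the mean, keep the original spread
centered = StandardScaler(with_mean=True, with_std=False).fit_transform(X)

# Full standardization: subtract the mean and divide by the standard deviation
standardized = StandardScaler().fit_transform(X)

print(centered.mean())      # approximately 0
print(standardized.mean())  # approximately 0
print(standardized.std())   # approximately 1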
Visualizing these transformations helps you understand their impact on a dataset. By plotting histograms of the original, centered, and standardized data, you can see how the distribution changes with each preprocessing step.
import numpy as np
import matplotlib.pyplot as plt

# Generate a sample feature with nonzero mean and arbitrary scale
np.random.seed(42)
data = 3 * np.random.randn(1000) + 10  # Mean=10, Std=3

# Center the data (subtract mean)
centered_data = data - np.mean(data)

# Standardize the data (z-score)
standardized_data = (data - np.mean(data)) / np.std(data)

# Plot histograms
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)

axes[0].hist(data, bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('Original Data')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(centered_data, bins=30, color='orange', edgecolor='black')
axes[1].set_title('Centered Data')
axes[1].set_xlabel('Value')

axes[2].hist(standardized_data, bins=30, color='green', edgecolor='black')
axes[2].set_title('Standardized Data')
axes[2].set_xlabel('Value')

plt.tight_layout()
plt.show()
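To confirm the claims numerically, a quick check reusing the data, centered_data, and standardized_data arrays from the code above prints each version's mean and standard deviation: centering moves the mean to roughly zero without changing the spread, while standardization also brings the standard deviation to one.

# Quick numerical check using the arrays defined above
for name, values in [('original', data),
                     ('centered', centered_data),
                     ('standardized', standardized_data)]:
    print(f"{name:>12}: mean = {values.mean():.3f}, std = {values.std():.3f}")

# Expected, approximately:
#     original: mean near 10, std near 3
#     centered: mean near 0,  std near 3 (unchanged)
# standardized: mean near 0,  std = 1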
The standard score (or z-score) expresses how many standard deviations a value is from the mean of its distribution. For a value x, its z-score is calculated as (x−μ)/σ. This allows you to compare values from different distributions or scales on a common basis.
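As a short illustration of why z-scores make values comparable, consider two exam scores from tests graded on different scales; the numbers below are hypothetical and chosen only for the example:

# Hypothetical scores on two differently scaled exams
math_score, math_mean, math_std = 82, 70, 8            # exam out of 100
reading_score, reading_mean, reading_std = 45, 40, 4   # exam out of 60

# z = (x - mu) / sigma for each exam
z_math = (math_score - math_mean) / math_std               # 1.5
z_reading = (reading_score - reading_mean) / reading_std   # 1.25

# The math result lies further above its own mean, even though the raw
# scores 82 and 45 cannot be compared directly.
print(z_math, z_reading)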