Mean-Centering and Z-Score Standardization | Foundations of Feature Scaling
Feature Scaling and Normalization Deep Dive

Mean-Centering and Z-Score Standardization

To effectively prepare your data for machine learning, you need to understand how to transform features so that they are on comparable scales. Two of the most fundamental techniques for this are mean-centering and z-score standardization. Both are rooted in statistical concepts and have precise mathematical formulas. Let’s break down how each is derived and why they matter.

Derivation of Mean-Centering and Z-Score Standardization Formulas

Suppose you have a set of data points for a single feature, represented as a vector:

x = [x_1, x_2, \ldots, x_n]

Step 1: Compute the Mean

The first step in both mean-centering and standardization is to calculate the mean of the feature:

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i

This value, μ, represents the average value of your feature.
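As a quick sanity check, here is a minimal NumPy sketch of this step, using a small made-up feature vector (the same values as the synthetic example later in this chapter):

import numpy as np

# Hypothetical feature vector with n = 5 values
x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# Mean: (1/n) * sum of all values
mu = x.sum() / x.size   # equivalent to np.mean(x)
print(mu)               # 14.0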

Step 2: Mean-Centering

Mean-centering involves subtracting the mean from each data point. The formula for the mean-centered value x'_i is:

x'_i = x_i - \mu

This transformation shifts the entire distribution of the feature so that its mean becomes zero, but it does not change the spread or scale of the data.
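As a minimal sketch (reusing the same made-up vector), mean-centering in NumPy is a single subtraction, and you can verify that the result has mean zero while the spread stays the same:

import numpy as np

x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
mu = np.mean(x)

# Subtract the mean from every element
x_centered = x - mu
print(x_centered)                      # [-4. -2.  0.  2.  4.]
print(np.mean(x_centered))             # 0.0
print(np.std(x), np.std(x_centered))   # same spread before and after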

Step 3: Standard Deviation

For standardization, you also need the standard deviation, which measures the spread of the data:

\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}
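This is the population standard deviation (dividing by n rather than n - 1), which is also what NumPy's np.std computes by default (ddof=0). A short sketch with the same made-up data:

import numpy as np

x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
mu = np.mean(x)

# Population standard deviation: square root of the mean squared deviation
sigma = np.sqrt(np.mean((x - mu) ** 2))
print(sigma)        # 2.8284...
print(np.std(x))    # same value; np.std uses ddof=0 by default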

Step 4: Z-Score Standardization

Z-score standardization, also known as standard scaling, transforms each value by subtracting the mean and then dividing by the standard deviation:

z_i = \frac{x_i - \mu}{\sigma}

After this transformation, the feature will have a mean of zero and a standard deviation of one.
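A minimal sketch that verifies this claim on the same made-up vector:

import numpy as np

x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
mu = np.mean(x)
sigma = np.std(x)   # population standard deviation (ddof=0)

# Z-score standardization
z = (x - mu) / sigma
print(z)                      # [-1.414... -0.707...  0.  0.707...  1.414...]
print(np.mean(z), np.std(z))  # ~0.0 and 1.0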

Each step in these derivations ensures that your data is centered (mean zero) and, if standardized, also scaled (unit variance), which is crucial for many machine learning algorithms that are sensitive to feature scale.

Note

Mean-centering shifts your data so that it revolves around zero, removing any inherent bias in the feature’s location. This is especially helpful when you want to emphasize differences between data points rather than their absolute values. Z-score standardization goes further by also scaling the data so that its spread (variance) is uniform. Use mean-centering when you only need to remove the mean, such as before applying Principal Component Analysis (PCA). Use z-score standardization when your model is sensitive to the scale of the data, like in algorithms that rely on distances or gradients.

import numpy as np

# Create a synthetic dataset (single feature)
X = np.array([10, 12, 14, 16, 18], dtype=float)

# Mean-centering
mean = np.mean(X)
X_centered = X - mean

# Standard deviation
std = np.std(X)

# Z-score standardization
X_standardized = (X - mean) / std

print("Original data:", X)
print("Mean-centered data:", X_centered)
print("Standardized data:", X_standardized)
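In practice you would rarely hand-roll these transformations for every feature. As a sketch (assuming scikit-learn is installed), sklearn.preprocessing.StandardScaler performs z-score standardization column by column, and passing with_std=False restricts it to mean-centering only:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
X = np.array([[10.0, 1000.0],
              [12.0, 1500.0],
              [14.0, 2000.0],
              [16.0, 2500.0],
              [18.0, 3000.0]])

# Z-score standardization: subtract each column's mean, divide by its std
X_standardized = StandardScaler().fit_transform(X)

# Mean-centering only: shift each column to mean zero, keep its original spread
X_centered = StandardScaler(with_std=False).fit_transform(X)

print(X_standardized.mean(axis=0))   # ~[0. 0.]
print(X_standardized.std(axis=0))    # ~[1. 1.]
print(X_centered.mean(axis=0))       # ~[0. 0.]

Fitting the scaler on the training data and reusing it on new data via transform keeps both splits on the same scale.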

Which statements about mean-centering and z-score standardization are correct?

Select the correct answer
