
VC Dimension: Measuring Hypothesis Complexity

When you train a machine learning model, you choose a set of possible functions — called a hypothesis class — from which your algorithm selects the best fit for the data. However, not all hypothesis classes are equally complex. Some can represent a wide variety of patterns, while others are more limited. This complexity plays a crucial role in how well your model generalizes to unseen data: a more complex hypothesis class can fit the training data more closely, but it also runs a higher risk of overfitting, which means performing poorly on new examples. To analyze and control this trade-off, you need a way to quantify the complexity of a hypothesis class in a mathematically rigorous way.

Definition

The VC dimension (Vapnik–Chervonenkis dimension) of a hypothesis class is the largest number of points that can be labeled in all possible ways (that is, "shattered") by hypotheses from that class. To shatter a set of points means that, for every possible assignment of labels to those points, there exists a hypothesis in the class that correctly classifies them.

Intuitive Example: Shattering with Intervals on the Real Line

Suppose your hypotheses are all intervals on the real number line, and your task is to classify points as "inside" or "outside" an interval. With one point, you can certainly label it either way — so one point can be shattered. With two points, you can choose intervals to achieve all four possible labelings. But with three points, you cannot achieve all eight possible labelings with a single interval (for example, labeling the first and third points positive and the middle one negative is impossible). Thus, three points cannot be shattered by intervals, and the VC dimension of intervals on the real line is 2.
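A quick brute-force check makes the three-point case concrete (a minimal sketch; the specific point positions 0, 0.5, 1 are arbitrary):

points = [0.0, 0.5, 1.0]
target = [True, False, True]  # label first and third positive, middle negative

# Try every interval whose endpoints are sample points; intervals are
# convex, so any interval containing 0.0 and 1.0 must also contain 0.5.
realizable = any(
    [lo <= x <= hi for x in points] == target
    for lo in points
    for hi in points
)
print(realizable)  # False: no single interval produces this labeling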

Formal Definition of VC Dimension

Formally, the VC dimension of a hypothesis class H is the largest integer d such that there exists a set of d points that can be shattered by H. If sets of arbitrary size can be shattered, the VC dimension is infinite.
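In symbols, writing h(x) ∈ {0, 1} for the label a hypothesis h assigns to a point x, this reads (a standard formulation):

$$\mathrm{VCdim}(H) = \max\left\{\, d \in \mathbb{N} \;:\; \exists\, x_1, \dots, x_d \text{ such that } \bigl|\{(h(x_1), \dots, h(x_d)) : h \in H\}\bigr| = 2^d \,\right\}$$

The condition on the right says that the d points are shattered: the hypotheses in H realize all 2^d possible label vectors on them.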

import numpy as np
import matplotlib.pyplot as plt

def can_shatter(points):
    """Return True if intervals can realize every labeling of the points."""
    n = len(points)
    # Candidate intervals: every pair of sample points as endpoints, plus
    # one interval to the left of all points, so the all-negative labeling
    # can also be realized.
    intervals = [(points[i], points[j]) for i in range(n) for j in range(i, n)]
    intervals.append((min(points) - 2.0, min(points) - 1.0))
    for labels in range(2 ** n):
        # Decode the labeling encoded in the bits of `labels`
        target = [(labels >> k) & 1 for k in range(n)]
        found = False
        for lo, hi in intervals:
            predicted = [int(lo <= x <= hi) for x in points]
            if predicted == target:
                found = True
                break
        if not found:
            return False
    return True

max_points = 5
shatterable = []
for k in range(1, max_points + 1):
    # Place k points equally spaced on [0, 1]
    points = np.linspace(0, 1, k)
    shatterable.append(1 if can_shatter(points) else 0)

plt.figure(figsize=(6, 3))
plt.bar(range(1, max_points + 1), shatterable)
plt.xlabel("Number of points")
plt.ylabel("Can be shattered (1=True, 0=False)")
plt.title("Shattering with Intervals on the Real Line")
plt.xticks(range(1, max_points + 1))
plt.ylim(-0.1, 1.1)
plt.show()
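Running the script shows bars at height 1 for one and two points and height 0 for three or more: the largest shatterable set has size 2, matching the VC dimension derived above. The extra empty interval is needed because intervals whose endpoints are both sample points always contain at least one point, so without it even the all-negative labeling would appear unreachable.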

How does increasing the VC dimension of a hypothesis class affect the generalization bound?
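The question refers to the VC generalization bound. For reference, one classical form (due to Vapnik and Chervonenkis; the exact constants vary across textbooks) states that with probability at least 1 − δ over a sample of size n, every h in a class H with VCdim(H) = d satisfies

$$R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) + \ln\frac{4}{\delta}}{n}}$$

where R(h) is the true risk and \hat{R}(h) the empirical (training) risk. The complexity term under the square root grows with d, so for a fixed sample size a larger VC dimension yields a looser bound.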

