Working with Datasets

To simplify data preparation for machine learning models and enable efficient batch processing, shuffling, and data handling, PyTorch provides the TensorDataset and DataLoader utilities.

Loading and Inspecting the Dataset

We'll use a dataset (wine.csv) containing data about different kinds of wine, including their features and corresponding class labels.

First, let's load the dataset and inspect its structure to understand the features and target variable:

import pandas as pd
wine_df = pd.read_csv('https://content-media-cdn.codefinity.com/courses/1dd2b0f6-6ec0-40e6-a570-ed0ac2209666/section_2/wine.csv')
print(wine_df.head())
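
Beyond head(), a few more pandas calls help when inspecting the structure. The sketch below uses a small made-up DataFrame rather than the real file: only the quality column name comes from this lesson's code, while the other column names and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for wine.csv: only 'quality' is taken from the lesson;
# the other column names and all values are made up for illustration.
toy_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [0, 1, 1, 2],
})

print(toy_df.shape)                      # (rows, columns)
print(toy_df.dtypes)                     # data type of each column
print(toy_df.isna().sum())               # missing values per column
print(toy_df["quality"].value_counts())  # number of samples per class label
```

Running the same calls on wine_df should show how many features there are, whether any values are missing, and how balanced the classes are.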

Creating a TensorDataset

The next step is to separate the features from the target, convert both into PyTorch tensors, and pass these tensors directly to TensorDataset. The features are stored as float32, the default floating-point type for model parameters in PyTorch, and the target as long (a 64-bit integer type), which classification losses such as CrossEntropyLoss expect for class labels.

import pandas as pd
import torch
from torch.utils.data import TensorDataset
wine_df = pd.read_csv('https://content-media-cdn.codefinity.com/courses/1dd2b0f6-6ec0-40e6-a570-ed0ac2209666/section_2/wine.csv')
# Separate features and target
features = wine_df.drop(columns='quality').values
target = wine_df['quality'].values
# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)  # Target tensor
)
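
To see what a TensorDataset actually holds, it can help to index into one. The following is a minimal sketch on made-up values (the array contents and the toy_ names are assumptions, not the wine data): it checks the length, pulls out one sample, and confirms the dtypes requested above.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Made-up values standing in for the wine features and labels
toy_features = np.array([[9.4, 3.51], [9.8, 3.20], [10.5, 3.26]])  # NumPy floats are float64
toy_target = np.array([0, 1, 1])

toy_dataset = TensorDataset(
    torch.tensor(toy_features, dtype=torch.float32),
    torch.tensor(toy_target, dtype=torch.long),
)

print(len(toy_dataset))                # number of samples
sample_x, sample_y = toy_dataset[0]    # indexing returns one (features, target) pair
print(sample_x.dtype, sample_y.dtype)  # dtypes after conversion
print(sample_x.shape, sample_y.shape)  # one feature vector, one scalar label
```

Without the explicit dtype argument, torch.tensor would keep NumPy's float64 for the features, which does not match the float32 weights that PyTorch layers use by default.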

Using DataLoader for Batch Processing

To facilitate batch processing, shuffling, and efficient data loading during training, we wrap the TensorDataset in a DataLoader. This step is crucial for managing the flow of data to the model during training, especially when working with larger datasets. The DataLoader allows us to:

  1. Batch process: split the data into smaller chunks (batches), so that only one batch has to be processed at a time and the model's parameters can be updated after each batch;

  2. Shuffle: randomize the sample order at the start of every epoch, so that the model cannot pick up spurious patterns from the order in which the data is stored;

  3. Efficient loading: fetch samples and collate them into batch tensors automatically, optionally in background worker processes (the num_workers parameter).

import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader
wine_df = pd.read_csv('https://content-media-cdn.codefinity.com/courses/1dd2b0f6-6ec0-40e6-a570-ed0ac2209666/section_2/wine.csv')
# Separate features and target
features = wine_df.drop(columns='quality').values
target = wine_df['quality'].values
# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)  # Target tensor
)
# Wrap the dataset in a DataLoader
wine_loader = DataLoader(
    wine_dataset,  # TensorDataset
    batch_size=32,  # Number of samples per batch
    shuffle=True  # Randomize the order of the data
)
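
The batch size also determines how many batches one pass over the data produces. As a rough sketch on 100 made-up samples (the sample count, the feature count, and the toy_ names are assumptions, not properties of wine.csv), the loader yields ceil(100 / 32) batches, and the last one is smaller unless drop_last=True is set:

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

# 100 made-up samples with 4 features and 3 classes (not the wine data)
toy_dataset = TensorDataset(torch.randn(100, 4), torch.randint(0, 3, (100,)))

toy_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True)
expected_batches = math.ceil(len(toy_dataset) / 32)
batch_sizes = [len(batch_targets) for _, batch_targets in toy_loader]
print(len(toy_loader), expected_batches)  # number of batches per epoch
print(batch_sizes)                        # the final batch holds the leftover samples

# drop_last=True discards the incomplete final batch
full_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True, drop_last=True)
print(len(full_loader))
```

Dropping the last batch is a design choice: it keeps every batch the same size at the cost of skipping a few samples in each epoch.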

With this setup, the DataLoader ensures that the model receives batches of data efficiently and in random order. This is especially important for training neural networks, as it helps the model generalize better to unseen data.
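
Random order can make runs hard to compare. One possible way to keep shuffling reproducible is to pass a seeded torch.Generator to the DataLoader. The sketch below uses a made-up ten-sample dataset, and first_epoch_order is a hypothetical helper written only for this illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Ten made-up samples whose target equals their index, so the order is easy to read
toy_dataset = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))

def first_epoch_order(seed):
    """Return the order in which samples are visited during one shuffled epoch."""
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, generator=generator)
    return torch.cat([targets for _, targets in loader]).tolist()

order_a = first_epoch_order(0)
order_b = first_epoch_order(0)
print(order_a == order_b)                  # same seed gives the same order
print(sorted(order_a) == list(range(10)))  # every sample appears exactly once
```

Shuffling changes only the order: each sample is still visited once per epoch.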

Iterating Over the DataLoader

We can now iterate over the DataLoader to access batches of data. Each batch unpacks into a pair (batch_features, batch_targets), where each tensor holds up to batch_size samples along its first dimension:

import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader
wine_df = pd.read_csv('https://content-media-cdn.codefinity.com/courses/1dd2b0f6-6ec0-40e6-a570-ed0ac2209666/section_2/wine.csv')
# Separate features and target
features = wine_df.drop(columns='quality').values
target = wine_df['quality'].values
# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)  # Target tensor
)
# Wrap the dataset in a DataLoader
wine_loader = DataLoader(
    wine_dataset,  # TensorDataset
    batch_size=32,  # Number of samples per batch
    shuffle=True  # Randomize the order of the data
)
# Iterate through batches
for batch_idx, (batch_features, batch_targets) in enumerate(wine_loader):
    print(f"Batch {batch_idx+1}")
    print(f"Features: {batch_features}")
    print(f"Targets: {batch_targets}")
    print("-" * 30)
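
In practice, the same loop usually feeds a model instead of printing tensors. What follows is a minimal sketch of that pattern, not this course's training code: the data is random, and the linear model, the learning rate, and the number of epochs are all assumptions chosen for illustration.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Made-up data: 64 samples, 4 features, 3 classes (not the wine data)
toy_dataset = TensorDataset(torch.randn(64, 4), torch.randint(0, 3, (64,)))
toy_loader = DataLoader(toy_dataset, batch_size=16, shuffle=True)

model = nn.Linear(4, 3)            # hypothetical stand-in model
criterion = nn.CrossEntropyLoss()  # expects long (int64) class indices as targets
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(3):
    running_loss = 0.0
    for batch_features, batch_targets in toy_loader:
        optimizer.zero_grad()                                   # clear old gradients
        loss = criterion(model(batch_features), batch_targets)  # forward pass
        loss.backward()                                         # compute gradients
        optimizer.step()                                        # one update per batch
        running_loss += loss.item() * len(batch_targets)
    epoch_losses.append(running_loss / len(toy_dataset))
    print(f"Epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
```

Each pass of the outer loop is one epoch. Because shuffle=True, the loader presents the samples in a new order every time it is iterated.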

Section 2. Chapter 5

Spørg AI

expand
ChatGPT

Spørg om hvad som helst eller prøv et af de foreslåede spørgsmål for at starte vores chat

some-alt