Working with Datasets

To simplify data preparation for machine learning models and enable efficient batch processing, shuffling, and data handling, PyTorch provides the TensorDataset and DataLoader utilities.

Loading and Inspecting the Dataset

We'll use a dataset (wine.csv) containing data about different kinds of wine, including their features and corresponding class labels.

First, let's load the dataset and inspect its structure to understand the features and target variable:


import pandas as pd
wine_df = pd.read_csv('https://content-media-cdn.codefinity.com/courses/1dd2b0f6-6ec0-40e6-a570-ed0ac2209666/section_2/wine.csv')
print(wine_df.head())
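
Beyond head(), a few other checks are often useful before building tensors. The sketch below is only an illustration: it uses a tiny hand-made stand-in DataFrame (demo_df, with hypothetical feature column names) so that it runs without network access. With the real data, the same calls would be made on wine_df.

```python
import pandas as pd

# Tiny stand-in for wine_df; the feature column names are made up for illustration.
demo_df = pd.DataFrame({
    "feature_a": [7.4, 7.8, 11.2, 7.9],
    "feature_b": [0.70, 0.88, 0.28, 0.60],
    "quality":   [5, 5, 6, 5],
})

print(demo_df.shape)                      # (number of rows, number of columns)
print(demo_df.dtypes)                     # data type of each column
print(demo_df.isna().sum())               # missing values per column
print(demo_df["quality"].value_counts())  # number of samples per label
```

Checking the label distribution and missing values early helps catch problems before the data is converted to tensors.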

Creating a TensorDataset

The next step is to separate the features and target, convert them into PyTorch tensors, and pass these tensors directly to TensorDataset. We convert the features to float32, the default floating-point type for the parameters of PyTorch layers, and the target to long, a 64-bit integer type that classification losses such as CrossEntropyLoss expect for class labels.

import pandas as pd
import torch
from torch.utils.data import TensorDataset
wine_df = pd.read_csv('https://content-media-cdn.codefinity.com/courses/1dd2b0f6-6ec0-40e6-a570-ed0ac2209666/section_2/wine.csv')
# Separate features and target
features = wine_df.drop(columns='quality').values
target = wine_df['quality'].values
# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)        # Target tensor
)
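
To see what a TensorDataset stores, it can help to index into one. The following is a minimal sketch on synthetic stand-in data (demo_features, demo_targets, and demo_dataset are hypothetical names with arbitrary values), not the wine data itself.

```python
import torch
from torch.utils.data import TensorDataset

# Synthetic stand-in data: 5 samples with 3 features each (values are arbitrary).
demo_features = torch.arange(15, dtype=torch.float32).reshape(5, 3)
demo_targets = torch.tensor([0, 1, 0, 2, 1], dtype=torch.long)

# All tensors passed to TensorDataset must have the same size along dimension 0.
demo_dataset = TensorDataset(demo_features, demo_targets)

print(len(demo_dataset))   # number of samples
x0, y0 = demo_dataset[0]   # indexing returns one (features, target) pair
print(x0, x0.dtype)        # feature vector of the first sample
print(y0, y0.dtype)        # label of the first sample
```

Each index returns the matching rows from every tensor, which is what lets a DataLoader later keep features and labels aligned.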

Using DataLoader for Batch Processing

To facilitate batch processing, shuffling, and efficient data loading during training, we wrap the TensorDataset in a DataLoader. This step is crucial for managing the flow of data to the model during training, especially when working with larger datasets. The DataLoader allows us to:

Batch process: split the data into smaller, manageable chunks (batches) for training, which limits memory usage and allows a gradient update after each batch;
Shuffle: randomize the order of the samples, which breaks any inherent ordering in the data and helps keep the model from learning patterns that depend on that ordering;
Efficient loading: automatically fetch samples and assemble them into batches during training, optionally using background worker processes (the num_workers parameter).


import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader
wine_df = pd.read_csv('https://content-media-cdn.codefinity.com/courses/1dd2b0f6-6ec0-40e6-a570-ed0ac2209666/section_2/wine.csv')
# Separate features and target
features = wine_df.drop(columns='quality').values
target = wine_df['quality'].values
# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)        # Target tensor
)
# Wrap the dataset in a DataLoader
wine_loader = DataLoader(
    wine_dataset,  # TensorDataset
    batch_size=32, # Number of samples per batch
    shuffle=True   # Randomize the order of the data
)

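With this setup, the DataLoader yields batches of up to 32 samples, and because shuffle=True, the sample order is reshuffled at the start of every epoch. Presenting the data in a different order each epoch is standard practice when training neural networks with mini-batch gradient descent and often helps the model generalize to unseen data.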
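
The number of batches per epoch follows directly from the dataset size and batch_size. The sketch below works this out on a synthetic stand-in dataset of 100 samples (demo_dataset and both loaders are hypothetical names). It also shows the drop_last parameter, which the text above does not use.

```python
import math

import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in: 100 samples, 4 features, 3 classes (all values arbitrary).
demo_dataset = TensorDataset(
    torch.randn(100, 4),
    torch.randint(0, 3, (100,)),
)

demo_loader = DataLoader(demo_dataset, batch_size=32, shuffle=True)
print(len(demo_loader))                   # 4 batches: 32 + 32 + 32 + 4 samples
print(math.ceil(len(demo_dataset) / 32))  # 4, the same count computed by hand

# drop_last=True discards the final, smaller batch.
full_only_loader = DataLoader(demo_dataset, batch_size=32, shuffle=True, drop_last=True)
print(len(full_only_loader))              # 3
```

Whether to keep the smaller final batch depends on the model. Keeping it uses every sample, while dropping it guarantees a constant batch size.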

Iterating Over the DataLoader

We can now iterate over the DataLoader to access batches of data. Each batch unpacks into a pair (batch_features, batch_targets), where the first dimension of both tensors is the batch size:


import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader
wine_df = pd.read_csv('https://content-media-cdn.codefinity.com/courses/1dd2b0f6-6ec0-40e6-a570-ed0ac2209666/section_2/wine.csv')
# Separate features and target
features = wine_df.drop(columns='quality').values
target = wine_df['quality'].values
# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)        # Target tensor
)
# Wrap the dataset in a DataLoader
wine_loader = DataLoader(
    wine_dataset,  # TensorDataset
    batch_size=32, # Number of samples per batch
    shuffle=True   # Randomize the order of the data
)
# Iterate through batches
for batch_idx, (batch_features, batch_targets) in enumerate(wine_loader):
    print(f"Batch {batch_idx+1}")
    print(f"Features: {batch_features}")
    print(f"Targets: {batch_targets}")
    print("-" * 30)
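
Printing full tensors gets noisy quickly, so it is often more informative to print shapes. The sketch below does this for two epochs on a synthetic stand-in dataset of 10 samples, where each label equals the sample's index so the shuffling is visible. The names demo_dataset and demo_loader and the seeded generator argument are illustrative choices, not part of the example above.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in: 10 samples whose single feature and label equal the sample index.
demo_dataset = TensorDataset(
    torch.arange(10, dtype=torch.float32).unsqueeze(1),
    torch.arange(10, dtype=torch.long),
)

# A seeded generator makes the shuffle order reproducible across runs.
demo_loader = DataLoader(
    demo_dataset,
    batch_size=4,
    shuffle=True,
    generator=torch.Generator().manual_seed(0),
)

for epoch in range(2):
    for batch_features, batch_targets in demo_loader:
        # Shapes are (batch_size, n_features) and (batch_size,); the last batch is smaller.
        print(epoch, tuple(batch_features.shape), batch_targets.tolist())
```

Every epoch visits each sample exactly once, split into batches of 4, 4, and 2 samples.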
