Generators vs Lists for Large Data

A list stores all its elements in memory at once. A generator produces elements one at a time, on demand, so only the current element needs to be in memory. For large datasets, this can be the difference between a program that runs and one that crashes with MemoryError.

The Memory Cost of Lists

When a list comprehension runs, Python creates every element and keeps all of them in memory before the next statement executes:


import tracemalloc

tracemalloc.start()

# Building a full list of 1 million records in memory
records_list = [{"id": record_id, "value": record_id * 2.5} for record_id in range(1000000)]

snapshot = tracemalloc.take_snapshot()
total = sum(s.size for s in snapshot.statistics("lineno"))
print(f"List memory: {total / 1024 / 1024:.1f} MB")

del records_list
tracemalloc.stop()

Generators Use Near-Zero Memory

A generator expression has the same syntax as a list comprehension but with parentheses instead of brackets. It stores no elements – it yields them one at a time:


import sys
import tracemalloc

tracemalloc.start()

# Generator holds no elements – just the recipe to produce them
records_generator = ({"id": record_id, "value": record_id * 2.5} for record_id in range(1000000))

snapshot = tracemalloc.take_snapshot()
total = sum(s.size for s in snapshot.statistics("lineno"))
print(f"Generator memory: {total / 1024 / 1024:.1f} MB")  # Near zero

print(sys.getsizeof(records_generator))  # A few hundred bytes at most (varies by Python version), regardless of range size

tracemalloc.stop()

The generator object itself is tiny – it holds only the iterator state, not the data.
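The size of the generator object is only part of the picture: what matters in practice is peak memory while the data is being consumed. The following is a minimal sketch of that comparison, assuming a hypothetical helper named `peak_bytes` and a smaller range of 200,000 values so it runs quickly. Exact numbers will differ between machines and Python versions.

```python
import tracemalloc

def peak_bytes(make_iterable):
    """Sum the iterable and return (total, peak traced memory in bytes)."""
    tracemalloc.start()
    total = sum(make_iterable())
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return total, peak

# The list version materializes all 200,000 floats before sum() starts
list_total, list_peak = peak_bytes(lambda: [n * 2.5 for n in range(200_000)])

# The generator version hands sum() one float at a time
gen_total, gen_peak = peak_bytes(lambda: (n * 2.5 for n in range(200_000)))

print(f"List peak:      {list_peak / 1024:.0f} KiB")
print(f"Generator peak: {gen_peak / 1024:.0f} KiB")
```

Both versions compute the same total; only the peak memory differs.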

Processing Large Files with Generators

Generators are the standard tool for processing files that don't fit in memory:


# Reading and processing a large CSV-like dataset line by line
def parse_transactions(filename):
    with open(filename, "r") as file:
        next(file)  # Skipping the header line
        for line in file:
            parts = line.strip().split(",")
            yield {"id": parts[0], "amount": float(parts[1]), "currency": parts[2]}

# Processing without loading the full file into memory
def total_revenue(filename):
    total = 0.0
    for transaction in parse_transactions(filename):
        if transaction["currency"] == "USD":
            total += transaction["amount"]
    return total

Only one line (plus a small read buffer) is held in memory at any time, regardless of file size.
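To try the pattern without a real dataset, here is a small self-contained sketch. It repeats `parse_transactions` from above so it runs on its own; the three sample rows and the temporary file are made up for illustration.

```python
import os
import tempfile

def parse_transactions(filename):
    # Same logic as the function above, repeated so this snippet runs alone
    with open(filename, "r") as file:
        next(file)  # Skipping the header line
        for line in file:
            parts = line.strip().split(",")
            yield {"id": parts[0], "amount": float(parts[1]), "currency": parts[2]}

# Hypothetical sample data written to a temporary file
sample_rows = "id,amount,currency\n1,10.5,USD\n2,20.0,EUR\n3,4.5,USD\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as handle:
    handle.write(sample_rows)
    sample_path = handle.name

try:
    # sum() pulls one transaction at a time from the generator
    usd_total = sum(
        t["amount"] for t in parse_transactions(sample_path) if t["currency"] == "USD"
    )
finally:
    os.remove(sample_path)

print(usd_total)  # 15.0
```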

Generator Functions vs Generator Expressions


# Generator expression – inline, for simple transformations
amounts = (row["amount"] for row in parse_transactions("transactions.csv"))

# Generator function – for multi-step logic with yield
def high_value_transactions(filename, threshold):
    for transaction in parse_transactions(filename):
        if transaction["amount"] > threshold:
            yield transaction
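Generator functions and generator expressions can be chained, and no stage does any work until a value is requested. The sketch below illustrates this with a hypothetical in-memory source, `read_rows`, that logs each row it hands out. The sample data and the `pulled` log exist only for demonstration.

```python
def read_rows(rows, log):
    # Stand-in for a file reader; records each row id as it is handed out
    for row in rows:
        log.append(row["id"])
        yield row

def high_value(transactions, threshold):
    # Generator function: filter step
    for transaction in transactions:
        if transaction["amount"] > threshold:
            yield transaction

sample = [
    {"id": 1, "amount": 50.0},
    {"id": 2, "amount": 500.0},
    {"id": 3, "amount": 900.0},
]
pulled = []

# Generator expression layered on top of the generator function
pipeline = (t["amount"] for t in high_value(read_rows(sample, pulled), 100.0))

pulled_before = list(pulled)       # [] because nothing has been read yet
first_amount = next(pipeline)      # reads rows 1 and 2, stops at the first match
pulled_after_first = list(pulled)  # [1, 2]
remaining = list(pipeline)         # drains the rest of the pipeline

print(pulled_before, first_amount, pulled_after_first, remaining)
```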

When to Use Each
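One rule of thumb that follows from the sections above, offered as a sketch rather than an exhaustive guide: use a generator when the data is consumed once in a single pass, and a list when you need several passes, `len()`, or indexing. A generator can be iterated only once, which the made-up values below illustrate.

```python
values = [3, 1, 4, 1, 5]

# Generator: fine for one pass, empty on the second
squares_gen = (v * v for v in values)
first_pass = sum(squares_gen)   # 52
second_pass = sum(squares_gen)  # 0, the generator is already exhausted

# List: costs memory for every element, but can be reused freely
squares_list = [v * v for v in values]
list_total = sum(squares_list)
list_max = max(squares_list)
list_len = len(squares_list)

print(first_pass, second_pass, list_total, list_max, list_len)
```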
