Optimization Techniques in Python
Handling Large Files
Processing large files efficiently is essential when working with datasets too big to fit in memory. Python provides tools like `open()` and `map()`, which allow you to process files lazily, saving memory and improving performance.
What Are Iterators?
Before proceeding with the `open()` function, we should first understand what an iterator is. An iterator is an object that represents a stream of data, allowing you to access one item at a time. Iterators implement two methods:

- `__iter__()`: returns the iterator object itself;
- `__next__()`: returns the next item in the stream and raises a `StopIteration` exception when no items are left.
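To make the protocol concrete, here is a minimal sketch of a custom iterator (the `Countdown` class is a hypothetical example, not part of the course code):

```python
class Countdown:
    """A hypothetical iterator that counts down from a given number to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__()
        return self

    def __next__(self):
        if self.current <= 0:
            # Signal that the stream is exhausted
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

for number in Countdown(3):
    print(number)  # 3, 2, 1
```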
Let's say we have an iterator named `iterator_object`. We can iterate over it using a usual `for` loop:
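For instance (a minimal sketch, assuming `iterator_object` is created from a small list with the built-in `iter()`):

```python
iterator_object = iter([10, 20, 30])

for item in iterator_object:
    print(item)  # 10, 20, 30
```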
In fact, under the hood, the following happens (the `next()` function internally calls the `__next__()` method of the iterator):
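Roughly, the loop above desugars to something like this sketch:

```python
iterator_object = iter([10, 20, 30])

while True:
    try:
        item = next(iterator_object)  # calls iterator_object.__next__()
    except StopIteration:
        break  # the iterator is exhausted
    print(item)
```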
Unlike standard collections, iterators are characterized by lazy evaluation, meaning they generate or fetch data only when required, rather than loading everything into memory at once. This approach makes them highly memory-efficient, particularly when working with large datasets.
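As a rough illustration of the difference, one can compare object sizes (a sketch; `sys.getsizeof()` reports only the container's own size, and exact numbers vary by Python version):

```python
import sys

numbers_list = list(range(1_000_000))  # all items stored in memory at once
numbers_iter = iter(range(1_000_000))  # items produced on demand

print(sys.getsizeof(numbers_list))  # several megabytes
print(sys.getsizeof(numbers_iter))  # a few dozen bytes
```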
File Objects as Iterators
The `open()` function returns a file object, which is an iterator. This allows you to:

- Iterate over a file line by line using a `for` loop;
- Read one line at a time into memory, making it suitable for large files (as long as individual lines fit in memory).
For example, if a log file with 1,000,000 lines includes both `INFO` and `ERROR` messages, we can still count `ERROR` occurrences by iterating through the file line by line, even if the file cannot fit entirely in memory (which would be the case if the log grew much larger).
```python
# Create a sample log file with 1,000,000 lines
log_lines = [
    f"INFO: Log entry {i}" if i % 100 != 0 else f"ERROR: Critical issue {i}"
    for i in range(1, 1000001)
]
with open("large_log.txt", "w") as log_file:
    log_file.write("\n".join(log_lines))

# Process the file line by line to count error entries
error_count = 0
with open("large_log.txt") as log_file:
    for line in log_file:
        if "ERROR" in line:
            error_count += 1

print(f"Total error entries: {error_count}")
```
Transforming File Lines with map()
As mentioned in the previous chapter, `map()` returns an iterator, applying a transformation function lazily to each line in a file. Similar to file objects, `map()` processes data one item at a time without loading everything into memory, making it an efficient option for handling large files.
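A quick sketch showing that `map()` does no work until you ask for an item (the sample addresses here are arbitrary):

```python
emails = ["John.Doe@example.com", "Jane.SMITH@domain.org"]
lowercase = map(str.lower, emails)

print(lowercase)        # <map object at 0x...>, nothing computed yet
print(next(lowercase))  # john.doe@example.com, computed on demand
```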
For example, let's create a file containing 1,000,000 email addresses, some of which include uppercase letters. Our goal is to convert all the emails to lowercase and save the normalized results in a new file (`normalized_emails.txt`). We'll use `map()` to achieve this, ensuring the script remains efficient and suitable for processing even larger files.
```python
# Create a file with mixed-case email addresses
email_lines = [
    "John.Doe@example.com",
    "Jane.SMITH@domain.org",
    "BOB.brown@anotherexample.net",
    "ALICE.williams@sample.com"
] * 250000  # Repeat to simulate a large file

with open("email_list.txt", "w") as email_file:
    email_file.write("\n".join(email_lines))

# Process the file to standardize email addresses (convert to lowercase)
with open("email_list.txt") as input_file, open("normalized_emails.txt", "w") as output_file:
    # Use map() to convert each email to lowercase
    lowercase_emails = map(str.lower, input_file)
    for email in lowercase_emails:
        output_file.write(email)

# Print the last email to verify the results
print(email)
print('Done')
```