Summary  
This chapter explains how to create and configure a SparkSession, load a CSV file into a DataFrame with schema inference, inspect its schema and first rows, and stop the session when you are finished.  

General domain of usage  
Big data processing and analytics

Every PySpark application starts with a `SparkSession`. It is the single entry point for reading data, running SQL, and configuring Spark behavior. Before you can work with any DataFrame or RDD, you need one.

## Creating a SparkSession



```python
from pyspark.sql import SparkSession
import urllib.request

# Downloading the dataset
urllib.request.urlretrieve(
    "https://content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("FlightsAnalysis") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()
```

Each builder call does the following:

- `appName`: a human-readable name shown in the Spark UI and logs;
- `master("local[*]")`: runs Spark locally using all available CPU cores. On a real cluster this would be the cluster manager's URL, for example `spark://host:7077` or `yarn`;
- `config("spark.sql.shuffle.partitions", "4")`: reduces the default 200 shuffle partitions to 4, which is more appropriate for small local datasets;
- `getOrCreate()`: returns an existing session if one is already running, or creates a new one.



## Loading the Flights Dataset

Once the session is created, you can load data immediately:

```python
# Loading the flights dataset
flights_df = spark.read.csv(
    "flights.csv",
    header=True,
    inferSchema=True
)

# Printing the schema to verify column types
flights_df.printSchema()

# Previewing the first 5 rows
flights_df.show(5)
```

`inferSchema=True` tells Spark to scan the file and detect column types automatically; without it, every CSV column is read as a string. For large files inference adds an extra pass over the data, so if performance matters, define the schema explicitly.



## Stopping the Session

When you are done, call `spark.stop()` to release the resources held by the session.

In a notebook environment you typically leave the session running across cells. In a standalone script, always stop it at the end.

