Every PySpark application starts with a SparkSession. It is the single entry point for reading data, running SQL, and configuring Spark behavior. Before you can work with any DataFrame or RDD, you need one.

Creating a SparkSession


from pyspark.sql import SparkSession
import urllib.request

# Downloading the dataset
urllib.request.urlretrieve(
    "https://content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("FlightsAnalysis") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

Each configuration option:

appName: a human-readable name shown in the Spark UI and logs;
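master("local[*]"): run locally with as many worker threads as the machine has logical cores. On a real cluster this would be a cluster manager URL, such as spark://host:7077 for a standalone cluster, or yarn;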
config("spark.sql.shuffle.partitions", "4"): reduces the default 200 shuffle partitions to 4, which is more appropriate for small local datasets;
getOrCreate(): returns an existing session if one is already running, or creates a new one.

Loading the Flights Dataset

Once the session is created, you can load data immediately:


# Loading the flights dataset
flights_df = spark.read.csv(
    "flights.csv",
    header=True,
    inferSchema=True
)

# Printing the schema to verify column types
flights_df.printSchema()

# Previewing the first 5 rows
flights_df.show(5)

inferSchema=True tells Spark to scan the file and detect column types automatically. For large files this adds an extra pass over the data before the actual read, so if performance matters, define the schema explicitly.

Stopping the Session

When you are done, release resources:


spark.stop()

In a notebook environment you typically leave the session running across cells. In a standalone script, always stop it at the end.

Setting Up a SparkSession