Setting Up a SparkSession
Every PySpark application starts with a SparkSession. It is the single entry point for reading data, running SQL, and configuring Spark behavior. Before you can work with any DataFrame or RDD, you need one.
Creating a SparkSession
    from pyspark.sql import SparkSession
    import urllib.request

    # Downloading the dataset
    urllib.request.urlretrieve(
        "https://content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("FlightsAnalysis") \
        .master("local[*]") \
        .config("spark.sql.shuffle.partitions", "4") \
        .getOrCreate()
Each configuration option:
- appName: a human-readable name shown in the Spark UI and logs;
- master("local[*]"): run locally using all CPU cores. On a real cluster this would be a cluster URL;
- config("spark.sql.shuffle.partitions", "4"): reduces the default 200 shuffle partitions to 4, which is more appropriate for small local datasets;
- getOrCreate(): returns an existing session if one is already running, or creates a new one.
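One way to confirm that these options took effect is to read them back from the running session. The sketch below is only an illustration: describe_session is a hypothetical helper name, not part of PySpark, and it assumes it receives the spark object created above.

```python
def describe_session(spark):
    """Return the settings discussed above as a plain dict.

    `spark` is assumed to be an active SparkSession. This helper is
    hypothetical: it only reads values back and changes nothing.
    """
    return {
        # Name shown in the Spark UI and logs
        "app_name": spark.sparkContext.appName,
        # Master URL, e.g. "local[*]" for a local run
        "master": spark.sparkContext.master,
        # Runtime SQL settings are returned as strings
        "shuffle_partitions": spark.conf.get("spark.sql.shuffle.partitions"),
    }
```

With the session built in this section, describe_session(spark) would be expected to report FlightsAnalysis, local[*], and the string "4". If a session was already running before the builder was called, getOrCreate() returns that one, so the name and master may be those of the earlier session.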
Loading the Flights Dataset
Once the session is created, you can load data immediately:
    # Loading the flights dataset
    flights_df = spark.read.csv(
        "flights.csv",
        header=True,
        inferSchema=True
    )

    # Printing the schema to verify column types
    flights_df.printSchema()

    # Previewing the first 5 rows
    flights_df.show(5)
inferSchema=True tells Spark to scan the file and detect column types automatically. For large files this adds a pass over the data – if performance matters, define the schema explicitly.
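As a sketch of the explicit-schema alternative: spark.read.csv accepts a schema argument, either a StructType or a DDL-formatted string. The column names and types below are assumptions for illustration only, since the actual columns of flights.csv are not listed here; replace them with what printSchema() reports. build_ddl and load_flights are hypothetical helper names.

```python
# Hypothetical columns: adjust names, types, and order to match
# the real header of flights.csv.
ASSUMED_COLUMNS = {
    "carrier": "STRING",
    "origin": "STRING",
    "dest": "STRING",
    "dep_delay": "INT",
    "distance": "DOUBLE",
}


def build_ddl(columns):
    """Turn {"name": "TYPE"} into a DDL string such as "a INT, b STRING"."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


def load_flights(spark, path="flights.csv", columns=ASSUMED_COLUMNS):
    """Read the CSV with a declared schema instead of inferSchema=True.

    `spark` is assumed to be an active SparkSession.
    """
    return spark.read.csv(path, header=True, schema=build_ddl(columns))
```

Because the schema is declared up front, Spark does not need the extra inference pass. The declared columns should follow the same order as the columns in the file.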
Stopping the Session
When you are done, release resources:
    spark.stop()
In a notebook environment you typically leave the session running across cells. In a standalone script, always stop it at the end.
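For the standalone-script case, one common way to make sure the session is stopped even when the job raises an error is a try/finally block. This is a minimal sketch rather than a required pattern; run_with_session, create_session, and job are hypothetical names.

```python
def run_with_session(create_session, job):
    """Create a session, run job(spark), and always stop the session.

    `create_session` is any zero-argument callable that returns a
    SparkSession, for example:
        lambda: SparkSession.builder.appName("FlightsAnalysis").getOrCreate()
    `job` is any callable that takes the session and does the work.
    """
    spark = create_session()
    try:
        return job(spark)
    finally:
        # Runs on success and on error, so resources are released
        spark.stop()
```

In a script, job could be a function that loads flights.csv and prints its schema; the finally clause then plays the role of the closing spark.stop() call.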