Data Visualization & EDA

Univariate Analysis


Univariate analysis is a foundational step in exploratory data analysis (EDA), focusing on examining each variable in your dataset independently. By analyzing variables one at a time, you can uncover essential characteristics such as central tendency, spread, shape, and the presence of outliers. This process helps you understand the basic properties of your data, identify potential data quality issues, and select appropriate techniques for further analysis. Univariate analysis is crucial for building intuition about your dataset before moving on to more complex, multivariate relationships.
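As a minimal sketch of what "one variable at a time" can look like in practice, the loop below summarizes each column of a small DataFrame on its own: numeric columns with describe(), non-numeric columns with value_counts(). The DataFrame named sample and its values are made up for illustration and are not taken from the dataset used in this chapter.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data, invented for this illustration
sample = pd.DataFrame({
    "total_bill": [10.5, 23.0, 15.2, 48.3, 15.2],
    "day": ["Sat", "Sun", "Sat", "Thur", "Sat"],
})

summaries = {}
for column in sample.columns:
    series = sample[column]
    if is_numeric_dtype(series):
        # Numeric variable: count, mean, std, min, quartiles, max
        summaries[column] = series.describe()
    else:
        # Categorical variable: frequency of each category
        summaries[column] = series.value_counts()
    print(f"--- {column} ---")
    print(summaries[column])
```

Each column is inspected in isolation here; relationships between columns are deliberately left for later, multivariate steps.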

    import pandas as pd

    # Load a sample dataset
    url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
    df = pd.read_csv(url)

    # Select a single column for analysis: "total_bill"
    total_bill = df["total_bill"]
    print(total_bill.head())
    # Calculate descriptive statistics for the "total_bill" variable
    mean = total_bill.mean()
    median = total_bill.median()
    mode = total_bill.mode()[0]
    std = total_bill.std()
    min_value = total_bill.min()
    max_value = total_bill.max()

    print(f"Mean: {mean:.2f}")
    print(f"Median: {median:.2f}")
    print(f"Mode: {mode:.2f}")
    print(f"Standard Deviation: {std:.2f}")
    print(f"Min: {min_value:.2f}")
    print(f"Max: {max_value:.2f}")
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Create a histogram
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    sns.histplot(total_bill, bins=20, kde=True)
    plt.title("Histogram of Total Bill")
    plt.xlabel("Total Bill")
    plt.ylabel("Frequency")

    # Create a boxplot
    plt.subplot(1, 2, 2)
    sns.boxplot(x=total_bill)
    plt.title("Boxplot of Total Bill")
    plt.xlabel("Total Bill")

    plt.tight_layout()
    plt.show()

Interpreting Descriptive Statistics

  • Mean: shows the average value of total_bill;
  • Median: gives the middle value when all bills are sorted;
  • Mode: identifies the most frequently occurring bill amount;
  • Standard deviation: measures how spread out values are around the mean. A higher value means more variability;
  • Minimum and maximum: indicate the range of the data.

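If the mean and median are close, the distribution is likely roughly symmetric. A mean noticeably above the median suggests right skew (a long tail of large values), while a mean below the median suggests left skew.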
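The mean-versus-median comparison can be sketched in a few lines. The bill amounts below are hypothetical, chosen to include a couple of large values, and the comparison is only a rough heuristic: pandas' skew() gives a more direct (sample) measure of asymmetry.

```python
import pandas as pd

# Hypothetical bill amounts with a few large values in the right tail
bills = pd.Series([8.0, 10.0, 11.0, 12.0, 12.0, 13.0, 15.0, 18.0, 30.0, 51.0])

mean = bills.mean()
median = bills.median()
skewness = bills.skew()  # sample skewness; a positive value suggests right skew

# Rough heuristic: which side of the median does the mean fall on?
if mean > median:
    direction = "right"
elif mean < median:
    direction = "left"
else:
    direction = "none"

print(f"Mean: {mean:.2f}, Median: {median:.2f}, Skewness: {skewness:.2f}")
print(f"Suggested skew direction: {direction}")
```

Here the two large bills pull the mean above the median, which is the typical signature of a right-skewed variable.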

Understanding Visualizations

  • Histogram: displays the distribution of total_bill. Peaks indicate common values, and the overall shape (symmetric, skewed left, or skewed right) reveals how most bills are distributed;
  • Boxplot: summarizes the spread, median, and potential outliers. The box shows the interquartile range (IQR, the middle 50% of the data), the line inside the box is the median, and points beyond the "whiskers" (by default, more than 1.5 × IQR from the edges of the box) are plotted individually as potential outliers.
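The points a boxplot draws beyond its whiskers can also be found numerically. The sketch below assumes the common 1.5 × IQR rule, which is the default whisker setting in seaborn and matplotlib, and uses hypothetical values rather than the chapter's dataset.

```python
import pandas as pd

# Hypothetical values, invented for this illustration
values = pd.Series([8.0, 10.0, 11.0, 12.0, 12.0, 13.0, 15.0, 18.0, 30.0, 51.0])

# Quartiles and interquartile range (the "box" of the boxplot)
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box; points outside are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1: {q1:.2f}, Q3: {q3:.2f}, IQR: {iqr:.2f}")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print("Potential outliers:", outliers.tolist())
```

Flagged values are candidates for a closer look, not automatically errors; a large restaurant bill may be perfectly valid data.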

By combining these statistics and visualizations, you can quickly spot unusual values, skewness, and the general pattern of your variable. This understanding guides your next steps in data cleaning and analysis.


Which statement best describes univariate analysis in the context of exploratory data analysis?


Section 1. Chapter 21
