Univariate Analysis
Univariate analysis is a foundational step in exploratory data analysis (EDA), focusing on examining each variable in your dataset independently. By analyzing variables one at a time, you can uncover essential characteristics such as central tendency, spread, shape, and the presence of outliers. This process helps you understand the basic properties of your data, identify potential data quality issues, and select appropriate techniques for further analysis. Univariate analysis is crucial for building intuition about your dataset before moving on to more complex, multivariate relationships.
    import pandas as pd

    # Load a sample dataset
    url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
    df = pd.read_csv(url)

    # Select a single column for analysis: "total_bill"
    total_bill = df["total_bill"]
    print(total_bill.head())
    # Calculate descriptive statistics for the "total_bill" variable
    mean = total_bill.mean()
    median = total_bill.median()
    mode = total_bill.mode()[0]  # mode() can return several values; take the first
    std = total_bill.std()
    min_value = total_bill.min()
    max_value = total_bill.max()

    print(f"Mean: {mean:.2f}")
    print(f"Median: {median:.2f}")
    print(f"Mode: {mode:.2f}")
    print(f"Standard Deviation: {std:.2f}")
    print(f"Min: {min_value:.2f}")
    print(f"Max: {max_value:.2f}")
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Create a histogram
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    sns.histplot(total_bill, bins=20, kde=True)
    plt.title("Histogram of Total Bill")
    plt.xlabel("Total Bill")
    plt.ylabel("Frequency")

    # Create a boxplot
    plt.subplot(1, 2, 2)
    sns.boxplot(x=total_bill)
    plt.title("Boxplot of Total Bill")
    plt.xlabel("Total Bill")

    plt.tight_layout()
    plt.show()
Interpreting Descriptive Statistics
- Mean: shows the average value of total_bill;
- Median: gives the middle value when all bills are sorted;
- Mode: identifies the most frequently occurring bill amount;
- Standard deviation: measures how spread out values are around the mean. A higher value means more variability;
- Minimum and maximum: indicate the range of the data.
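If the mean and median are close, the distribution is likely roughly symmetric. If the mean is noticeably larger than the median, the data is probably right-skewed (a long tail of large bills); if it is noticeably smaller, the data is probably left-skewed.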
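The mean-versus-median comparison can be sketched in a few lines. This is a minimal illustration, not a definitive test: the values in sample_bills are invented for the example (they are not taken from tips.csv), and the "mean above median suggests right skew" reading is a rule of thumb. The pandas skew() method is included as a second, numeric indicator.

```python
import pandas as pd

# Hypothetical bill amounts, invented for illustration (not from tips.csv)
sample_bills = pd.Series([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 48.0])

mean = sample_bills.mean()
median = sample_bills.median()
# Sample skewness: positive values suggest a longer right tail
skewness = sample_bills.skew()

# Rule of thumb: compare the mean with the median
if mean > median:
    direction = "right-skewed"
elif mean < median:
    direction = "left-skewed"
else:
    direction = "roughly symmetric"

print(f"Mean: {mean:.2f}, Median: {median:.2f}, Skewness: {skewness:.2f}")
print(f"Mean vs. median suggests the data is {direction}")
```

Here the single large bill pulls the mean above the median, which is the pattern to look for in total_bill as well.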
Understanding Visualizations
- Histogram: displays the distribution of total_bill. Peaks indicate common values, and the overall shape (symmetric, skewed left, or skewed right) reveals how most bills are distributed;
- Boxplot: summarizes the spread, median, and potential outliers. The box shows the interquartile range (IQR, the middle 50% of the data) and the line inside the box is the median. By default, the "whiskers" extend to the most extreme values within 1.5 times the IQR of the box, and points beyond them are drawn individually as potential outliers.
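To make the boxplot's notion of an outlier concrete, the sketch below computes the quartiles and flags points by hand. It assumes the common 1.5 × IQR rule (the default whisker setting in seaborn and matplotlib) and uses invented values in sample_bills rather than the real tips.csv data; lower_fence and upper_fence are names chosen for this example.

```python
import pandas as pd

# Hypothetical bill amounts, invented for illustration (not from tips.csv)
sample_bills = pd.Series([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 48.0])

# Quartiles and interquartile range (the "box" in a boxplot)
q1 = sample_bills.quantile(0.25)
q3 = sample_bills.quantile(0.75)
iqr = q3 - q1

# Assumed rule: points more than 1.5 * IQR beyond the box are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (sample_bills < lower_fence) | (sample_bills > upper_fence)
outliers = sample_bills[is_outlier]

print(f"Q1: {q1:.2f}, Q3: {q3:.2f}, IQR: {iqr:.2f}")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Flagged values: {outliers.tolist()}")
```

A flagged value is only a candidate for closer inspection; whether it is an error or a genuine large bill depends on the data.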
By combining these statistics and visualizations, you can quickly spot unusual values, skewness, and the general pattern of your variable. This understanding guides your next steps in data cleaning and analysis.