Univariate Analysis

Univariate analysis is a foundational step in exploratory data analysis (EDA), focusing on examining each variable in your dataset independently. By analyzing variables one at a time, you can uncover essential characteristics such as central tendency, spread, shape, and the presence of outliers. This process helps you understand the basic properties of your data, identify potential data quality issues, and select appropriate techniques for further analysis. Univariate analysis is crucial for building intuition about your dataset before moving on to more complex, multivariate relationships.


import pandas as pd

# Load a sample dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
df = pd.read_csv(url)

# Select a single column for analysis: "total_bill"
total_bill = df["total_bill"]
print(total_bill.head())


# Calculate descriptive statistics for the "total_bill" variable
mean = total_bill.mean()
median = total_bill.median()
# mode() returns a Series of all modes in sorted order; take the first one by position
mode = total_bill.mode().iloc[0]
std = total_bill.std()
min_value = total_bill.min()
max_value = total_bill.max()

print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Mode: {mode:.2f}")
print(f"Standard Deviation: {std:.2f}")
print(f"Min: {min_value:.2f}")
print(f"Max: {max_value:.2f}")


import matplotlib.pyplot as plt
import seaborn as sns

# Create a histogram
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(total_bill, bins=20, kde=True)
plt.title("Histogram of Total Bill")
plt.xlabel("Total Bill")
plt.ylabel("Frequency")

# Create a boxplot
plt.subplot(1, 2, 2)
sns.boxplot(x=total_bill)
plt.title("Boxplot of Total Bill")
plt.xlabel("Total Bill")

plt.tight_layout()
plt.show()

Interpreting Descriptive Statistics

Mean: shows the average value of total_bill;
Median: gives the middle value when all bills are sorted;
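Mode: identifies the most frequently occurring bill amount. For a continuous variable such as total_bill, exact repeats may be rare, so the mode can be less informative than the mean or median;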
Standard deviation: measures how spread out values are around the mean. A higher value means more variability;
Minimum and maximum: indicate the range of the data.

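If the mean and median are close, the distribution is likely roughly symmetric. A mean noticeably above the median suggests right skew (a long tail of large bills), while a mean below the median suggests left skew.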
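The mean-versus-median comparison can be checked numerically. The following is a minimal sketch, assuming a small made-up Series named sample_bills (hypothetical values, not taken from tips.csv) so that it runs without a download. It also uses pandas' Series.skew() as one possible numeric summary of asymmetry.

```python
import pandas as pd

# Hypothetical bill amounts, made up for illustration (not from tips.csv).
# One large value is included on purpose to create a right tail.
sample_bills = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 20.0, 48.0])

mean = sample_bills.mean()
median = sample_bills.median()
# Sample skewness: positive values suggest a longer right tail,
# negative values a longer left tail.
skewness = sample_bills.skew()

print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Mean - Median: {mean - median:.2f}")
print(f"Skewness: {skewness:.2f}")
```

Here the single large bill pulls the mean above the median, and the skewness is positive. The same three lines could be applied to total_bill from the lesson to see which way that variable leans.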

Understanding Visualizations

Histogram: displays the distribution of total_bill. Peaks indicate common values, and the overall shape (symmetric, skewed left, or skewed right) reveals how most bills are distributed;
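Boxplot: summarizes the spread, median, and potential outliers. The box shows the interquartile range (IQR, the middle 50% of the data), and the line inside the box is the median. By default, the whiskers extend to the most extreme values within 1.5 × IQR of the box, and points beyond the whiskers are drawn individually as potential outliers.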
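The whisker rule behind the boxplot can be reproduced by hand. This is a sketch under stated assumptions: sample_bills is a made-up Series (hypothetical values, not from tips.csv), and the 1.5 multiplier mirrors the default whisker setting rather than any universal definition of an outlier.

```python
import pandas as pd

# Hypothetical bill amounts, made up for illustration (not from tips.csv)
sample_bills = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 20.0, 48.0])

# Quartiles and interquartile range (the height of the box)
q1 = sample_bills.quantile(0.25)
q3 = sample_bills.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box, matching the default whisker rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the points a boxplot would draw individually
outliers = sample_bills[(sample_bills < lower_fence) | (sample_bills > upper_fence)]

print(f"Q1: {q1:.2f}, Q3: {q3:.2f}, IQR: {iqr:.2f}")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print("Potential outliers:", outliers.tolist())
```

A value flagged this way is only a candidate for review. Whether it is an error or a genuine large bill depends on the data and the question being asked.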

By combining these statistics and visualizations, you can quickly spot unusual values, skewness, and the general pattern of your variable. This understanding guides your next steps in data cleaning and analysis.
