Exploring Data with Descriptive Statistics | Data Collection and Cleaning for Journalists


Descriptive statistics are essential tools for journalists who want to summarize and communicate key aspects of their datasets. When you work with a collection of articles and their attributes, such as word counts or publication dates, statistics like the mean, median, and mode help you quickly understand the central tendency of your data. The mean is the arithmetic average, the median is the middle value once the data is sorted, and the mode is the most frequently occurring value. These measures matter in reporting because they let you describe trends and compare sources, and a gap between the mean and the median can flag outliers or a shift in reporting style.


import pandas as pd

# Example dataset: word counts of recent articles
data = {
    "title": [
        "City Council Approves Budget",
        "Local School Wins Award",
        "Mayor Launches New Initiative",
        "Community Garden Flourishes",
        "Sports Team Advances to Finals"
    ],
    "word_count": [850, 400, 1200, 650, 950]
}

df = pd.DataFrame(data)

# Calculate descriptive statistics
mean_word_count = df["word_count"].mean()
median_word_count = df["word_count"].median()
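# Every word count in this sample is unique, so mode() returns all five values
# in ascending order and [0] simply picks the smallest (400), not a true "most common" value.
mode_word_count = df["word_count"].mode()[0]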

print("Mean word count:", mean_word_count)
print("Median word count:", median_word_count)
print("Mode word count:", mode_word_count)

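By applying these calculations to a dataset of article word counts, you can quickly summarize the typical length of articles, which may reflect editorial standards or audience preferences. In the sample above, the mean is 810 words and the median is 850. When the mean and median are close, the distribution is roughly symmetric and no handful of very long or very short pieces is pulling the average. A mean well above the median points to a few unusually long articles, and a mean well below it points to a few unusually short ones. The mode is only informative when values actually repeat. If it differs significantly from the mean and median, it might indicate a common template or repeated format, such as brief updates or long-form features. Journalists can use these insights to compare reporting styles across outlets or time periods, inform editorial decisions, or highlight notable changes in coverage.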
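To make the comparison between mean, median, and mode concrete, here is a minimal sketch using made-up word counts (the numbers are hypothetical, not taken from any real outlet): several short briefs plus a single long-form feature. It illustrates how one long article pulls the mean upward while the median and mode stay near the typical brief.

```python
import pandas as pd

# Hypothetical word counts: five short briefs and one long-form feature
word_counts = pd.Series([300, 300, 320, 350, 400, 4000])

mean_wc = word_counts.mean()         # pulled upward by the 4000-word feature
median_wc = word_counts.median()     # middle of the sorted values
modes = word_counts.mode().tolist()  # every value tied for most frequent

print("Mean:", mean_wc)      # 945.0
print("Median:", median_wc)  # 335.0
print("Mode(s):", modes)     # [300]
```

Here the mean is nearly three times the median, so reporting "the average article runs about 945 words" would misdescribe an outlet whose typical piece is closer to 335 words.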


import matplotlib.pyplot as plt

# Visualize the distribution of article word counts
plt.hist(df["word_count"], bins=5, edgecolor="black")
plt.title("Distribution of Article Word Counts")
plt.xlabel("Word Count")
plt.ylabel("Number of Articles")
plt.show()
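It can help to look at the numbers behind a histogram before interpreting its shape. The sketch below assumes NumPy is available and uses `numpy.histogram` to compute the counts for the same five sample word counts with five equal-width bins, the same default binning scheme `plt.hist` uses for an integer `bins` argument.

```python
import numpy as np

# The same sample word counts used in the lesson's DataFrame
word_counts = [850, 400, 1200, 650, 950]

# Split the range 400..1200 into five equal-width bins and count articles in each
counts, edges = np.histogram(word_counts, bins=5)

print("Bin edges:", edges.tolist())  # [400.0, 560.0, 720.0, 880.0, 1040.0, 1200.0]
print("Counts:", counts.tolist())    # [1, 1, 1, 1, 1]
```

With only five articles and five bins, each bar has a height of one, so the chart shows a flat line rather than a meaningful shape. A histogram becomes informative once the dataset has many more articles than bins.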

1. What does the mean value represent in a dataset?

2. Why might a journalist want to visualize the distribution of article lengths?

3. Fill in the blank: To plot a histogram in matplotlib, use _ _ _.

Section 1. Chapter 6