Chapter 10

Descriptive Analysis

Learn to summarize and describe data with statistics

What is Descriptive Analysis?

Descriptive analysis answers the most fundamental question in data analytics: "What happened?"

It's the process of summarizing past events and current states using numbers and categories. Before you can understand why something happened or predict what will happen next, you first need to know what the current situation is. Descriptive analysis gives you that foundation.

🏀 Real-Life Analogy: Sports Highlights

Imagine you missed a basketball game and want to know what happened:

  • Descriptive summary: "The Lakers scored 112 points, the Warriors scored 105. LeBron had 28 points, 8 rebounds, and 7 assists. The Lakers shot 48% from the field."
  • What it does: Summarizes the game with key numbers (scores, stats, percentages)
  • Why it's useful: You understand what happened without watching every minute

Analytics parallel: "Last month, we had 5,400 website visitors, 270 purchases, and a 5% conversion rate. Average order value was $82."

Descriptive analysis is the foundation of all analytics:

  • It tells you the current state (sales, revenue, customer count)
  • It summarizes large datasets into digestible numbers
  • It's the starting point before you can analyze "why" or "what if"
  • It's used in dashboards, reports, and performance reviews

Examples of descriptive analysis questions:

  • "What was total revenue last quarter?"
  • "How many customers signed up this month?"
  • "What is the average order value?"
  • "Which product category had the most sales?"
  • "What percentage of customers are repeat buyers?"

⚡ Quick Check: Descriptive Analysis

Test your understanding. Decide whether each statement is true or false:

1. Descriptive analysis answers the question 'What happened?'

2. Descriptive analysis is the starting point for all other types of analytics.

3. Descriptive analysis explains WHY something happened.

Summary Statistics: The Basics

Summary statistics are the building blocks of descriptive analysis. They condense large datasets into single, meaningful numbers.

1. Count

What it is: How many data points (rows/values) you have

When to use: When you need to know the total number of items, customers, transactions, etc.

Example: "We had 1,243 orders last month" (count of orders)

2. Sum

What it is: The total when you add all values together

When to use: When you want the grand total (revenue, units sold, hours worked)

Example: "Total revenue was $45,600" (sum of all order amounts)

3. Average (Mean)

What it is: The sum divided by the count—the "typical" value

When to use: When you want to know the central tendency or typical value

Example: "Average order value was about $36.69" ($45,600 ÷ 1,243)

4. Minimum

What it is: The smallest value in the dataset

When to use: When you need to know the lower boundary or smallest case

Example: "Smallest order was $5.99"

5. Maximum

What it is: The largest value in the dataset

When to use: When you need to know the upper boundary or largest case

Example: "Largest order was $532.00"

Example: Applying Summary Statistics

Dataset: Daily website visitors for a week

Day Visitors
Monday 850
Tuesday 920
Wednesday 875
Thursday 910
Friday 1,020
Saturday 1,340
Sunday 1,180

Summary statistics:

  • Count: 7 days
  • Sum: 7,095 total visitors
  • Average: about 1,014 visitors per day (7,095 ÷ 7 ≈ 1,013.6)
  • Minimum: 850 visitors (Monday)
  • Maximum: 1,340 visitors (Saturday)

Insight: "We average about 1,000 visitors per day, with weekend days seeing roughly 38% more traffic than weekdays (1,260 vs. 915 visitors per day on average)."
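As a quick way to check these figures, the sketch below recomputes the five summary statistics in Python using only built-in functions. The `summarize` helper and the variable names are illustrative choices, not part of the original example.

```python
# Daily visitor counts from the table above (Monday through Sunday).
visitors = {
    "Monday": 850, "Tuesday": 920, "Wednesday": 875, "Thursday": 910,
    "Friday": 1020, "Saturday": 1340, "Sunday": 1180,
}

def summarize(values):
    """Return count, sum, mean, min, and max for a non-empty collection of numbers."""
    values = list(values)
    total = sum(values)
    return {
        "count": len(values),
        "sum": total,
        "mean": total / len(values),
        "min": min(values),
        "max": max(values),
    }

stats = summarize(visitors.values())
busiest_day = max(visitors, key=visitors.get)   # day with the highest count
quietest_day = min(visitors, key=visitors.get)  # day with the lowest count

# Compare the average weekend day with the average weekday.
weekend = [visitors[d] for d in ("Saturday", "Sunday")]
weekday = [v for d, v in visitors.items() if d not in ("Saturday", "Sunday")]
weekend_avg = sum(weekend) / len(weekend)
weekday_avg = sum(weekday) / len(weekday)
uplift = weekend_avg / weekday_avg - 1  # fraction by which weekends exceed weekdays

print(stats)
print(busiest_day, quietest_day)
print(f"Weekend uplift: {uplift:.1%}")
```

The weekday average works out to 915 visitors and the weekend average to 1,260, which is where the weekend-versus-weekday comparison in the insight comes from.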

Understanding Averages: Mean, Median, Mode

The word "average" is imprecise: it can refer to three different measures, each useful in different situations.

Mean (Arithmetic Average)

How to calculate: Add all values, divide by count

Formula: (Sum of all values) ÷ (Number of values)

When to use: When your data doesn't have extreme outliers

Example: Salaries: $40K, $45K, $50K, $55K, $60K
Mean = ($250K ÷ 5) = $50K

Median (Middle Value)

How to calculate: Sort all values, pick the middle one

Formula: The value in the middle position (or average of two middle values if even count)

When to use: When you have outliers that would skew the mean

Example: Salaries: $40K, $45K, $50K, $55K, $60K
Median = $50K (the middle value)

Mode (Most Common Value)

How to calculate: Find the value that appears most frequently

Formula: The value with the highest frequency

When to use: When you want to know what's most typical or common (especially for categorical data)

Example: Shoe sizes sold: 8, 9, 9, 9, 10, 10, 11
Mode = 9 (appears 3 times)
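If you want to check these three examples in code, one option is Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative, and the four-salary list is added here only to show the even-count case of the median.

```python
import statistics

salaries = [40_000, 45_000, 50_000, 55_000, 60_000]
shoe_sizes = [8, 9, 9, 9, 10, 10, 11]

mean_salary = statistics.mean(salaries)      # sum of values / number of values
median_salary = statistics.median(salaries)  # middle value after sorting
mode_size = statistics.mode(shoe_sizes)      # most frequent value

# With an even number of values, the median is the average of the two middle values.
even_median = statistics.median([40_000, 45_000, 50_000, 55_000])

print(mean_salary, median_salary, mode_size, even_median)
```

When several values tie for most frequent, `statistics.mode` (Python 3.8+) returns the first one it encounters; `statistics.multimode` returns all of them as a list.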

Why Mean Can Be Misleading: The Outlier Problem

Scenario: Salaries at a small company (four employees plus the CEO)

Dataset 1: Without CEO

Employee A $40,000
Employee B $45,000
Employee C $50,000
Employee D $55,000

Mean: $47,500 | Median: $47,500

Both accurately represent the typical salary.

Dataset 2: With CEO

Employee A $40,000
Employee B $45,000
Employee C $50,000
Employee D $55,000
CEO $500,000

Mean: $138,000 | Median: $50,000

Problem with mean: It says the "average" employee makes $138K, but 4 out of 5 make less than $60K! The CEO's salary pulled the mean way up.

Median is better: $50,000 represents the typical employee's salary.

Rule of thumb: Use median when you have outliers (extreme high or low values). Use mean when values are relatively similar. Use mode for categorical data or when you care about the most common value.
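The effect of the CEO's salary can be reproduced with a few lines of Python. The sketch below measures how far each statistic moves when the outlier is added; the variable names are illustrative.

```python
import statistics

staff = [40_000, 45_000, 50_000, 55_000]  # Dataset 1: without CEO
with_ceo = staff + [500_000]              # Dataset 2: with CEO

for label, data in (("Without CEO", staff), ("With CEO", with_ceo)):
    print(f"{label}: mean=${statistics.mean(data):,.0f}  "
          f"median=${statistics.median(data):,.0f}")

# How much did one extreme value move each statistic?
mean_shift = statistics.mean(with_ceo) - statistics.mean(staff)
median_shift = statistics.median(with_ceo) - statistics.median(staff)

# How many people earn less than the "average" once the CEO is included?
below_mean = sum(1 for s in with_ceo if s < statistics.mean(with_ceo))

print(f"Mean moved by ${mean_shift:,.0f}, median by ${median_shift:,.0f}")
print(f"{below_mean} of {len(with_ceo)} salaries are below the mean")
```

One extra value moves the mean by $90,500 but the median by only $2,500, which is why the median is the safer choice here.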

✍️ Fill in the Blanks: Mean, Median, Mode

Complete the definitions:

Word Bank:

mean median mode outliers middle

1. The _____ is calculated by adding all values and dividing by the count.

2. The _____ is the _____ value when data is sorted.

3. The _____ is the most frequently occurring value in a dataset.

4. Use median instead of mean when you have _____ that would skew the average.

Frequency Analysis

Frequency analysis answers the question: "How often does this happen?"

Instead of calculating a single summary number, you count how many times each value (or category) appears in your dataset.

Example 1: Product Sales by Category

Question: Which product categories sell the most?

Data: 1,000 transactions over a month

Category         Frequency (Count)   Percentage
Electronics      420                 42%
Clothing         280                 28%
Home & Garden    180                 18%
Books            80                  8%
Sports           40                  4%
Total            1,000               100%

Insight: "Electronics account for 42% of all sales, more than any other category, while Sports is our smallest category at 4%."

Example 2: Customer Visits by Day of Week

Question: When do customers visit our store?

Day Visits Percentage
Monday 85 12%
Tuesday 92 13%
Wednesday 88 13%
Thursday 95 14%
Friday 120 17%
Saturday 145 21%
Sunday 70 10%
Total 695 100%

Insight: "Saturdays are our busiest day (21% of weekly visits), while Sundays are slowest. We should staff accordingly."

Example 3: Survey Responses

Question: "How satisfied are you with our service?" (1 = Very Dissatisfied, 5 = Very Satisfied)

Rating                  Count   Percentage
1 - Very Dissatisfied   12      6%
2 - Dissatisfied        18      9%
3 - Neutral             45      22.5%
4 - Satisfied           80      40%
5 - Very Satisfied      45      22.5%
Total                   200     100%

Insight: "62.5% of customers are satisfied or very satisfied (ratings 4-5), while 15% are dissatisfied (ratings 1-2)."

Pro tip: Always include both counts and percentages in frequency tables. Counts show magnitude, percentages show proportion.
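A frequency table with both counts and percentages is straightforward to build in code. The sketch below uses `collections.Counter` from the Python standard library on the survey counts from Example 3; the `frequency_table` helper is a hypothetical name chosen for illustration.

```python
from collections import Counter

# Rating counts from the survey example, expanded into one entry per response
# so the data looks like raw survey answers.
rating_counts = {1: 12, 2: 18, 3: 45, 4: 80, 5: 45}
responses = [rating for rating, n in rating_counts.items() for _ in range(n)]

def frequency_table(values):
    """Return (value, count, percentage) rows, sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, n, 100 * n / total) for value, n in sorted(counts.items())]

table = frequency_table(responses)
for value, n, pct in table:
    print(f"{value}  {n:>3}  {pct:.1f}%")

# Share of respondents who chose 4 or 5.
satisfied_pct = sum(pct for value, _, pct in table if value >= 4)
print(f"Satisfied or very satisfied: {satisfied_pct:.1f}%")
```

Keeping the unrounded percentage in the table and rounding only when printing avoids small discrepancies when several rows are added together.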

Distribution Analysis

Distribution analysis shows how values are spread out across your dataset. It goes beyond simple averages to reveal the shape and pattern of your data.

Key concepts:

  • Range: The difference between the maximum and minimum values (shows the spread)
  • Distribution shape: Are most values clustered around the average, or spread out evenly?
  • Normal distribution: Bell curve—most values near the average, fewer at extremes
  • Skewed distribution: Values bunched on one side with a long tail on the other

Example: Test Scores Distribution

Dataset: 50 students' test scores (0-100)

Summary statistics:

  • Minimum: 52
  • Maximum: 98
  • Range: 46 points (98 - 52)
  • Mean: 78
  • Median: 79

Distribution (grouped by score range):

Score Range     Count   Percentage
90-100 (A)      8       16%
80-89 (B)       18      36%
70-79 (C)       15      30%
60-69 (D)       7       14%
Below 60 (F)    2       4%

Shape: This is roughly normal—most students scored in the 70-89 range (66%), with fewer at the extremes.

Insight: "The class performed well overall, with two-thirds earning a B or C. Only 4% failed."
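Grouping scores into ranges like this is often called binning. The sketch below shows one way to do it in Python. The chapter does not list the 50 raw scores, so `sample_scores` is a small made-up list used purely for illustration, and `grade_band` is a hypothetical helper name.

```python
def grade_band(score):
    """Map a 0-100 score to the grade bands used in the table above."""
    if score >= 90:
        return "90-100 (A)"
    if score >= 80:
        return "80-89 (B)"
    if score >= 70:
        return "70-79 (C)"
    if score >= 60:
        return "60-69 (D)"
    return "Below 60 (F)"

# Hypothetical scores for illustration only (not the chapter's dataset).
sample_scores = [95, 88, 84, 81, 77, 74, 72, 66, 58, 91]

# Count how many scores fall into each band.
band_counts = {}
for score in sample_scores:
    band = grade_band(score)
    band_counts[band] = band_counts.get(band, 0) + 1

score_range = max(sample_scores) - min(sample_scores)  # spread: max minus min

for band, count in band_counts.items():
    print(f"{band}: {count} ({100 * count / len(sample_scores):.0f}%)")
print(f"Range: {score_range} points")
```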

Normal Distribution

Shape: Bell curve—symmetric around the mean

Characteristics: Most values near the average, fewer at extremes

Example: Heights of adult men (most are 5'8"-5'10", fewer are very short or very tall)

Implication: Mean and median are similar and both represent the typical value well

Right-Skewed Distribution

Shape: Bunched on the left with a long tail to the right

Characteristics: Most values are low with a few very high outliers

Example: Income (most people earn moderate incomes, a few earn millions)

Implication: Mean > Median (outliers pull the mean higher). Use median for "typical" value.

Left-Skewed Distribution

Shape: Bunched on the right with a long tail to the left

Characteristics: Most values are high with a few very low outliers

Example: Age of retirement (most people retire 60-70, a few retire early at 40-50)

Implication: Mean < Median (outliers pull the mean lower). Use median for "typical" value.

Why distribution matters: Two datasets can have the same mean but completely different distributions. Understanding the shape helps you choose the right summary statistic and interpret what's happening.
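To make this concrete, the sketch below builds three small made-up datasets that all have a mean of 50 but different shapes, then compares mean and median to guess the direction of skew. The `skew_hint` function is a hypothetical helper and only a rough heuristic based on the Mean > Median and Mean < Median rules above, not a formal statistical test.

```python
import statistics

def skew_hint(values, tolerance=0.0):
    """Guess the skew direction by comparing the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean - median > tolerance:
        return "right-skewed"       # a high tail pulls the mean above the median
    if median - mean > tolerance:
        return "left-skewed"        # a low tail pulls the mean below the median
    return "roughly symmetric"

# Illustrative datasets (assumptions, not from the chapter); each has mean 50.
symmetric = [40, 45, 50, 55, 60]
right_skewed = [20, 25, 30, 35, 140]
left_skewed = [5, 55, 60, 62, 68]

for name, data in (("symmetric", symmetric),
                   ("right_skewed", right_skewed),
                   ("left_skewed", left_skewed)):
    print(name, statistics.mean(data), statistics.median(data), skew_hint(data))
```

On real data the mean and median are rarely exactly equal, so passing a small `tolerance` avoids labeling tiny differences as skew.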

Descriptive Statistics Explorer

Practice calculating descriptive statistics on the dataset below. Work out the count, sum, mean, median, mode, minimum, maximum, and range by hand.

Dataset: Student Test Scores

Here are test scores for 12 students:

72 85 68 92 78 85 90 76 82 88 70 85
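One way to see how each statistic is calculated for these 12 scores is to write the steps out in Python. In this sketch the median and mode are computed by hand rather than with a library, so the logic is visible; the function names are illustrative.

```python
from collections import Counter

scores = [72, 85, 68, 92, 78, 85, 90, 76, 82, 88, 70, 85]

def median(values):
    """Sort the values, then take the middle one (or average the two middle ones)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def mode(values):
    """Return the most frequent value (the first one found if there is a tie)."""
    return Counter(values).most_common(1)[0][0]

summary = {
    "count": len(scores),
    "sum": sum(scores),
    "mean": sum(scores) / len(scores),
    "median": median(scores),
    "mode": mode(scores),
    "min": min(scores),
    "max": max(scores),
    "range": max(scores) - min(scores),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

The mean (about 80.9) is slightly below the median (83.5), which hints that a few lower scores are pulling the mean down.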

Aggregation and Grouping

Aggregation combines multiple rows into a single summary value. Grouping lets you create separate summaries for different categories.

1. Simple Aggregation (No Grouping)

Question: "What was total sales?"

Process: Sum all sales values across all rows

Result: "Total sales: $125,000"

2. Single-Level Grouping

Question: "What was total sales by region?"

Process: Group rows by region, then sum sales for each group

Result: "North: $45K, South: $38K, East: $25K, West: $17K"

3. Multi-Level Grouping

Question: "What was total sales by region and product?"

Process: Group by region, then by product within each region, then sum

Result: "North-ProductA: $20K, North-ProductB: $25K, South-ProductA: $18K..."

Example: Sales Aggregation and Grouping

Raw data (simplified):

Region Product Month Sales
North Widget Jan $5,000
North Widget Feb $6,200
North Gadget Jan $3,800
South Widget Jan $4,500
South Gadget Jan $2,900
... (many more rows)

Aggregated by Region (Total Sales):

Region Total Sales
North $45,000
South $38,000
East $25,000
West $17,000

Aggregated by Region and Product:

Region Product Total Sales
North Widget $28,000
North Gadget $17,000
South Widget $22,000
South Gadget $16,000
... (continued for East and West)

Insight: "Widgets outsell Gadgets in every region, with the North region being our top market for both products."
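The three levels of aggregation can be sketched in plain Python. The example below uses only the five raw rows shown above, so its totals are smaller than the ones in the aggregated tables, which include the many rows that are not listed. The `total_by` helper is a hypothetical name for illustration.

```python
from collections import defaultdict

# The five raw rows shown above: (region, product, month, sales in dollars).
rows = [
    ("North", "Widget", "Jan", 5000),
    ("North", "Widget", "Feb", 6200),
    ("North", "Gadget", "Jan", 3800),
    ("South", "Widget", "Jan", 4500),
    ("South", "Gadget", "Jan", 2900),
]

def total_by(rows, key):
    """Group rows by key(row) and sum the sales column within each group."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# 1. Simple aggregation: one number for the whole dataset.
overall = sum(row[3] for row in rows)

# 2. Single-level grouping: one total per region.
by_region = total_by(rows, key=lambda r: r[0])

# 3. Multi-level grouping: one total per (region, product) pair.
by_region_product = total_by(rows, key=lambda r: (r[0], r[1]))

print(overall)
print(by_region)
print(by_region_product)
```

The only thing that changes between the three levels is the grouping key: no key at all, a single column, or a tuple of columns.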

Practice: Calculate Descriptive Statistics

Test your understanding with these exercises. Try to calculate the answers manually first, then check your work.

Exercise 1: Daily Sales

A coffee shop's sales for one week:

Data: $420, $380, $410, $450, $490, $620, $580

Calculate:

  • a) Total sales for the week
  • b) Average daily sales
  • c) Minimum and maximum
  • d) Range

Exercise 2: Finding the Median

Customer ages: 22, 35, 28, 42, 31, 29, 38

Calculate:

  • a) The median age
  • b) The mean age
  • c) Are they similar or different? Why?

Exercise 3: Mean vs Median with Outliers

House prices in a neighborhood: $200K, $220K, $195K, $210K, $1,500K

Calculate:

  • a) The mean price
  • b) The median price
  • c) Which better represents the "typical" house price? Why?

Exercise 4: Frequency Analysis

T-shirt sizes sold today: S, M, M, L, M, XL, M, L, M, S, M, L, M, M

Calculate:

  • a) Create a frequency table
  • b) What is the mode (most common size)?
  • c) What percentage of sales were Medium?

Exercise 5: Grouping and Aggregation

Sales by salesperson and region:

Salesperson Region Sales
Alice North $12,000
Bob North $15,000
Alice South $9,000
Bob South $11,000
Carol North $13,000

Calculate:

  • a) Total sales by region
  • b) Total sales by salesperson
  • c) Overall total sales

Exercise 6: Choosing the Right Statistic

For each scenario, identify which summary statistic is most appropriate:

  • a) Finding the typical price in a neighborhood with one mansion and many modest homes
  • b) Determining the most popular product color
  • c) Calculating total revenue for the quarter
  • d) Finding the average when all values are relatively similar

Common Mistakes in Descriptive Analysis

Even simple descriptive analysis can go wrong. Here are the most common mistakes and how to avoid them.

❌ Mistake 1: Using Mean When Median is Better

Problem: Using average (mean) when you have extreme outliers

Example: "Average salary is $200K" when most employees make $50K but the CEO makes $2M

Fix: Use median for data with outliers. Check the distribution before choosing a statistic.

❌ Mistake 2: Ignoring Outliers

Problem: Not acknowledging extreme values that might skew your results or indicate errors

Example: A customer "age" of 250 is clearly a data entry error, but it pulls the average age way up

Fix: Always check min/max values. Investigate outliers—are they errors, or legitimate extreme cases?
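A simple range check catches this kind of error before it distorts an average. In the sketch below, the list of ages is made up for illustration, and the plausible bounds of 0 to 120 are an assumption you would adjust for your own data.

```python
def flag_out_of_range(values, low, high):
    """Return the values that fall outside the plausible range [low, high]."""
    return [v for v in values if not (low <= v <= high)]

# Hypothetical customer ages; 250 stands in for a data entry error.
ages = [34, 27, 45, 250, 31, 52]

suspect = flag_out_of_range(ages, low=0, high=120)  # bounds are an assumption
plausible = [a for a in ages if a not in suspect]

mean_all = sum(ages) / len(ages)
mean_plausible = sum(plausible) / len(plausible)

print(f"Min: {min(ages)}, Max: {max(ages)}")
print(f"Values to investigate: {suspect}")
print(f"Mean with suspect values: {mean_all:.1f}")
print(f"Mean without them: {mean_plausible:.1f}")
```

Flagged values are listed for investigation rather than silently dropped, because an extreme value can also be a legitimate case.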

❌ Mistake 3: Not Providing Context

Problem: Reporting numbers without context that helps interpret them

Example: "We had 500 website visitors" — Is that good or bad? Compared to what?

Fix: Provide comparisons: "500 visitors, up 20% from last month" or "500 visitors, below our goal of 750"
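Both kinds of comparison are one-line calculations. The sketch below shows them in Python; the helper names are illustrative, and last month's figure of 417 visitors is a hypothetical value chosen so that the change rounds to 20%.

```python
def percent_change(current, previous):
    """Relative change from previous to current, as a percentage."""
    if previous == 0:
        raise ValueError("previous must be non-zero to compute a percent change")
    return 100 * (current - previous) / previous

def percent_of_goal(current, goal):
    """How much of the goal has been reached, as a percentage."""
    return 100 * current / goal

visitors_now = 500
visitors_last_month = 417  # hypothetical value, not from the chapter
visitors_goal = 750

change = percent_change(visitors_now, visitors_last_month)
attainment = percent_of_goal(visitors_now, visitors_goal)
shortfall = visitors_goal - visitors_now

print(f"{visitors_now} visitors, {change:+.0f}% vs. last month")
print(f"{attainment:.0f}% of goal, {shortfall} visitors short")
```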

❌ Mistake 4: Comparing Incompatible Statistics

Problem: Comparing statistics that aren't directly comparable

Example: "Product A sold 500 units and Product B made $10,000" — You can't compare units to dollars

Fix: Use the same metric: "Product A made $8,000, Product B made $10,000" OR "Product A sold 500 units, Product B sold 400 units"

Key Takeaways

  • Descriptive analysis answers "What happened?" by summarizing past events
  • Five key summary statistics: Count, Sum, Mean, Min, Max
  • Three types of averages: Mean (arithmetic average), Median (middle value), Mode (most common)
  • Use median when you have outliers that would skew the mean
  • Frequency analysis counts how often each value/category appears
  • Distribution shows the spread: Normal (bell curve) vs. skewed (long tail)
  • Aggregation and grouping: Summarize by categories (region, product, time)
  • Always provide context: Numbers are more meaningful with comparisons

📝 Knowledge Check

1. Descriptive analysis primarily answers which question?

2. Dataset: 10, 12, 15, 18, 20. What is the mean?

3. When should you use median instead of mean?

4. Dataset: 5, 8, 8, 8, 12, 15. What is the mode?

5. What does frequency analysis tell you?

6. Dataset: Min = 20, Max = 100. What is the range?

7. What is aggregation in descriptive analysis?

8. Why is providing context important in descriptive analysis?

9. Dataset: 3, 7, 9, 12, 15. What is the median?

10. What is the main difference between single-level and multi-level grouping?