Descriptive Analysis
Learn to summarize and describe data with statistics
What is Descriptive Analysis?
Descriptive analysis answers the most fundamental question in data analytics: "What happened?"
It's the process of summarizing past events and current states using numbers and categories. Before you can understand why something happened or predict what will happen next, you first need to know what the current situation is. Descriptive analysis gives you that foundation.
🏀 Real-Life Analogy: Sports Highlights
Imagine you missed a basketball game and want to know what happened:
- Descriptive summary: "The Lakers scored 112 points, the Warriors scored 105. LeBron had 28 points, 8 rebounds, and 7 assists. The Lakers shot 48% from the field."
- What it does: Summarizes the game with key numbers (scores, stats, percentages)
- Why it's useful: You understand what happened without watching every minute
Analytics parallel: "Last month, we had 5,400 website visitors, 270 purchases, and a 5% conversion rate. Average order value was $82."
Descriptive analysis is the foundation of all analytics:
- It tells you the current state (sales, revenue, customer count)
- It summarizes large datasets into digestible numbers
- It's the starting point before you can analyze "why" or "what if"
- It's used in dashboards, reports, and performance reviews
Examples of descriptive analysis questions:
- "What was total revenue last quarter?"
- "How many customers signed up this month?"
- "What is the average order value?"
- "Which product category had the most sales?"
- "What percentage of customers are repeat buyers?"
⚡ Quick Check: Descriptive Analysis
Test your understanding by deciding whether each statement is true or false:
1. Descriptive analysis answers the question 'What happened?'
2. Descriptive analysis is the starting point for all other types of analytics.
3. Descriptive analysis explains WHY something happened.
Summary Statistics: The Basics
Summary statistics are the building blocks of descriptive analysis. They condense large datasets into single, meaningful numbers.
1. Count
What it is: How many data points (rows/values) you have
When to use: When you need to know the total number of items, customers, transactions, etc.
Example: "We had 1,243 orders last month" (count of orders)
2. Sum
What it is: The total when you add all values together
When to use: When you want the grand total (revenue, units sold, hours worked)
Example: "Total revenue was $45,600" (sum of all order amounts)
3. Average (Mean)
What it is: The sum divided by the count—the "typical" value
When to use: When you want to know the central tendency or typical value
Example: "Average order value was about $36.69" ($45,600 ÷ 1,243)
4. Minimum
What it is: The smallest value in the dataset
When to use: When you need to know the lower boundary or smallest case
Example: "Smallest order was $5.99"
5. Maximum
What it is: The largest value in the dataset
When to use: When you need to know the upper boundary or largest case
Example: "Largest order was $532.00"
Example: Applying Summary Statistics
Dataset: Daily website visitors for a week
| Day | Visitors |
|---|---|
| Monday | 850 |
| Tuesday | 920 |
| Wednesday | 875 |
| Thursday | 910 |
| Friday | 1,020 |
| Saturday | 1,340 |
| Sunday | 1,180 |
Summary statistics:
- Count: 7 days
- Sum: 7,095 total visitors
- Average: 1,014 visitors per day (7,095 ÷ 7)
- Minimum: 850 visitors (Monday)
- Maximum: 1,340 visitors (Saturday)
Insight: "We average about 1,000 visitors per day, with weekend days averaging roughly 38% more traffic than weekdays (1,260 vs. 915)."
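As a minimal sketch, the same five statistics can be reproduced in Python with built-in functions only. The dictionary name `visitors` and the other variable names are illustrative choices, not part of the lesson.

```python
# Daily visitors from the table above.
visitors = {
    "Monday": 850, "Tuesday": 920, "Wednesday": 875, "Thursday": 910,
    "Friday": 1020, "Saturday": 1340, "Sunday": 1180,
}

values = list(visitors.values())
count = len(values)                               # number of data points
total = sum(values)                               # grand total
mean = total / count                              # sum divided by count
lowest_day = min(visitors, key=visitors.get)      # day with the fewest visitors
highest_day = max(visitors, key=visitors.get)     # day with the most visitors

# Compare the average weekend day with the average weekday.
weekend = [v for day, v in visitors.items() if day in ("Saturday", "Sunday")]
weekday = [v for day, v in visitors.items() if day not in ("Saturday", "Sunday")]
weekend_avg = sum(weekend) / len(weekend)
weekday_avg = sum(weekday) / len(weekday)
lift_pct = (weekend_avg / weekday_avg - 1) * 100

print(f"Count: {count}, Sum: {total:,}, Mean: {mean:,.0f}")
print(f"Min: {visitors[lowest_day]} ({lowest_day}), Max: {visitors[highest_day]:,} ({highest_day})")
print(f"Weekend days average {lift_pct:.1f}% more visitors than weekdays")
```

Keeping the day names alongside the values makes it possible to report which day produced the minimum and maximum, not just the numbers themselves.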
Understanding Averages: Mean, Median, Mode
The word "average" is actually imprecise—there are three types, each useful in different situations.
Mean (Arithmetic Average)
How to calculate: Add all values, divide by count
Formula: (Sum of all values) ÷ (Number of values)
When to use: When your data doesn't have extreme outliers
Example: Salaries: $40K, $45K, $50K, $55K, $60K
Mean = $250K ÷ 5 = $50K
Median (Middle Value)
How to calculate: Sort all values, pick the middle one
Formula: The value in the middle position (or average of two middle values if even count)
When to use: When you have outliers that would skew the mean
Example: Salaries: $40K, $45K, $50K, $55K, $60K
Median = $50K (the middle value)
Mode (Most Common Value)
How to calculate: Find the value that appears most frequently
Formula: The value with the highest frequency
When to use: When you want to know what's most typical or common (especially for categorical data)
Example: Shoe sizes sold: 8, 9, 9, 9, 10, 10, 11
Mode = 9 (appears 3 times)
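The three averages can be checked with Python's standard `statistics` module. This is a small sketch using the salary and shoe-size examples above; the variable names are illustrative.

```python
import statistics

salaries = [40_000, 45_000, 50_000, 55_000, 60_000]
shoe_sizes = [8, 9, 9, 9, 10, 10, 11]

mean_salary = statistics.mean(salaries)        # add all values, divide by count
median_salary = statistics.median(salaries)    # middle value of the sorted data
mode_size = statistics.mode(shoe_sizes)        # most frequent value

# With an even number of values, the median is the average of the two middle ones.
even_median = statistics.median([40_000, 45_000, 50_000, 55_000])

print(mean_salary, median_salary, mode_size, even_median)
```

`statistics.median` sorts the data internally, so the input does not need to be pre-sorted.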
Why Mean Can Be Misleading: The Outlier Problem
Scenario: Salaries at a small company (4 employees plus a CEO)
Dataset 1: Without CEO
| Employee | Salary |
|---|---|
| Employee A | $40,000 |
| Employee B | $45,000 |
| Employee C | $50,000 |
| Employee D | $55,000 |
Mean: $47,500 | Median: $47,500
Both accurately represent the typical salary.
Dataset 2: With CEO
| Employee | Salary |
|---|---|
| Employee A | $40,000 |
| Employee B | $45,000 |
| Employee C | $50,000 |
| Employee D | $55,000 |
| CEO | $500,000 |
Mean: $138,000 | Median: $50,000
Problem with mean: It says the "average" employee makes $138K, but 4 out of 5 make less than $60K! The CEO's salary pulled the mean way up.
Median is better: $50,000 represents the typical employee's salary.
Rule of thumb: Use median when you have outliers (extreme high or low values). Use mean when values are relatively similar. Use mode for categorical data or when you care about the most common value.
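The outlier effect is easy to reproduce. The sketch below reuses the two salary datasets above and counts how many people earn less than the mean once the CEO is included; the variable names are illustrative.

```python
import statistics

staff = [40_000, 45_000, 50_000, 55_000]
with_ceo = staff + [500_000]

mean_staff = statistics.mean(staff)
median_staff = statistics.median(staff)
mean_with_ceo = statistics.mean(with_ceo)
median_with_ceo = statistics.median(with_ceo)

# How many salaries fall below the "average" once the outlier is included?
below_mean = sum(salary < mean_with_ceo for salary in with_ceo)

print(f"Without CEO: mean ${mean_staff:,.0f}, median ${median_staff:,.0f}")
print(f"With CEO:    mean ${mean_with_ceo:,.0f}, median ${median_with_ceo:,.0f}")
print(f"{below_mean} of {len(with_ceo)} salaries are below the mean")
```

One extreme value moves the mean by more than $90,000, while the median shifts by only $2,500.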
✍️ Fill in the Blanks: Mean, Median, Mode
Complete the definitions:
Word Bank: mean, median, mode, outliers, middle
1. The ____ is calculated by adding all values and dividing by the count.
2. The ____ is the ____ value when data is sorted.
3. The ____ is the most frequently occurring value in a dataset.
4. Use median instead of mean when you have ____ that would skew the average.
Frequency Analysis
Frequency analysis answers the question: "How often does this happen?"
Instead of calculating a single summary number, you count how many times each value (or category) appears in your dataset.
Example 1: Product Sales by Category
Question: Which product categories sell the most?
Data: 1,000 transactions over a month
| Category | Frequency (Count) | Percentage |
|---|---|---|
| Electronics | 420 | 42% |
| Clothing | 280 | 28% |
| Home & Garden | 180 | 18% |
| Books | 80 | 8% |
| Sports | 40 | 4% |
| Total | 1,000 | 100% |
Insight: "Electronics account for 42% of all transactions, more than any other category, while Sports is our smallest category at 4%."
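A frequency table like the one above can be built with `collections.Counter`. In this sketch the transaction list is reconstructed from the table's counts purely for illustration; a real transaction log would arrive as one category label per row.

```python
from collections import Counter

# Assumption: rebuild one label per transaction from the counts in the table.
transactions = (
    ["Electronics"] * 420
    + ["Clothing"] * 280
    + ["Home & Garden"] * 180
    + ["Books"] * 80
    + ["Sports"] * 40
)

counts = Counter(transactions)      # category -> number of occurrences
total = sum(counts.values())

# One (category, count, percentage) row per category, most frequent first.
table = [(category, n, 100 * n / total) for category, n in counts.most_common()]

for category, n, pct in table:
    print(f"{category:<15}{n:>6}{pct:>7.0f}%")
print(f"{'Total':<15}{total:>6}{100:>7}%")
```

`most_common()` returns the categories sorted by count, so the top row is also the mode of the data.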
Example 2: Customer Visits by Day of Week
Question: When do customers visit our store?
| Day | Visits | Percentage |
|---|---|---|
| Monday | 85 | 12% |
| Tuesday | 92 | 13% |
| Wednesday | 88 | 13% |
| Thursday | 95 | 14% |
| Friday | 120 | 17% |
| Saturday | 145 | 21% |
| Sunday | 70 | 10% |
| Total | 695 | 100% |
Insight: "Saturdays are our busiest day (21% of weekly visits), while Sundays are slowest. We should staff accordingly."
Example 3: Survey Responses
Question: "How satisfied are you with our service?" (1 = Very Dissatisfied, 5 = Very Satisfied)
| Rating | Count | Percentage |
|---|---|---|
| 1 - Very Dissatisfied | 12 | 6% |
| 2 - Dissatisfied | 18 | 9% |
| 3 - Neutral | 45 | 22.5% |
| 4 - Satisfied | 80 | 40% |
| 5 - Very Satisfied | 45 | 22.5% |
| Total | 200 | 100% |
Insight: "62.5% of customers are satisfied or very satisfied (ratings 4-5), while 15% are dissatisfied (ratings 1-2)."
Pro tip: Always include both counts and percentages in frequency tables. Counts show magnitude, percentages show proportion.
Distribution Analysis
Distribution analysis shows how values are spread out across your dataset. It goes beyond simple averages to reveal the shape and pattern of your data.
Key concepts:
- Range: The difference between the maximum and minimum values (shows the spread)
- Distribution shape: Are most values clustered around the average, or spread out evenly?
- Normal distribution: Bell curve—most values near the average, fewer at extremes
- Skewed distribution: Values bunched on one side with a long tail on the other
Example: Test Scores Distribution
Dataset: 50 students' test scores (0-100)
Summary statistics:
- Minimum: 52
- Maximum: 98
- Range: 46 points (98 - 52)
- Mean: 78
- Median: 79
Distribution (grouped by score range):
| Score Range | Count | Percentage |
|---|---|---|
| 90-100 (A) | 8 | 16% |
| 80-89 (B) | 18 | 36% |
| 70-79 (C) | 15 | 30% |
| 60-69 (D) | 7 | 14% |
| Below 60 (F) | 2 | 4% |
Shape: This is roughly normal—most students scored in the 70-89 range (66%), with fewer at the extremes.
Insight: "The class performed well overall, with two-thirds earning a B or C. Only 4% failed."
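Grouping values into ranges is a short function plus a counter. The lesson does not list the 50 raw scores, so the ten scores below are hypothetical and chosen only to show the mechanics; the function name `band` is also an illustrative choice.

```python
from collections import Counter

# Hypothetical scores for illustration; not the lesson's actual dataset.
scores = [52, 58, 64, 71, 75, 78, 82, 85, 88, 93]

def band(score):
    """Map a 0-100 score to the grade ranges used in the table above."""
    if score >= 90:
        return "90-100 (A)"
    if score >= 80:
        return "80-89 (B)"
    if score >= 70:
        return "70-79 (C)"
    if score >= 60:
        return "60-69 (D)"
    return "Below 60 (F)"

score_range = max(scores) - min(scores)          # spread: max minus min
distribution = Counter(band(s) for s in scores)  # count of scores per range

print(f"Range: {score_range} points")
for label, n in distribution.items():
    print(f"{label:<14}{n:>3}{100 * n / len(scores):>6.0f}%")
```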
Normal Distribution
Shape: Bell curve—symmetric around the mean
Characteristics: Most values near the average, fewer at extremes
Example: Heights of adult men (most are 5'8"-5'10", fewer are very short or very tall)
Implication: Mean and median are similar and both represent the typical value well
Right-Skewed Distribution
Shape: Bunched on the left with a long tail to the right
Characteristics: Most values are low with a few very high outliers
Example: Income (most people earn moderate incomes, a few earn millions)
Implication: Mean > Median (outliers pull the mean higher). Use median for "typical" value.
Left-Skewed Distribution
Shape: Bunched on the right with a long tail to the left
Characteristics: Most values are high with a few very low outliers
Example: Age of retirement (most people retire 60-70, a few retire early at 40-50)
Implication: Mean < Median (outliers pull the mean lower). Use median for "typical" value.
Why distribution matters: Two datasets can have the same mean but completely different distributions. Understanding the shape helps you choose the right summary statistic and interpret what's happening.
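To make this concrete, here is a small sketch with three hypothetical datasets that all have a mean of 50 but different shapes. Comparing mean and median, as `shape_hint` does, is only a rough heuristic for the direction of skew, not a formal test; the function name and data are assumptions for illustration.

```python
import statistics

# Three hypothetical datasets with the same mean (50) but different shapes.
symmetric = [48, 49, 50, 51, 52]
right_skewed = [30, 32, 35, 38, 115]   # long tail of high values
left_skewed = [2, 58, 60, 64, 66]      # long tail of low values

def shape_hint(values):
    """Rough heuristic: compare mean and median to guess the skew direction."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right-skewed (mean > median)"
    if mean < median:
        return "left-skewed (mean < median)"
    return "roughly symmetric (mean = median)"

for name, data in [("symmetric", symmetric),
                   ("right_skewed", right_skewed),
                   ("left_skewed", left_skewed)]:
    print(f"{name:<13} mean={statistics.mean(data)} "
          f"median={statistics.median(data)} -> {shape_hint(data)}")
```

Reporting only the mean would make these three datasets look identical, which is why the median and the shape are worth checking as well.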
Aggregation and Grouping
Aggregation combines multiple rows into a single summary value. Grouping lets you create separate summaries for different categories.
1. Simple Aggregation (No Grouping)
Question: "What was total sales?"
Process: Sum all sales values across all rows
Result: "Total sales: $125,000"
2. Single-Level Grouping
Question: "What was total sales by region?"
Process: Group rows by region, then sum sales for each group
Result: "North: $45K, South: $38K, East: $25K, West: $17K"
3. Multi-Level Grouping
Question: "What was total sales by region and product?"
Process: Group by region, then by product within each region, then sum
Result: "North-ProductA: $20K, North-ProductB: $25K, South-ProductA: $18K..."
Example: Sales Aggregation and Grouping
Raw data (simplified):
| Region | Product | Month | Sales |
|---|---|---|---|
| North | Widget | Jan | $5,000 |
| North | Widget | Feb | $6,200 |
| North | Gadget | Jan | $3,800 |
| South | Widget | Jan | $4,500 |
| South | Gadget | Jan | $2,900 |
| ... (many more rows) | | | |
Aggregated by Region (Total Sales):
| Region | Total Sales |
|---|---|
| North | $45,000 |
| South | $38,000 |
| East | $25,000 |
| West | $17,000 |
Aggregated by Region and Product:
| Region | Product | Total Sales |
|---|---|---|
| North | Widget | $28,000 |
| North | Gadget | $17,000 |
| South | Widget | $22,000 |
| South | Gadget | $16,000 |
| ... (continued for East and West) | | |
Insight: "Widgets outsell Gadgets in every region, with the North region being our top market for both products."
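The three levels of aggregation can be sketched in plain Python with dictionaries. The code uses only the five raw rows shown above, so its totals are smaller than the full-dataset totals in the aggregated tables; the variable names are illustrative.

```python
from collections import defaultdict

# Only the five raw rows listed above (the full dataset has many more).
rows = [
    ("North", "Widget", "Jan", 5000),
    ("North", "Widget", "Feb", 6200),
    ("North", "Gadget", "Jan", 3800),
    ("South", "Widget", "Jan", 4500),
    ("South", "Gadget", "Jan", 2900),
]

# 1. Simple aggregation: one total across all rows.
total_sales = sum(sales for *_, sales in rows)

# 2 and 3. Single-level and multi-level grouping: one running sum per group key.
by_region = defaultdict(int)
by_region_product = defaultdict(int)
for region, product, _month, sales in rows:
    by_region[region] += sales
    by_region_product[(region, product)] += sales

print(f"Total: ${total_sales:,}")
for region, sales in by_region.items():
    print(f"{region}: ${sales:,}")
for (region, product), sales in by_region_product.items():
    print(f"{region}-{product}: ${sales:,}")
```

The only difference between single-level and multi-level grouping here is the key: a single value (`region`) versus a tuple (`region, product`).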
Practice: Calculate Descriptive Statistics
Test your understanding with these exercises. Try to calculate the answers manually first, then check your work.
Exercise 1: Daily Sales
A coffee shop's sales for one week:
Data: $420, $380, $410, $450, $490, $620, $580
Calculate:
- a) Total sales for the week
- b) Average daily sales
- c) Minimum and maximum
- d) Range
Exercise 2: Finding the Median
Customer ages: 22, 35, 28, 42, 31, 29, 38
Calculate:
- a) The median age
- b) The mean age
- c) Are they similar or different? Why?
Exercise 3: Mean vs Median with Outliers
House prices in a neighborhood: $200K, $220K, $195K, $210K, $1,500K
Calculate:
- a) The mean price
- b) The median price
- c) Which better represents the "typical" house price? Why?
Exercise 4: Frequency Analysis
T-shirt sizes sold today: S, M, M, L, M, XL, M, L, M, S, M, L, M, M
Calculate:
- a) Create a frequency table
- b) What is the mode (most common size)?
- c) What percentage of sales were Medium?
Exercise 5: Grouping and Aggregation
Sales by salesperson and region:
| Salesperson | Region | Sales |
|---|---|---|
| Alice | North | $12,000 |
| Bob | North | $15,000 |
| Alice | South | $9,000 |
| Bob | South | $11,000 |
| Carol | North | $13,000 |
Calculate:
- a) Total sales by region
- b) Total sales by salesperson
- c) Overall total sales
Exercise 6: Choosing the Right Statistic
For each scenario, identify which summary statistic is most appropriate:
- a) Finding the typical price in a neighborhood with one mansion and many modest homes
- b) Determining the most popular product color
- c) Calculating total revenue for the quarter
- d) Finding the average when all values are relatively similar
Common Mistakes in Descriptive Analysis
Even simple descriptive analysis can go wrong. Here are the most common mistakes and how to avoid them.
❌ Mistake 1: Using Mean When Median is Better
Problem: Using average (mean) when you have extreme outliers
Example: "Average salary is $200K" when most employees make $50K but the CEO makes $2M
Fix: Use median for data with outliers. Check the distribution before choosing a statistic.
❌ Mistake 2: Ignoring Outliers
Problem: Not acknowledging extreme values that might skew your results or indicate errors
Example: A customer "age" of 250 is clearly a data entry error, but it pulls the average age way up
Fix: Always check min/max values. Investigate outliers—are they errors, or legitimate extreme cases?
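A min/max sanity check can be automated in a few lines. In this sketch the ages are hypothetical, and the bounds of 0 to 120 are an assumed plausibility range that would need adjusting for other kinds of data.

```python
# Hypothetical customer ages, including one likely data entry error.
ages = [34, 29, 41, 250, 38]

# Assumed plausibility bounds for a human age; adjust for your own data.
MIN_AGE, MAX_AGE = 0, 120

suspicious = [a for a in ages if not MIN_AGE <= a <= MAX_AGE]
plausible = [a for a in ages if MIN_AGE <= a <= MAX_AGE]

mean_all = sum(ages) / len(ages)
mean_plausible = sum(plausible) / len(plausible)

print(f"Min: {min(ages)}, Max: {max(ages)}")
print(f"Values outside {MIN_AGE}-{MAX_AGE}: {suspicious}")
print(f"Mean with all values: {mean_all:.1f}")
print(f"Mean of plausible values: {mean_plausible:.1f}")
```

Flagged values should be investigated rather than dropped automatically, since some extreme values are legitimate.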
❌ Mistake 3: Not Providing Context
Problem: Reporting numbers without context that helps interpret them
Example: "We had 500 website visitors" — Is that good or bad? Compared to what?
Fix: Provide comparisons: "500 visitors, up 20% from last month" or "500 visitors, below our goal of 750"
❌ Mistake 4: Comparing Incompatible Statistics
Problem: Comparing statistics that aren't directly comparable
Example: "Product A sold 500 units and Product B made $10,000" — You can't compare units to dollars
Fix: Use the same metric: "Product A made $8,000, Product B made $10,000" OR "Product A sold 500 units, Product B sold 400 units"
Key Takeaways
- Descriptive analysis answers "What happened?" by summarizing past events
- Five key summary statistics: Count, Sum, Mean, Min, Max
- Three types of averages: Mean (arithmetic average), Median (middle value), Mode (most common)
- Use median when you have outliers that would skew the mean
- Frequency analysis counts how often each value/category appears
- Distribution shows the spread: Normal (bell curve) vs. skewed (long tail)
- Aggregation and grouping: Summarize by categories (region, product, time)
- Always provide context: Numbers are more meaningful with comparisons
📝 Knowledge Check
1. Descriptive analysis primarily answers which question?
2. Dataset: 10, 12, 15, 18, 20. What is the mean?
3. When should you use median instead of mean?
4. Dataset: 5, 8, 8, 8, 12, 15. What is the mode?
5. What does frequency analysis tell you?
6. Dataset: Min = 20, Max = 100. What is the range?
7. What is aggregation in descriptive analysis?
8. Why is providing context important in descriptive analysis?
9. Dataset: 3, 7, 9, 12, 15. What is the median?
10. What is the main difference between single-level and multi-level grouping?