DAY 10: SQL - Subqueries

Guru Charan Pathalla

DAY 10: SQL - Subqueries

Learning Objectives: - Use subqueries in WHERE clause - Create subqueries in SELECT - Understand IN and EXISTS operators - Learn correlated vs non-correlated subqueries

Topics Covered (60 min):

1. WHERE Subquery (Non-Correlated) - SELECT * FROM Orders WHERE Sales > (SELECT AVG(Sales) FROM Orders) - Find records above/below aggregate - Inner query runs once

2. IN Operator - WHERE CustomerID IN (SELECT ID FROM Customers WHERE State='CA') - Tests if value is in subquery results - Returns multiple values

3. EXISTS Operator - WHERE EXISTS (SELECT 1 FROM OrderDetails WHERE OrderID = Orders.ID) - Tests if subquery returns ANY rows - Usually faster than IN for large datasets - Returns TRUE/FALSE

4. SELECT Subquery (Scalar) - SELECT OrderID, (SELECT Name FROM Customers WHERE ID=Orders.CustomerID) AS Customer - Must return single value - Adds calculated column

5. Correlated Subquery - References outer query - WHERE Salary > (SELECT AVG(Salary) FROM Employees E2 WHERE E2.Department = E1.Department) - Runs for each row (slower)

Video Resources:

  1. CASE Statement and UNION SQL interview questions (10:00)
    https://www.youtube.com/watch?v=F2JJWM40Dh4

  2. How to Use SQL Case Statements (13:40)
    https://www.youtube.com/watch?v=iq-geVaMOos

  3. SQL Tutorial #36 - SQL UNION Operator (7:49)
    https://www.youtube.com/watch?v=7QaIPZvu1NI

AI Learning Prompts (Concept Understanding Only):

Understanding Subqueries Basics:

Explain SQL subqueries:

1. What is a subquery and why use it?
2. Where can subqueries be placed (WHERE, SELECT, FROM)?
3. How do I write a subquery: SELECT ... WHERE col > (SELECT ...)?
4. Show me examples using Employee data (find employees with salary > average)
5. What's the difference between subquery and JOIN?
6. When should I use subquery vs JOIN?

Use HR or academic data for examples (not sales).

Understanding IN and EXISTS:

Explain IN and EXISTS with subqueries:

1. How does IN operator work with subqueries?
2. What's the syntax: WHERE column IN (SELECT ...)?
3. How does EXISTS differ from IN?
4. When is EXISTS more efficient than IN?
5. Show me examples: Find students who are enrolled in at least one class
6. What's the difference between NOT IN and NOT EXISTS?

Use school enrollment or membership data for examples.

Understanding Correlated Subqueries:

Explain correlated subqueries:

1. What makes a subquery "correlated"?
2. How does it reference the outer query?
3. Show me examples: Find employees earning more than average in their department
4. Why are correlated subqueries slower?
5. Can I rewrite correlated subqueries as JOINs?

Use departmental or grouped data for examples.

Understanding Subqueries in SELECT:

Explain subqueries in the SELECT clause:

1. How do I use subqueries to add calculated columns?
2. What are scalar subqueries (returning single value)?
3. Show me examples: Add "Department Average Salary" column to employee list
4. When should I use subquery vs JOIN for this?

Use organizational hierarchy examples.

Guided Practice (60 min):

📝 How to approach Guided Practice:

  • Follow step-by-step: Work through each exercise in order
  • Use provided resources: Reference the video tutorials and AI prompts if you get stuck
  • Verify your work: Check your results match expected outcomes
  • Take notes: Document key learnings and shortcuts you discover
  • Ask for help: Use AI assistants to clarify concepts you don't understand

Exercise 1: Find Orders with Sales Above Average (Basic Subquery in WHERE)

Goal: Learn to use subqueries in the WHERE clause to compare against aggregate values.

Step-by-Step Instructions:

  1. Open your SQL editor (pgAdmin, MySQL Workbench, or online SQL environment)
  2. Start with the inner subquery first to understand what value it returns:
    SELECT AVG(Sales) FROM Orders;
    This calculates the average sales across all orders (e.g., $500)
  3. Now write the full query with the subquery in the WHERE clause:
    SELECT OrderID, CustomerID, Sales, OrderDate
    FROM Orders
    WHERE Sales > (SELECT AVG(Sales) FROM Orders)
    ORDER BY Sales DESC;
  4. Execute the query (press F5 or click Execute)
  5. Analyze the results: Notice all returned orders have sales greater than the average

💡 Tips:

  • Test the subquery separately first - Run just the SELECT AVG(Sales)... to see what value it returns
  • Subquery runs once: Non-correlated subqueries execute only one time, then the result is used for comparison
  • Parentheses are required: Always wrap subqueries in parentheses (SELECT ...)
  • ORDER BY helps verification: Sort by Sales DESC to easily verify all values are above average
  • Alternative approach: You could use a CTE or JOIN with a derived table, but subqueries are cleaner for this use case

✅ Expected Result: A list of orders where Sales value is higher than the average (e.g., if average is $500, you'll see only orders with $501+)

🔄 Connection to Previous Learning: This is similar to Excel's AVERAGEIF from Day 3, but more flexible - you're filtering based on a calculated aggregate.

Exercise 2: Select Customers Who Have Placed Orders (Using IN Subquery)

Goal: Learn to use IN operator with subqueries to filter based on a list of values.

Step-by-Step Instructions:

  1. Understand the scenario: You have a Customers table and an Orders table. You want to find only customers who have placed at least one order
  2. First, test the subquery alone to see what it returns:
    SELECT DISTINCT CustomerID FROM Orders;
    This returns a list of all customer IDs that appear in Orders (e.g., 101, 105, 108...)
  3. Now write the full query using IN:
    SELECT CustomerID, CustomerName, State, City
    FROM Customers
    WHERE CustomerID IN (SELECT DISTINCT CustomerID FROM Orders)
    ORDER BY CustomerName;
  4. Execute and verify - All returned customers should have orders
  5. Try the opposite (NOT IN) to find customers without orders:
    SELECT CustomerID, CustomerName, State, City
    FROM Customers
    WHERE CustomerID NOT IN (SELECT DISTINCT CustomerID FROM Orders);

💡 Tips:

  • IN vs JOIN: This could also be done with INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID, but IN is more intuitive for this use case
  • Use DISTINCT in subquery: Prevents duplicate customer IDs in the list (though IN handles duplicates fine)
  • NULL handling: Be careful with NOT IN if the subquery can return NULL - it can cause unexpected results. Use WHERE CustomerID IS NOT NULL in subquery
  • Performance: For large datasets, EXISTS might be faster than IN
  • Debugging tip: Run the subquery separately to verify it returns the expected list

✅ Expected Result: A list of customers from the Customers table who have at least one corresponding order in the Orders table

🔄 Connection to Previous Learning: Similar to Excel VLOOKUP filtering from Day 5, but here you're finding matching records across tables.

Exercise 3: Find Products That Have Been Sold (Using EXISTS)

Goal: Learn to use EXISTS operator for existence checks - often more efficient than IN.

Step-by-Step Instructions:

  1. Understand EXISTS: Returns TRUE if the subquery returns any rows, FALSE if it returns zero rows
  2. Write the EXISTS query to find products that have been sold:
    SELECT ProductID, ProductName, Category, Price
    FROM Products P
    WHERE EXISTS (
        SELECT 1
        FROM OrderDetails OD
        WHERE OD.ProductID = P.ProductID
    )
    ORDER BY ProductName;
  3. Note the correlation: The subquery references P.ProductID from the outer query - this makes it a correlated subquery
  4. Execute and analyze - You'll see only products that appear in OrderDetails
  5. Try NOT EXISTS to find products that have never been sold:
    SELECT ProductID, ProductName, Category, Price
    FROM Products P
    WHERE NOT EXISTS (
        SELECT 1
        FROM OrderDetails OD
        WHERE OD.ProductID = P.ProductID
    );
  6. Compare with IN approach:
    -- Alternative using IN
    SELECT ProductID, ProductName, Category, Price
    FROM Products
    WHERE ProductID IN (SELECT ProductID FROM OrderDetails);
    Both return same results, but EXISTS is typically faster for large datasets

💡 Tips:

  • SELECT 1 vs SELECT *: In EXISTS, we use SELECT 1 because we only care if rows exist, not their values - more efficient
  • Performance advantage: EXISTS stops searching after finding first match; IN evaluates entire subquery
  • Correlated subquery: Notice OD.ProductID = P.ProductID - the subquery references outer query, so it runs for each product
  • Readability: EXISTS is more readable for "does this record exist" questions
  • When to use: Prefer EXISTS over IN when checking existence in large tables or when subquery might return NULLs

✅ Expected Result: A list of products from the Products table that have at least one sale recorded in OrderDetails (or with NOT EXISTS, products with zero sales)

🔄 Connection to Previous Learning: This relates to Day 6 JOIN concepts - you're essentially checking for matching records between tables.

Exercise 4: Add Customer Name Using Subquery in SELECT (Scalar Subquery)

Goal: Learn to use subqueries in the SELECT clause to add calculated columns.

Step-by-Step Instructions:

  1. Understand scalar subqueries: A subquery in SELECT must return exactly one value (one row, one column)
  2. Start with a basic Orders query:
    SELECT OrderID, CustomerID, Sales, OrderDate
    FROM Orders;
    This shows orders but only CustomerID, not the name
  3. Add customer name using a scalar subquery:
    SELECT
        OrderID,
        CustomerID,
        (SELECT CustomerName
         FROM Customers C
         WHERE C.CustomerID = Orders.CustomerID) AS CustomerName,
        Sales,
        OrderDate
    FROM Orders
    ORDER BY OrderDate DESC
    LIMIT 10;
  4. Execute and verify - Each order now shows the customer name retrieved from Customers table
  5. Add multiple scalar subqueries:
    SELECT
        OrderID,
        (SELECT CustomerName FROM Customers C WHERE C.CustomerID = Orders.CustomerID) AS CustomerName,
        (SELECT State FROM Customers C WHERE C.CustomerID = Orders.CustomerID) AS CustomerState,
        Sales,
        OrderDate
    FROM Orders
    WHERE Sales > 1000
    ORDER BY Sales DESC;
  6. Compare with JOIN approach:
    -- Alternative using JOIN (usually more efficient)
    SELECT
        O.OrderID,
        C.CustomerName,
        C.State AS CustomerState,
        O.Sales,
        O.OrderDate
    FROM Orders O
    INNER JOIN Customers C ON O.CustomerID = C.CustomerID
    WHERE O.Sales > 1000
    ORDER BY O.Sales DESC;
    Both approaches work, but JOIN is typically better for performance

💡 Tips:

  • Must return single value: If subquery returns multiple rows or columns, you'll get an error
  • Correlated subquery: Notice C.CustomerID = Orders.CustomerID - runs once per order row
  • Performance consideration: This runs the subquery for every row in Orders - can be slow for large datasets
  • When to use: Use scalar subqueries when you need just one or two extra columns; use JOINs when you need multiple columns from related table
  • NULL handling: If no matching customer found, subquery returns NULL
  • Alternative approach: For better performance, use JOIN instead (shown in step 6)

✅ Expected Result: Orders list with CustomerName column added, showing the name from Customers table for each order

🔄 Connection to Previous Learning: This is the SQL equivalent of Excel's VLOOKUP from Day 5 - looking up related data from another table.

Exercise 5: Find Top 10% of Orders by Revenue (Advanced Subquery)

Goal: Learn to use subqueries with percentiles and ranking logic for advanced filtering.

Step-by-Step Instructions:

  1. Understand the goal: Find orders in the top 10% by sales value (e.g., if you have 1000 orders, find the top 100)
  2. Method 1 - Using LIMIT with subquery (for MySQL/PostgreSQL):
    SELECT OrderID, CustomerID, Sales, OrderDate
    FROM Orders
    WHERE Sales >= (
        SELECT Sales
        FROM Orders
        ORDER BY Sales DESC
        LIMIT 1 OFFSET (SELECT COUNT(*) * 0.1 FROM Orders)
    )
    ORDER BY Sales DESC;
    This finds the sales value at the 90th percentile, then returns all orders >= that value
  3. Method 2 - Using PERCENT_RANK window function (more elegant):
    SELECT OrderID, CustomerID, Sales, OrderDate
    FROM (
        SELECT
            OrderID,
            CustomerID,
            Sales,
            OrderDate,
            PERCENT_RANK() OVER (ORDER BY Sales DESC) AS percentile
        FROM Orders
    ) AS RankedOrders
    WHERE percentile <= 0.10
    ORDER BY Sales DESC;
    This calculates the percentile for each order, then filters to top 10%
  4. Method 3 - Using COUNT subquery (works in all SQL databases):
    SELECT OrderID, CustomerID, Sales, OrderDate
    FROM Orders O1
    WHERE (
        SELECT COUNT(*)
        FROM Orders O2
        WHERE O2.Sales > O1.Sales
    ) <= (SELECT COUNT(*) * 0.10 FROM Orders)
    ORDER BY Sales DESC;
    This counts how many orders have higher sales than current order, keeps those in top 10%
  5. Execute your preferred method and analyze the results
  6. Verify the percentage:
    -- Count total orders
    SELECT COUNT(*) AS TotalOrders FROM Orders;
    
    -- Count orders in your result
    SELECT COUNT(*) AS Top10PercentOrders
    FROM [your query here];
    The second count should be ~10% of the first
  7. Extend to find bottom 10%:
    -- Using Method 2, just change the WHERE clause
    SELECT OrderID, CustomerID, Sales, OrderDate
    FROM (
        SELECT
            OrderID,
            CustomerID,
            Sales,
            OrderDate,
            PERCENT_RANK() OVER (ORDER BY Sales ASC) AS percentile
        FROM Orders
    ) AS RankedOrders
    WHERE percentile <= 0.10
    ORDER BY Sales ASC;

💡 Tips:

  • Choose the right method: Method 2 (PERCENT_RANK) is most efficient for modern databases; Method 3 works everywhere but slower
  • Subquery in FROM: Method 2 uses a subquery in FROM clause (called derived table or inline view) - very powerful technique
  • Window functions: PERCENT_RANK is a window function (preview of Day 12) - provides percentile without grouping
  • Performance: Method 1 and 2 are fast; Method 3 with correlated subquery can be slow on large datasets
  • Exact vs approximate: Depending on ties in Sales values, you might get slightly more or less than exactly 10%
  • Database differences: LIMIT syntax varies (LIMIT for MySQL/PostgreSQL, TOP for SQL Server, FETCH FIRST for Oracle)

✅ Expected Result: A list of orders representing the top 10% by sales value - if you have 1000 orders, you should see approximately 100 orders with the highest sales values

🔄 Connection to Previous Learning: This combines concepts from Day 8 (aggregates) with subqueries to create sophisticated filtering - similar to Excel's top 10% filtering with AutoFilter from Day 3.

🎮 Interactive SQL Playground 1: Above Average Sales

Practice finding orders with sales above average. Try modifying the query to find below-average sales or use different aggregates!

🎮 Interactive SQL Playground 2: Customers with Orders (IN)

Find customers who have placed orders using IN operator. Try using NOT IN to find customers without orders!

🎮 Interactive SQL Playground 3: Products Sold (EXISTS)

Find products that have been sold using EXISTS. Compare performance with IN approach!

🎮 Interactive SQL Playground 4: Scalar Subquery in SELECT

Add customer name to orders using scalar subquery. Try adding more columns like State or City!

🎮 Interactive SQL Playground 5: Top 10% by Revenue

Find orders in the top 10% by sales. Experiment with different percentiles or find bottom performers!

Independent Practice (60 min):

💡 Hints for Independent Practice:

  • Start simple: Begin with basic variations before attempting complex modifications
  • Break it down: Divide complex problems into smaller, manageable steps
  • Test incrementally: Verify each step works before moving to the next
  • Use documentation: Refer to official docs or tutorials when needed
  • Experiment: Try different approaches to find what works best
  1. Customers with above-average order value
  2. Products in top 3 revenue categories (subquery)
  3. Orders from California customers (IN subquery)
  4. Customers with more than 5 orders (subquery + HAVING)
  5. Products never ordered (NOT EXISTS)
  6. Orders above category average (correlated)
  7. High-value customers (top 20% by total sales)
  8. Products with above-average profit margin
  9. Regions performing above company average
  10. Customers with orders in last 30 days (subquery + date)
  11. Products in profitable categories only
  12. Orders from best customers (defined by criteria)

Daily Assignment (60 min):

Expected Outcomes:

✓ Write subqueries in various clauses
✓ Use IN and EXISTS effectively
✓ Understand when to use subqueries vs JOINs
✓ Submit: Day10_SQL_Practice.sql

📝 Day 10 Quick Quiz - Test Your Knowledge!

Q1. What is a subquery?

Q2. Where can subqueries be used?

Q3. What is a correlated subquery?

Q4. What does the IN operator do with a subquery?

Q5. Which is more efficient for checking existence?


Have feedback or questions?