Why We Need GROUP BY
Imagine this question: "What is the total earned by each participant in Q1 2026?" Without GROUP BY, you'd get one row per result transaction. With GROUP BY, you collapse those rows into one row per participant with a SUM of their earnings.
GROUP BY groups rows by one or more columns and lets you summarize each group with aggregate functions.
The Five Aggregate Functions
| Function | Purpose | Example |
|---|---|---|
COUNT(*) |
Count all rows (including NULLs) | COUNT(*) returns 5 if there are 5 rows |
COUNT(column) |
Count non-NULL values in a column | COUNT(RESULT_AMOUNT) counts result rows (skips NULLs) |
SUM(amount) |
Add up all values | SUM(RESULT_AMOUNT) = total earnings |
AVG(amount) |
Average of all values (NULLs excluded) | AVG(RESULT_AMOUNT) = average earning per transaction |
MIN(value)MAX(value) |
Minimum / Maximum value | MAX(RESULT_AMOUNT) = highest single earning |
GROUP BY Syntax and The GROUP BY Rule
Here's the critical rule: Every column in SELECT must either be in GROUP BY or wrapped in an aggregate function.
SELECT PARTICIPANT_ID, -- In GROUP BY ✓ SUM(RESULT_AMOUNT) AS TOTAL_EARNED -- Aggregate function ✓ FROM CSC_RESULTS WHERE PERIOD_ID = '2026-Q1' GROUP BY PARTICIPANT_ID ORDER BY TOTAL_EARNED DESC;
This query:
- Groups all results by PARTICIPANT_ID
- For each group, calculates SUM(RESULT_AMOUNT)
- Returns one row per participant with their total
If you tried to SELECT a column not in GROUP BY (like NAME), the database returns an error. It doesn't know which NAME to pick from potentially multiple results rows.
Real ICM Example: Total Earnings by Participant
Here's a production query that combines everything: participants, results, and aggregations:
SELECT p.PARTICIPANT_ID, p.NAME, p.REGION, COUNT(r.ID) AS TRANSACTION_COUNT, SUM(r.RESULT_AMOUNT) AS TOTAL_EARNED, AVG(r.RESULT_AMOUNT) AS AVG_PER_TRANSACTION, MAX(r.RESULT_AMOUNT) AS HIGHEST_EARNING, MIN(r.RESULT_AMOUNT) AS LOWEST_EARNING FROM CSC_PARTICIPANT p LEFT JOIN CSC_RESULTS r ON r.PARTICIPANT_ID = p.PARTICIPANT_ID WHERE p.STATUS = 'ACTIVE' AND r.PERIOD_ID = '2026-Q1' GROUP BY p.PARTICIPANT_ID, p.NAME, p.REGION ORDER BY TOTAL_EARNED DESC;
GROUP BY includes PARTICIPANT_ID, NAME, and REGION. All three must be in GROUP BY because they're in SELECT but not aggregated. The database groups by all three, ensuring one row per unique combination (though typically PARTICIPANT_ID is unique).
HAVING: Filtering After Aggregation
WHERE filters before aggregation. HAVING filters after. This matters:
-- Find active participants WHERE p.STATUS = 'ACTIVE'; -- Filters rows before grouping ✓ -- Find groups where total earned > $10,000 HAVING SUM(r.RESULT_AMOUNT) > 10000; -- Filters groups after aggregation ✓
The execution order is: FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. This is why HAVING can reference aggregate functions (they've already been calculated by then).
Real scenario: "Show me participants who earned more than $10,000 in Q1 2026, including their count of transactions and average earning per transaction."
SELECT p.NAME, COUNT(r.ID) AS TRANSACTIONS, SUM(r.RESULT_AMOUNT) AS TOTAL_EARNED, AVG(r.RESULT_AMOUNT) AS AVG_TRANSACTION FROM CSC_PARTICIPANT p LEFT JOIN CSC_RESULTS r ON r.PARTICIPANT_ID = p.PARTICIPANT_ID WHERE p.STATUS = 'ACTIVE' AND r.PERIOD_ID = '2026-Q1' GROUP BY p.PARTICIPANT_ID, p.NAME HAVING SUM(r.RESULT_AMOUNT) > 10000 ORDER BY TOTAL_EARNED DESC;
The WHERE clause filters to active participants in Q1 2026. GROUP BY groups by participant. HAVING then filters to only those groups with SUM > 10000. You can't use HAVING without GROUP BY (it has no meaning without groups).
NULL Behavior in Aggregates
NULLs are tricky in aggregates:
COUNT(*)counts all rows, including those with NULL valuesCOUNT(column)counts non-NULL values in that column. If column is NULL, the row is skippedSUM(amount)skips NULL rows. SUM of (100, NULL, 200) = 300, not NULLAVG(amount)skips NULL rows. AVG of (100, NULL, 200) = 150, not NULLMIN/MAX(amount)skip NULLs
-- Example data: (100, NULL, 200) COUNT(*) = 3 -- all rows, including NULL COUNT(amount) = 2 -- only non-NULL SUM(amount) = 300 -- 100 + 200, NULL skipped AVG(amount) = 150 -- 300 / 2, not 300 / 3
In ICM, this matters: if a participant has some results and some missing results (NULL), the aggregates correctly skip the missing data.
GROUP BY Multiple Columns: Earnings by Period and Region
You can group by more than one column:
SELECT r.PERIOD_ID, p.REGION, COUNT(r.ID) AS TRANSACTIONS, SUM(r.RESULT_AMOUNT) AS TOTAL_EARNED, AVG(r.RESULT_AMOUNT) AS AVG_TRANSACTION FROM CSC_PARTICIPANT p LEFT JOIN CSC_RESULTS r ON r.PARTICIPANT_ID = p.PARTICIPANT_ID WHERE p.STATUS = 'ACTIVE' GROUP BY r.PERIOD_ID, p.REGION ORDER BY r.PERIOD_ID ASC, TOTAL_EARNED DESC;
This groups by both PERIOD_ID and REGION, returning one row per unique combination. You get earnings totals broken down by period and region.
Avoiding GROUP BY Errors
Common mistakes:
- "Column X is not in GROUP BY": You selected X without aggregating it and it's not in GROUP BY. Either add it to GROUP BY or wrap it in an aggregate.
- "Aggregate function not allowed": You used COUNT or SUM in WHERE. Move it to HAVING.
- "Unexpected NULL values": Check if your JOIN produced NULLs. Use LEFT JOIN carefully when aggregating.
- "Wrong totals": Verify the WHERE clause filters correctly. If you filter after the JOIN, you might reduce the aggregate scope.
What Comes Next
GROUP BY and aggregates answer "what is the total?" In Lesson 05, we'll learn advanced queries that answer "rank participants by earnings" or "show running totals by period" without collapsing rows. That's where window functions and CTEs come in.