15 
Group By, Having, With, Union All
Now that we have covered some of the basics of SQL queries, we can move on to more advanced topics. Up to now, our queries returned as many rows as passed the where filtering. This filter applies to the data set produced by the from clause and the joins between its relations.
Outer joins might produce more rows than you have in your reference data set, and a cross join produces the full Cartesian product of its inputs.
In this section, we’ll have a look at aggregates. They work by computing a digest value for several input rows at a time. With aggregates, we can return a summary containing far fewer rows than passed the where filter.
Aggregates (aka Map/Reduce): Group By
The group by clause introduces aggregates in SQL, and allows implementing much the same thing as map/reduce in other systems: map your data into different groups, and in each group reduce the data set to a single value.
As a first example, we can count how many races have been run in each decade:
    select extract('year'
                   from
                   date_trunc('decade', date))
           as decade,
           count(*)
      from races
  group by decade
  order by decade;
PostgreSQL offers a rich set of date and time functions:
 decade │ count
════════╪═══════
   1950 │    84
   1960 │   100
   1970 │   144
   1980 │   156
   1990 │   162
   2000 │   174
   2010 │   156
(7 rows)
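To make the map/reduce analogy concrete, here is a minimal Python sketch of what group by does conceptually: rows are first routed to a group by their key, then each group is reduced to a single value. The sample rows and the helper names decade_of and count_by_decade are made up for illustration; they are not the book's data set, and this is not how PostgreSQL implements grouping internally.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows standing in for the races table (not the real data set).
races = [
    {"raceid": 1, "date": date(1950, 5, 13)},
    {"raceid": 2, "date": date(1957, 8, 4)},
    {"raceid": 3, "date": date(1961, 5, 14)},
    {"raceid": 4, "date": date(1968, 1, 1)},
    {"raceid": 5, "date": date(1969, 9, 7)},
    {"raceid": 6, "date": date(2010, 3, 14)},
]


def decade_of(d):
    """Rough equivalent of extract('year' from date_trunc('decade', d))."""
    return d.year - d.year % 10


def count_by_decade(rows):
    # "Map" step: route each row to its group key.
    groups = defaultdict(list)
    for row in rows:
        groups[decade_of(row["date"])].append(row)
    # "Reduce" step: collapse each group to a single value, here count(*).
    # Sorting the keys plays the role of "order by decade".
    return [(decade, len(members)) for decade, members in sorted(groups.items())]


for decade, count in count_by_decade(races):
    print(decade, count)
```

Swapping len for another reducing function (a sum, a maximum) gives the equivalent of another aggregate over the same groups.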
The difference between each decade and the previous one is easy to compute thanks to window functions, covered later in this chapter. Let’s have a preview:
with races_per_decade
as (
     select extract('year'
                    from
                    date_trunc('decade', date))
            as decade,
            count(*) as nbraces
       from races
   group by decade
   order by decade
)
select decade, nbraces,
       case
         when lag(nbraces, 1)
                over(order by decade) is null
         then ''

         when nbraces - lag(nbraces, 1)
                          over(order by decade)
              < 0
         then format('-%3s',
                       lag(nbraces, 1)
                         over(order by decade)
                       - nbraces)

         else format('+%3s',
                       nbraces
                       - lag(nbraces, 1)
                           over(order by decade))

       end as evolution
  from races_per_decade;
We use a fairly complex case expression to shape the exact output we want from the query. Other than that, the query relies on the lag() over(order by decade) expression, which gives access to the previous row and so allows us to compute the difference between the current row and the previous one.
Here’s what we get from the previous query:
 decade │ nbraces │ evolution
════════╪═════════╪═══════════
   1950 │      84 │
   1960 │     100 │ + 16
   1970 │     144 │ + 44
   1980 │     156 │ + 12
   1990 │     162 │ +  6
   2000 │     174 │ + 12
   2010 │     156 │ - 18
(7 rows)
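The logic of that query can be sketched in a few lines of Python. The input rows are the per-decade counts shown earlier in this section; the lag and evolution helpers are hypothetical names written for this illustration, and they only mimic what lag(nbraces, 1) over(order by decade) and the case expression compute on an already ordered list.

```python
# (decade, nbraces) rows, as returned by the first query of this section.
races_per_decade = [
    (1950, 84), (1960, 100), (1970, 144), (1980, 156),
    (1990, 162), (2000, 174), (2010, 156),
]


def lag(values, offset=1):
    """Mimic lag(x, offset) over an ordered window: None when there is no such row."""
    return [values[i - offset] if i >= offset else None
            for i in range(len(values))]


def evolution(rows):
    counts = [nbraces for _, nbraces in rows]
    out = []
    for (decade, nbraces), previous in zip(rows, lag(counts)):
        if previous is None:
            label = ""                                 # first row: nothing to compare with
        elif nbraces - previous < 0:
            label = "-%3s" % (previous - nbraces)      # like format('-%3s', ...)
        else:
            label = "+%3s" % (nbraces - previous)      # like format('+%3s', ...)
        out.append((decade, nbraces, label))
    return out


for row in evolution(races_per_decade):
    print(row)
```

The point of the window function is the same as the zip here: each row is processed together with a value taken from a neighboring row, without collapsing the rows into groups.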
As the previous query shows, we can also prepare the data set in a separate query that is run first; this is called a common table expression and is introduced by the with clause. We will expand on that idea in the upcoming pages.
PostgreSQL comes with the usual aggregates you would expect such as sum, count, and avg, and also with some more interesting ones such as bool_and. As its name suggests the bool_and aggregate starts true and remains true only if every row it sees evaluates to true.
With that aggregate, it’s then possible to search for all drivers who failed to finish every single race they participated in over their whole career:
with counts as
 (
   select driverid, forename, surname,
          count(*) as races,
          bool_and(position is null) as never_finished
     from drivers
          join results using(driverid)
          join races using(raceid)
 group by driverid
 )
   select driverid, forename, surname, races
     from counts
    where never_finished
 order by races desc;
Well, it turns out that this happens quite often. The previous query gives us 202 drivers who never finished a single race they took part in; that said, 117 of them only ever participated in a single race.
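For readers more used to application code, the bool_and aggregate behaves much like Python's all() applied to each group. The following is a minimal sketch under that assumption; the sample rows and the never_finished helper are invented for illustration and do not come from the book's data set.

```python
# Hypothetical (driverid, position) result rows; None stands for SQL null,
# meaning the driver did not reach the finish line in that race.
results = [(1, None), (1, None), (2, 3), (2, None), (3, None)]


def never_finished(rows):
    """Return {driverid: (races, never_finished)}, one entry per group."""
    by_driver = {}
    for driverid, position in rows:
        # Evaluate "position is null" for each row, grouped by driver.
        by_driver.setdefault(driverid, []).append(position is None)
    # len() plays the role of count(*); all() mirrors bool_and, which is
    # true only if every row in the group evaluates to true.
    return {driverid: (len(flags), all(flags))
            for driverid, flags in by_driver.items()}


for driverid, (races, flag) in never_finished(results).items():
    print(driverid, races, flag)
```

Driver 2 finished one race out of two, so the flag for that group is false, just as bool_and would report.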
|
Not picking on anyone in particular, we can find out if some seasons were less lucky than others on that basis and search for drivers who didn’t finish a single race they participated in, per season:
with counts as
 (
   select date_trunc('year', date) as year,
          count(*) filter(where position is null) as outs,
          bool_and(position is null) as never_finished
     from drivers
          join results using(driverid)
          join races using(raceid)
 group by date_trunc('year', date), driverid
 )
   select extract(year from year) as season,
          sum(outs) as "#times any driver didn't finish a race"
     from counts
    where never_finished
 group by season
 order by sum(outs) desc
    limit 5;
In this query, you can see the aggregate filter(where …) syntax, which restricts an aggregate's computation to those rows that pass the filter. Here we choose to count all race results where the position is null, which means the driver didn’t make it to the finish line for some reason…
 season │ #times any driver didn't finish a race
════════╪════════════════════════════════════════
   1989 │                                    139
   1953 │                                     51
   1955 │                                     48
   1990 │                                     48
   1956 │                                     46
(5 rows)
It turns out that overall, 1989 was a pretty bad season.
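The two levels of aggregation in that query can be sketched in Python as follows. This is only an illustration of the logic, with made-up sample rows and a hypothetical outs_per_season helper: the inner step groups by season and driver, counting only the rows that pass the filter, and the outer step sums those counts per season for drivers who never finished.

```python
# Hypothetical (season, driverid, position) rows; None stands for SQL null.
results = [
    (1989, 1, None), (1989, 1, None),
    (1989, 2, 4), (1989, 2, None),
    (1990, 1, None),
    (1990, 3, 1),
]


def outs_per_season(rows):
    # Inner query: group by (season, driverid).
    per_driver = {}
    for season, driverid, position in rows:
        per_driver.setdefault((season, driverid), []).append(position)

    # Outer query: keep the never_finished groups and sum their outs per season.
    totals = {}
    for (season, _driverid), positions in per_driver.items():
        # count(*) filter(where position is null): only matching rows are counted.
        outs = sum(1 for p in positions if p is None)
        # bool_and(position is null)
        never_finished = all(p is None for p in positions)
        if never_finished:
            totals[season] = totals.get(season, 0) + outs

    # order by sum(outs) desc
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for season, outs in outs_per_season(results):
    print(season, outs)
```

Note that driver 2 in the 1989 sample has one out, but it is not counted in the total because that driver did finish a race that season.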
Aggregates Without a Group By
It is possible to compute aggregates over a data set without using the group by clause in SQL. What it then means is that we are operating over a single group
