
- About…
- About the Book
- About the Author
- Acknowledgements
- About the organisation of the books
- Structured Query Language
- A First Use Case
- Loading the Data Set
- Application Code and SQL
- Back to Discovering SQL
- Computing Weekly Changes
- Software Architecture
- Why PostgreSQL?
- The PostgreSQL Documentation
- Getting Ready to read this Book
- Business Logic
- Every SQL query embeds some business logic
- Business Logic Applies to Use Cases
- Correctness
- Efficiency
- Stored Procedures — a Data Access API
- Procedural Code and Stored Procedures
- Where to Implement Business Logic?
- A Small Application
- Readme First Driven Development
- Chinook Database
- Top-N Artists by Genre
- Intro to psql
- The psqlrc Setup
- Transactions and psql Behavior
- Discovering a Schema
- Interactive Query Editor
- SQL is Code
- SQL style guidelines
- Comments
- Unit Tests
- Regression Tests
- A Closer Look
- Indexing Strategy
- Indexing for Queries
- Choosing Queries to Optimize
- PostgreSQL Index Access Methods
- Advanced Indexing
- Adding Indexes
- An Interview with Yohann Gabory
- Get Some Data
- Structured Query Language
- Queries, DML, DDL, TCL, DCL
- Select, From, Where
- Anatomy of a Select Statement
- Projection (output): Select
- Restrictions: Where
- Order By, Limit, No Offset
- Ordering with Order By
- kNN Ordering and GiST indexes
- Top-N sorts: Limit
- No Offset, and how to implement pagination
- Group By, Having, With, Union All
- Aggregates (aka Map/Reduce): Group By
- Aggregates Without a Group By
- Restrict Selected Groups: Having
- Grouping Sets
- Common Table Expressions: With
- Distinct On
- Result Sets Operations
- Understanding Nulls
- Three-Valued Logic
- Not Null Constraints
- Outer Joins Introducing Nulls
- Using Null in Applications
- Understanding Window Functions
- Windows and Frames
- Partitioning into Different Frames
- Available Window Functions
- When to Use Window Functions
- Relations
- SQL Join Types
- An Interview with Markus Winand
- Serialization and Deserialization
- Some Relational Theory
- Attribute Values, Data Domains and Data Types
- Consistency and Data Type Behavior
- PostgreSQL Data Types
- Boolean
- Character and Text
- Server Encoding and Client Encoding
- Numbers
- Floating Point Numbers
- Sequences and the Serial Pseudo Data Type
- Universally Unique Identifier: UUID
- Date/Time and Time Zones
- Time Intervals
- Date/Time Processing and Querying
- Network Address Types
- Denormalized Data Types
- Arrays
- Composite Types
- Enum
- PostgreSQL Extensions
- An Interview with Grégoire Hubert
- Object Relational Mapping
- Tooling for Database Modeling
- How to Write a Database Model
- Generating Random Data
- Modeling Example
- Normalization
- Data Structures and Algorithms
- Normal Forms
- Database Anomalies
- Modeling an Address Field
- Primary Keys
- Foreign Key Constraints
- Not Null Constraints
- Check Constraints and Domains
- Exclusion Constraints
- Practical Use Case: Geonames
- Features
- Countries
- Modelization Anti-Patterns
- Entity Attribute Values
- Multiple Values per Column
- UUIDs
- Denormalization
- Premature Optimization
- Functional Dependency Trade-Offs
- Denormalization with PostgreSQL
- Materialized Views
- History Tables and Audit Trails
- Validity Period as a Range
- Pre-Computed Values
- Enumerated Types
- Multiple Values per Attribute
- The Sparse Matrix Model
- Denormalize with Care
- Not Only SQL
- Schemaless Design in PostgreSQL
- Durability Trade-Offs
- Another Small Application
- Insert, Update, Delete
- Insert Into
- Insert Into … Select
- Update
- Inserting Some Tweets
- Delete
- Tuples and Rows
- Deleting All the Rows: Truncate
- Isolation and Locking
- About SSI
- Putting Concurrency to the Test
- Computing and Caching in SQL
- Views
- Materialized Views
- Triggers
- Transactional Event Driven Processing
- Trigger and Counters Anti-Pattern
- Fixing the Behavior
- Event Triggers
- Listen and Notify
- PostgreSQL Notifications
- Notifications and Cache Maintenance
- Listen and Notify Support in Drivers
- Batch Update, MoMA Collection
- Updating the Data
- Concurrency Patterns
- On Conflict Do Nothing
- An Interview with Kris Jenkins
- Installing and Using PostgreSQL Extensions
- Finding PostgreSQL Extensions
- A Short List of Noteworthy Extensions
- Auditing Changes with hstore
- Introduction to hstore
- Comparing hstores
- Auditing Changes with a Trigger
- Testing the Audit Trigger
- From hstore Back to a Regular Record
- Last.fm Million Song Dataset
- Using Trigrams For Typos
- The pg_trgm PostgreSQL Extension
- Trigrams, Similarity and Searches
- Complete and Suggest Song Titles
- Trigram Indexing
- Denormalizing Tags with intarray
- Advanced Tag Indexing
- User-Defined Tags Made Easy
- The Most Popular Pub Names
- A Pub Names Database
- Normalizing the Data
- Geolocating the Nearest Pub (k-NN search)
- How far is the nearest pub?
- The earthdistance PostgreSQL contrib
- Pubs and Cities
- The Most Popular Pub Names by City
- Geolocation with PostgreSQL
- Geolocation Data Loading
- Geolocation Metadata
- Emergency Pub
- Counting Distinct Users with HyperLogLog
- HyperLogLog
- Installing postgresql-hll
- Counting Unique Tweet Visitors
- Lossy Unique Count with HLL
- Getting the Visits into Unique Counts
- Scheduling Estimates Computations
- Combining Unique Visitors
- An Interview with Craig Kerstiens

Chapter 50 Counting Distinct Users with HyperLogLog
         3 │ 2018-08-15 │ 3308 │ 3227 │ 81
         3 │ 2018-08-16 │ 3261 │ 3184 │ 77
(10 rows)
Lossy Unique Count with HLL
We can rewrite the previous query using our HLL data type now, even though at this stage it's not going to be very useful: we still have the full logs of every visit, so we can afford to compute precise counts.
Nonetheless, our goal is to dispose of the daily entries, which we anticipate will grow into too large a data set. So, the hll-based query looks like this:
      select messageid,
             datetime::date as date,
             # hll_add_agg(hll_hash_text(ipaddr::text)) as hll
        from tweet.visitor
       where messageid = 3
    group by grouping sets((messageid),
                           (messageid, date))
    order by messageid, date nulls first
       limit 10;
In this query we use several new functions and operators related to the hll data type:
- The # operator takes a single argument: it's a unary operator, like factorial (written !) for example. When applied to a value of type hll, this unary operator computes the estimated number of distinct entries stored in the hyperloglog set.
- The hll_add_agg() aggregate function accumulates new hashes into a given hyperloglog set.
- The hll_hash_text function computes the hyperloglog hash of a text value, here used with the IP address in its text form. We could also use the IP address as a 32-bit integer with the hll_hash_integer function instead, but then this wouldn't support IPv6 addresses, which only fit in a 128-bit number.
The notation # hll shows the level of flexibility that PostgreSQL brings to the table with its extensibility support. Not only can you define new operators at runtime from an extension, but those operators can also be unary or binary.
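To make that extensibility concrete, here is a minimal, hypothetical sketch of registering a unary (prefix) operator by hand. Everything in it — the @+ name and the abs_num function — is made up for illustration; extensions such as postgresql-hll do essentially the same thing in their installation scripts:

```sql
-- A hypothetical support function for our made-up operator
create function abs_num(numeric) returns numeric
  language sql immutable strict
as 'select abs($1)';

-- A unary (prefix) operator: only rightarg is given, no leftarg,
-- exactly like the # operator defined by the hll extension
create operator @+ (
  rightarg  = numeric,
  procedure = abs_num
);

select @+ (-42.5);
```

Once the operator is created, PostgreSQL's parser treats it like any built-in: the expression @+ (-42.5) evaluates abs_num(-42.5).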

The lossy distinct count result looks like this:
 messageid │    date    │       hll
═══════════╪════════════╪══════════════════
         3 │ 2018-08-07 │ 739.920627061887
         3 │ 2018-08-08 │ 3284.16386418662
         3 │ 2018-08-09 │ 3196.58757626223
         3 │ 2018-08-10 │ 3036.32707701154
         3 │ 2018-08-11 │ 3140.21704515932
         3 │ 2018-08-12 │ 3191.83031512197
         3 │ 2018-08-13 │ 3045.15467688584
         3 │ 2018-08-14 │ 3031.92750496513
         3 │ 2018-08-15 │ 3135.58879460201
         3 │ 2018-08-16 │ 3230.20146096767
(10 rows)
When used that way, the hll feature set doesn't make much sense: we still have to process as many rows as before, and we lose some precision in the result. The reason we've run that query here is twofold:

1. Demonstrate how to use the hll operators and functions in a query.
2. Show that the estimates from the hll data structures are pretty good, even at this low cardinality.
Getting the Visits into Unique Counts
In a production setup we would have the following context and constraints:
- Tweets are published and users from the Internet are visiting our tweets.
- Our application inserts a new row in tweet.visitor with the visitor's IP address each time there is a new visit to one of our tweets. It also registers the precise timestamp of the visit.
- As we anticipate quite some success for our little application idea, we also anticipate not being able to keep all the visitor logs, and not being able to respect our quality of service terms when computing the unique visitors on the fly each time someone needs them.
- Finally, as the numbers are used in a marketing context rather than an invoicing one, we are in a position to lose some precision, and we would actually like to implement a system that is lossy if it allows us to relax our storage and processing requirements.

The previous sections present a great tool for achieving the last point above, and now is the time to put hll to good use. From the tweet.visitor table we are now going to compute a single hyperloglog value per message and per day:
    begin;

    with new_visitors as
    (
         delete from tweet.visitor
          where id = any (select id
                            from tweet.visitor
                        order by datetime, messageid
                             for update skip locked
                           limit 1000)
      returning messageid,
                cast(datetime as date) as date,
                hll_hash_text(ipaddr::text) as visitors
    ),
      new_visitor_groups as
    (
        select messageid, date, hll_add_agg(visitors) as visitors
          from new_visitors
      group by messageid, date
    )
    insert into tweet.uniques
         select messageid, date, visitors
           from new_visitor_groups
    on conflict (messageid, date)
      do update
            set visitors = hll_union(uniques.visitors, excluded.visitors)
          where uniques.messageid = excluded.messageid
            and uniques.date = excluded.date
      returning messageid, date, # visitors as uniques;

    rollback;
This query is implemented in several stages thanks to the PostgreSQL support for writable common table expressions:
1. Compute new_visitors by deleting from the buffer table tweet.visitor a thousand rows at a time, using the skip locked facility that is new in PostgreSQL 9.5.

   By default, when attempting to delete a row that is already in use by another transaction doing either an update or a delete, PostgreSQL has to block until the other transaction releases its lock. With the skip locked clause, PostgreSQL instead omits the row from the current transaction, without incurring any locking or waiting.

   Rows skipped that way may appear in the next batch, or they may already be getting processed concurrently in another batch.

   This construct allows the query to be run in more than one transaction at the same time, which might in turn be useful if we ever accumulate some serious lag in our processing.
2. This first CTE of our query also computes the date from the timestamp with a cast expression, and the hll hash from the IP address, preparing for the next stage of processing.
3. Compute new_visitor_groups by aggregating the individual hll hashes we just computed into a single hll set per messageid and per date.
4. Finally, insert those per-message daily unique visitor hll sets into our summary table tweet.uniques. Of course, if we already computed a set for the same message and the same day before, we update the existing row instead, merging the existing and the new sets with hll_union.
5. Because PostgreSQL is such a powerful system, we of course return the result of processing the given batch at the end of the query, using the returning clause of the insert command.
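The batch-consumption pattern from step 1 is worth keeping in your toolbox on its own. Here is a minimal sketch of it against a hypothetical task_queue table (the table and column names are made up for illustration): each concurrent worker grabs a disjoint batch of rows, never waiting on locks held by its siblings.

```sql
begin;

-- Each worker claims up to 10 unclaimed rows; rows locked by
-- other workers are silently skipped rather than waited on.
select id, payload
  from task_queue
 order by id
   for update skip locked
 limit 10;

-- … process the selected rows, delete them, then commit
-- to release the locks and make the deletions durable …

commit;
```

Running the same transaction from several sessions at once shows the point: no two sessions ever see the same rows in their batch.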
The do update set clause requires that any single row in the target table be updated only once per command, in order to ensure that the conflict handling mechanism is deterministic. That's the reason why we prepare the hll sets in the new_visitor_groups CTE part of the query.
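To see why that pre-aggregation step matters, here is a hypothetical illustration (the literal IP values are made up): feeding two source rows with the same (messageid, date) into a single insert … on conflict do update command makes PostgreSQL reject the statement outright.

```sql
-- Two values rows target the same (messageid, date) conflict key,
-- so the do update clause would have to touch that row twice:
insert into tweet.uniques (messageid, date, visitors)
     values (3, date '2018-08-07',
             hll_add(hll_empty(), hll_hash_text('1.2.3.4'))),
            (3, date '2018-08-07',
             hll_add(hll_empty(), hll_hash_text('5.6.7.8')))
on conflict (messageid, date)
  do update set visitors = hll_union(uniques.visitors, excluded.visitors);

-- ERROR:  ON CONFLICT DO UPDATE command cannot affect row a second time
```

Grouping the hashes first, as the new_visitor_groups step does, guarantees at most one incoming row per conflict key, so the command stays deterministic.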
When running this query, we obtain the following result:
BEGIN
 messageid │    date    │     uniques
═══════════╪════════════╪══════════════════
         3 │ 2018-08-07 │ 739.920627061887
         3 │ 2018-08-08 │ 257.534468469694
(2 rows)
INSERT 0 2
ROLLBACK
Notice that we finish our script with a rollback command. That allows us to debug and refine the query until we're happy. While this 5-stage, 29-line SQL query isn't going to be too complex to maintain, thanks to its actions being well separated into CTEs, it still doesn't get written in a single session in a text file. It gets