- About…
- About the Book
- About the Author
- Acknowledgements
- About the organisation of the book
- Structured Query Language
- A First Use Case
- Loading the Data Set
- Application Code and SQL
- Back to Discovering SQL
- Computing Weekly Changes
- Software Architecture
- Why PostgreSQL?
- The PostgreSQL Documentation
- Getting Ready to read this Book
- Business Logic
- Every SQL query embeds some business logic
- Business Logic Applies to Use Cases
- Correctness
- Efficiency
- Stored Procedures — a Data Access API
- Procedural Code and Stored Procedures
- Where to Implement Business Logic?
- A Small Application
- Readme First Driven Development
- Chinook Database
- Top-N Artists by Genre
- Intro to psql
- The psqlrc Setup
- Transactions and psql Behavior
- Discovering a Schema
- Interactive Query Editor
- SQL is Code
- SQL style guidelines
- Comments
- Unit Tests
- Regression Tests
- A Closer Look
- Indexing Strategy
- Indexing for Queries
- Choosing Queries to Optimize
- PostgreSQL Index Access Methods
- Advanced Indexing
- Adding Indexes
- An Interview with Yohann Gabory
- Get Some Data
- Structured Query Language
- Queries, DML, DDL, TCL, DCL
- Select, From, Where
- Anatomy of a Select Statement
- Projection (output): Select
- Restrictions: Where
- Order By, Limit, No Offset
- Ordering with Order By
- kNN Ordering and GiST indexes
- Top-N sorts: Limit
- No Offset, and how to implement pagination
- Group By, Having, With, Union All
- Aggregates (aka Map/Reduce): Group By
- Aggregates Without a Group By
- Restrict Selected Groups: Having
- Grouping Sets
- Common Table Expressions: With
- Distinct On
- Result Sets Operations
- Understanding Nulls
- Three-Valued Logic
- Not Null Constraints
- Outer Joins Introducing Nulls
- Using Null in Applications
- Understanding Window Functions
- Windows and Frames
- Partitioning into Different Frames
- Available Window Functions
- When to Use Window Functions
- Relations
- SQL Join Types
- An Interview with Markus Winand
- Serialization and Deserialization
- Some Relational Theory
- Attribute Values, Data Domains and Data Types
- Consistency and Data Type Behavior
- PostgreSQL Data Types
- Boolean
- Character and Text
- Server Encoding and Client Encoding
- Numbers
- Floating Point Numbers
- Sequences and the Serial Pseudo Data Type
- Universally Unique Identifier: UUID
- Date/Time and Time Zones
- Time Intervals
- Date/Time Processing and Querying
- Network Address Types
- Denormalized Data Types
- Arrays
- Composite Types
- Enum
- PostgreSQL Extensions
- An Interview with Grégoire Hubert
- Object Relational Mapping
- Tooling for Database Modeling
- How to Write a Database Model
- Generating Random Data
- Modeling Example
- Normalization
- Data Structures and Algorithms
- Normal Forms
- Database Anomalies
- Modeling an Address Field
- Primary Keys
- Foreign Key Constraints
- Not Null Constraints
- Check Constraints and Domains
- Exclusion Constraints
- Practical Use Case: Geonames
- Features
- Countries
- Modelization Anti-Patterns
- Entity Attribute Values
- Multiple Values per Column
- UUIDs
- Denormalization
- Premature Optimization
- Functional Dependency Trade-Offs
- Denormalization with PostgreSQL
- Materialized Views
- History Tables and Audit Trails
- Validity Period as a Range
- Pre-Computed Values
- Enumerated Types
- Multiple Values per Attribute
- The Sparse Matrix Model
- Denormalize with Care
- Not Only SQL
- Schemaless Design in PostgreSQL
- Durability Trade-Offs
- Another Small Application
- Insert, Update, Delete
- Insert Into
- Insert Into … Select
- Update
- Inserting Some Tweets
- Delete
- Tuples and Rows
- Deleting All the Rows: Truncate
- Isolation and Locking
- About SSI
- Putting Concurrency to the Test
- Computing and Caching in SQL
- Views
- Materialized Views
- Triggers
- Transactional Event Driven Processing
- Trigger and Counters Anti-Pattern
- Fixing the Behavior
- Event Triggers
- Listen and Notify
- PostgreSQL Notifications
- Notifications and Cache Maintenance
- Listen and Notify Support in Drivers
- Batch Update, MoMA Collection
- Updating the Data
- Concurrency Patterns
- On Conflict Do Nothing
- An Interview with Kris Jenkins
- Installing and Using PostgreSQL Extensions
- Finding PostgreSQL Extensions
- A Short List of Noteworthy Extensions
- Auditing Changes with hstore
- Introduction to hstore
- Comparing hstores
- Auditing Changes with a Trigger
- Testing the Audit Trigger
- From hstore Back to a Regular Record
- Last.fm Million Song Dataset
- Using Trigrams For Typos
- The pg_trgm PostgreSQL Extension
- Trigrams, Similarity and Searches
- Complete and Suggest Song Titles
- Trigram Indexing
- Denormalizing Tags with intarray
- Advanced Tag Indexing
- User-Defined Tags Made Easy
- The Most Popular Pub Names
- A Pub Names Database
- Normalizing the Data
- Geolocating the Nearest Pub (k-NN search)
- How far is the nearest pub?
- The earthdistance PostgreSQL contrib
- Pubs and Cities
- The Most Popular Pub Names by City
- Geolocation with PostgreSQL
- Geolocation Data Loading
- Geolocation Metadata
- Emergency Pub
- Counting Distinct Users with HyperLogLog
- HyperLogLog
- Installing postgresql-hll
- Counting Unique Tweet Visitors
- Lossy Unique Count with HLL
- Getting the Visits into Unique Counts
- Scheduling Estimates Computations
- Combining Unique Visitors
- An Interview with Craig Kerstiens
50 Counting Distinct Users with HyperLogLog
If you’ve been following along at home and keeping up with newer developments in statistics, you might have heard about a state-of-the-art cardinality estimation algorithm called HyperLogLog.
This technique is now available for PostgreSQL in the extension postgresql-hll, available at https://github.com/citusdata/postgresql-hll, and is packaged for multiple operating systems such as Debian and RHEL through the PostgreSQL community packaging efforts and resources.
HyperLogLog
HyperLogLog is based on a very special kind of hash value. It aggregates enough information into a single scalar value to compute a distinct count, at the price of some precision loss.
Say we are counting unique visitors. With HyperLogLog we can maintain a single value per day, and then union those values together to obtain unique weekly or monthly visitor counts!
Here’s an example in SQL of the magic provided by the hll extension:
select to_char(date, 'YYYY/MM') as month,
       round(#hll_union_agg(users)) as monthly
  from daily_uniques
group by month;
While we are keeping daily aggregates on disk, we can use the HyperLogLog maths to union them together and compute an approximation of the monthly unique count from the same dataset!
  month  | monthly
---------+---------
 2013/02 | 1960380
(1 row)
So by keeping only a small amount of data per day, typically 1280 bytes, it is then possible to compute monthly unique counts from that, without having to scan a whole month of records again.
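The query above assumes a daily_uniques table that keeps one hll value per day. The extension’s documentation uses a layout along those lines; here’s a minimal sketch of it, where the exact constraints and defaults are my assumptions:

-- one hll summary of users per day
create table daily_uniques
 (
   date  date primary key,
   users hll not null default hll_empty()
 );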
Installing postgresql-hll
It’s as simple as create extension hll;, once the OS package is installed on your system. The extension provides a new data type named hll, and we can use \dx+ hll to discover what kind of support comes with it. Here’s an edited version of the output of the \dx+ hll command, where some lines have been filtered out of the 71 SQL objects:
Objects in extension "hll"
                     Object description
══════════════════════════════════════════════════════════════════
 cast from bigint to hll_hashval
 cast from bytea to hll
 cast from hll to hll
 cast from integer to hll_hashval
 function hll(hll,integer,boolean)
 function hll_add(hll,hll_hashval)
 function hll_add_agg(hll_hashval)
 function hll_add_agg(hll_hashval,integer)
 function hll_add_agg(hll_hashval,integer,integer)
 function hll_add_agg(hll_hashval,integer,integer,bigint)
 function hll_add_agg(hll_hashval,integer,integer,bigint,integer)
 function hll_add_rev(hll_hashval,hll)
 ...
 function hll_cardinality(hll)
 function hll_empty()
 function hll_eq(hll,hll)
 ...
 function hll_hash_any(anyelement,integer)
 function hll_hash_bigint(bigint,integer)
 function hll_hash_boolean(boolean,integer)
 function hll_hash_bytea(bytea,integer)
 function hll_hash_integer(integer,integer)
 function hll_hash_smallint(smallint,integer)
 function hll_hash_text(text,integer)
 ...
 operator #(NONE,hll)
 operator <>(hll,hll)
 operator <>(hll_hashval,hll_hashval)
 operator =(hll,hll)
 operator =(hll_hashval,hll_hashval)
 operator ||(hll,hll)
 operator ||(hll,hll_hashval)
 operator ||(hll_hashval,hll)
 type hll
 type hll_hashval
From that output we learn the list of hll operators, such as the interesting # operator, a unary operator that works on an hll value. More about this one later…
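As a quick preview, here’s a sketch of # at work: we build a tiny hll inline with hll_empty() and the || operator, then ask for its cardinality. It assumes the hash functions’ default seed:

-- add three hashed values, only two of which are distinct,
-- then estimate the cardinality with the unary # operator
select # (hll_empty()
       || hll_hash_text('10.0.0.1')
       || hll_hash_text('10.0.0.2')
       || hll_hash_text('10.0.0.1'));

At such a tiny cardinality the extension still stores the hashes explicitly, so the answer is exact (2); the precision loss only shows up on much larger sets.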
Counting Unique Tweet Visitors
As an example use case for the HyperLogLog data type, we are going to count unique visitors to our tweets, using the application we introduced in Data Manipulation and Concurrency Control.
The two main operations on an hll data type consist of the following, as sketched in SQL right after this list:
• Build a hash from an input value, such as an IP address.
• Update the already known hll value with the hash.
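Against the daily_uniques sketch from earlier, those two steps could look like the following; the literal IP address is of course just an example:

-- hash the input value, then fold the hash into the existing
-- hll value for today with the || operator
update daily_uniques
   set users = users || hll_hash_text('192.168.0.1')
 where date = current_date;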
The main idea behind hll is to keep a single hll value per granularity, here per tweet message and per day. This means that each time we have a new visit on a tweet, we want to UPDATE our hll value to count that visitor.
As we have seen in the previous chapter, concurrency is a deal breaker for UPDATE-heavy scenarios where the same row is solicited over and over again. So we are going to work in two steps again here, first doing an INSERT per visit and then arranging a background process to transform those visits into an UPDATE of the single hll aggregate per tweet and date (a sketch of that second step follows the table definition below).
Here’s the visitor table where we can insert every single visit:
create table tweet.visitor
 (
   id        bigserial primary key,
   messageid bigint not null references tweet.message(messageid),
   datetime  timestamptz not null default now(),
   ipaddr    ipaddress,

   unique(messageid, datetime, ipaddr)
 );
It’s a pretty simple structure, and is meant to register our online activity.
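As for the second step announced above, a background process can roll the raw visits up into a single hll value per message and per day. Here’s a minimal sketch of that idea; the tweet.uniques table name and its layout are illustrative assumptions, not part of the application’s schema so far:

-- one hll value per (messageid, date)
create table tweet.uniques
 (
   messageid bigint not null references tweet.message(messageid),
   date      date not null,
   visitors  hll not null default hll_empty(),

   primary key(messageid, date)
 );

-- aggregate each visit's hashed IP address into the day's hll
insert into tweet.uniques(messageid, date, visitors)
     select messageid,
            datetime::date,
            hll_add_agg(hll_hash_text(ipaddr::text))
       from tweet.visitor
   group by messageid, datetime::date
on conflict (messageid, date)
   do update set visitors = excluded.visitors;

A real job would also restrict the roll-up to recent visits, or delete the rows it has processed; as written, the query recomputes every group it finds in tweet.visitor.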
We can generate some tweet visits easily with a program such as the following. Again, I’m using Common Lisp to implement a very simple COPY-based loading program.
(defparameter *connspec* '("appdev" "dim" nil "localhost"))
(defparameter *visitor-table* "tweet.visitor")
(defparameter *visitor-columns* '("messageid" "ipaddr" "datetime"))

(defun insert-visitors (messageid n &optional (connspec *connspec*))
  (pomo:with-connection connspec
    (let ((count 0)
          (copier (open-db-writer connspec *visitor-table* *visitor-columns*)))
      (unwind-protect
           (loop :for i :below n
                 :do (let ((ipaddr   (generate-ipaddress))
                           (datetime (format nil "~a" (generate-timestamp))))
                       (db-write-row copier (list messageid ipaddr datetime))
                       (incf count)))
        (close-db-writer copier))

      ;; and return the number of rows copied
      count)))
The script is written so as to target a smallish range of IP addresses and a narrow range of dates, in order to generate collisions: some IP addresses will visit the same message more than once.
;;;
;;; select '192.168.0.0'::ip4::bigint; == 3232235520
;;;
(defparameter *ip-range-start* 3232235520)
(defparameter *ip-range-size* (expt 2 16))

(defun generate-ipaddress (&optional
                             (range-size *ip-range-size*)
                             (range-start *ip-range-start*))
  "Generate a random IP address, as a string."
  (int-to-ip (+ range-start (random range-size))))

(defun generate-timestamp ()
  "Generate a random timestamp between now and a month ago."
  (local-time:timestamp- (local-time:now) (random #.(* 24 60 31)) :minute))
When generating data with those functions, we pick the subnet 192.168.0.0/16 and a span of a month of data. Here’s how to interactively generate 100,000 visits from the Common Lisp REPL, measuring the time it takes:
CL-USER> (time (shakes::insert-visitors 3 100000))
(SHAKES::INSERT-VISITORS 3 100000)
took 7,513,209 microseconds (7.513209 seconds) to run.
244,590 microseconds (0.244590 seconds, 3.26%) of which was spent in GC.
During that period, and with 4 available CPU cores,
5,242,334 microseconds (5.242334 seconds) were spent in user mode
314,728 microseconds (0.314728 seconds) were spent in system mode
691,153,296 bytes of memory allocated.
770 minor page faults, 0 major page faults, 0 swaps.
100000
Thanks to the COPY streaming protocol, we can mix generating the data with communicating with the PostgreSQL server, and have our hundred thousand visits loaded into the database in less than 8 seconds on my laptop. That’s certainly fast enough for interactive discovery of a data model: with PostgreSQL it’s quite easy to just try it and see.
We can check the result of inserting 100,000 visits for messageid 3 with the following query:
select messageid,
       datetime::date as date,
       count(*) as count,
       count(distinct ipaddr) as uniques,
       count(*) - count(distinct ipaddr) as duplicates
  from tweet.visitor
 where messageid = 3
group by messageid, date
order by messageid, date
limit 10;
We have a precise count of all the visitors to the message, and we can see that even with a 16-bit range of IP addresses, we already have several visits from the same addresses.
 messageid │    date    │ count │ uniques │ duplicates
═══════════╪════════════╪═══════╪═════════╪════════════
         3 │ 2018-08-07 │   746 │     742 │          4
         3 │ 2018-08-08 │  3298 │    3211 │         87
         3 │ 2018-08-09 │  3260 │    3191 │         69
         3 │ 2018-08-10 │  3156 │    3077 │         79
         3 │ 2018-08-11 │  3241 │    3161 │         80
         3 │ 2018-08-12 │  3270 │    3197 │         73
         3 │ 2018-08-13 │  3182 │    3106 │         76
         3 │ 2018-08-14 │  3199 │    3124 │         75