
- About…
- About the Book
- About the Author
- Acknowledgements
- About the organisation of the books
- Structured Query Language
- A First Use Case
- Loading the Data Set
- Application Code and SQL
- Back to Discovering SQL
- Computing Weekly Changes
- Software Architecture
- Why PostgreSQL?
- The PostgreSQL Documentation
- Getting Ready to read this Book
- Business Logic
- Every SQL query embeds some business logic
- Business Logic Applies to Use Cases
- Correctness
- Efficiency
- Stored Procedures — a Data Access API
- Procedural Code and Stored Procedures
- Where to Implement Business Logic?
- A Small Application
- Readme First Driven Development
- Chinook Database
- Top-N Artists by Genre
- Intro to psql
- The psqlrc Setup
- Transactions and psql Behavior
- Discovering a Schema
- Interactive Query Editor
- SQL is Code
- SQL style guidelines
- Comments
- Unit Tests
- Regression Tests
- A Closer Look
- Indexing Strategy
- Indexing for Queries
- Choosing Queries to Optimize
- PostgreSQL Index Access Methods
- Advanced Indexing
- Adding Indexes
- An Interview with Yohann Gabory
- Get Some Data
- Structured Query Language
- Queries, DML, DDL, TCL, DCL
- Select, From, Where
- Anatomy of a Select Statement
- Projection (output): Select
- Restrictions: Where
- Order By, Limit, No Offset
- Ordering with Order By
- kNN Ordering and GiST indexes
- Top-N sorts: Limit
- No Offset, and how to implement pagination
- Group By, Having, With, Union All
- Aggregates (aka Map/Reduce): Group By
- Aggregates Without a Group By
- Restrict Selected Groups: Having
- Grouping Sets
- Common Table Expressions: With
- Distinct On
- Result Sets Operations
- Understanding Nulls
- Three-Valued Logic
- Not Null Constraints
- Outer Joins Introducing Nulls
- Using Null in Applications
- Understanding Window Functions
- Windows and Frames
- Partitioning into Different Frames
- Available Window Functions
- When to Use Window Functions
- Relations
- SQL Join Types
- An Interview with Markus Winand
- Serialization and Deserialization
- Some Relational Theory
- Attribute Values, Data Domains and Data Types
- Consistency and Data Type Behavior
- PostgreSQL Data Types
- Boolean
- Character and Text
- Server Encoding and Client Encoding
- Numbers
- Floating Point Numbers
- Sequences and the Serial Pseudo Data Type
- Universally Unique Identifier: UUID
- Date/Time and Time Zones
- Time Intervals
- Date/Time Processing and Querying
- Network Address Types
- Denormalized Data Types
- Arrays
- Composite Types
- Enum
- PostgreSQL Extensions
- An interview with Grégoire Hubert
- Object Relational Mapping
- Tooling for Database Modeling
- How to Write a Database Model
- Generating Random Data
- Modeling Example
- Normalization
- Data Structures and Algorithms
- Normal Forms
- Database Anomalies
- Modeling an Address Field
- Primary Keys
- Foreign Keys Constraints
- Not Null Constraints
- Check Constraints and Domains
- Exclusion Constraints
- Practical Use Case: Geonames
- Features
- Countries
- Modelization Anti-Patterns
- Entity Attribute Values
- Multiple Values per Column
- UUIDs
- Denormalization
- Premature Optimization
- Functional Dependency Trade-Offs
- Denormalization with PostgreSQL
- Materialized Views
- History Tables and Audit Trails
- Validity Period as a Range
- Pre-Computed Values
- Enumerated Types
- Multiple Values per Attribute
- The Spare Matrix Model
- Denormalize with Care
- Not Only SQL
- Schemaless Design in PostgreSQL
- Durability Trade-Offs
- Another Small Application
- Insert, Update, Delete
- Insert Into
- Insert Into … Select
- Update
- Inserting Some Tweets
- Delete
- Tuples and Rows
- Deleting All the Rows: Truncate
- Isolation and Locking
- About SSI
- Putting Concurrency to the Test
- Computing and Caching in SQL
- Views
- Materialized Views
- Triggers
- Transactional Event Driven Processing
- Trigger and Counters Anti-Pattern
- Fixing the Behavior
- Event Triggers
- Listen and Notify
- PostgreSQL Notifications
- Notifications and Cache Maintenance
- Listen and Notify Support in Drivers
- Batch Update, MoMA Collection
- Updating the Data
- Concurrency Patterns
- On Conflict Do Nothing
- An Interview with Kris Jenkins
- Installing and Using PostgreSQL Extensions
- Finding PostgreSQL Extensions
- A Short List of Noteworthy Extensions
- Auditing Changes with hstore
- Introduction to hstore
- Comparing hstores
- Auditing Changes with a Trigger
- Testing the Audit Trigger
- From hstore Back to a Regular Record
- Last.fm Million Song Dataset
- Using Trigrams For Typos
- The pg_trgm PostgreSQL Extension
- Trigrams, Similarity and Searches
- Complete and Suggest Song Titles
- Trigram Indexing
- Denormalizing Tags with intarray
- Advanced Tag Indexing
- User-Defined Tags Made Easy
- The Most Popular Pub Names
- A Pub Names Database
- Normalizing the Data
- Geolocating the Nearest Pub (k-NN search)
- How far is the nearest pub?
- The earthdistance PostgreSQL contrib
- Pubs and Cities
- The Most Popular Pub Names by City
- Geolocation with PostgreSQL
- Geolocation Data Loading
- Geolocation Metadata
- Emergency Pub
- Counting Distinct Users with HyperLogLog
- HyperLogLog
- Installing postgresql-hll
- Counting Unique Tweet Visitors
- Lossy Unique Count with HLL
- Getting the Visits into Unique Counts
- Scheduling Estimates Computations
- Combining Unique Visitors
- An Interview with Craig Kerstiens

set, having this processing feature set right in PostgreSQL makes sense in several use cases:
• It makes sense when the SQL logic or filtering you want to implement depends on the result of the processing (e.g. grouping by week).
• When you have several applications using the same logic, it's often easier to share a SQL query than to set up a distributed service API offering the same result in XML or JSON (a data format you then have to parse).
• When you want to reduce your run-time dependencies, it's a good idea to understand how much each architecture layer is able to support in your implementation.
Network Address Types
PostgreSQL includes support for the cidr, inet, and macaddr data types. Again, those types are bundled with indexing support and advanced functions and operator support.
The PostgreSQL documentation chapters entitled Network Address Types and Network Address Functions and Operators cover network address types.
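Before loading real data, here's a quick standalone sketch of how these types behave — the addresses are made up for illustration, not taken from the data set below. Input is validated at parse time, and network-aware operators come for free, such as <<= which reads "is contained within or equals":

  -- inet holds a host address, cidr a network, macaddr a MAC address;
  -- <<= tests whether the address belongs to the given network
  select inet '192.168.1.10' <<= cidr '192.168.1.0/24' as in_subnet,
         macaddr '08:00:2b:01:02:03'                   as mac;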
Web server logs are a classic source of data to process where we find network address types, and The Honeynet Project has some free samples for us to play with. This time we're using the Scan 34 entry. Here's how to load the sample data set, once cleaned into a proper CSV file:
begin;

drop table if exists access_log;

create table access_log
 (
   ip      inet,
   ts      timestamptz,
   request text,
   status  integer
 );

\copy access_log from 'access.csv' with csv delimiter ';'

commit;

The script used to cleanse the original data into a CSV file that PostgreSQL is happy with implements a pretty simple transformation from
211.141.115.145 - - [13/Mar/2005:04:10:18 -0500] "GET / HTTP/1.1" 403 2898 "-" "Mozill
into
"211.141.115.145";"2005-05-13 04:10:18 -0500";"GET / HTTP/1.1";"403"
Being mostly interested in network address types, the transformation from the Apache access log format to CSV is lossy here: we keep only the fields we might be interested in.
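The cleansing script itself isn't reproduced in these pages. As a rough sketch of the same lossy transformation done entirely in SQL — the raw_access staging table, its column name, and the regular expression are assumptions for illustration, not the book's actual script — you could load the raw lines and parse them with regexp_match():

  -- hypothetical staging table: one raw Apache log line per row
  create table if not exists raw_access(line text);

  -- keep only the fields we care about: ip, timestamp, request, status;
  -- the timezone offset in the log is ignored here for brevity
  with parsed as
   (
     select regexp_match(
              line,
              '^(\S+) \S+ \S+ \[([^ \]]+) [^\]]*\] "([^"]*)" (\d+)'
            ) as m
       from raw_access
   )
  insert into access_log(ip, ts, request, status)
       select m[1]::inet,
              to_timestamp(m[2], 'DD/Mon/YYYY:HH24:MI:SS'),
              m[3],
              m[4]::integer
         from parsed
        where m is not null;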
One of the things the PostgreSQL inet data type makes possible is an analysis of the /24 networks found in the logs.
To enable that analysis, we can use the set_masklen() function, which transforms an IP address into an arbitrary CIDR network address:
select distinct on (ip)
       ip,
       set_masklen(ip, 24) as inet_24,
       set_masklen(ip::cidr, 24) as cidr_24
  from access_log
 limit 10;
And we can see that if we keep the data type as inet, we still get the full IP address with the /24 network notation added. To get the .0/24 notation, we need to use cidr:
      ip       │     inet_24      │     cidr_24
═══════════════╪══════════════════╪═════════════════
 4.35.221.243  │ 4.35.221.243/24  │ 4.35.221.0/24
 4.152.207.126 │ 4.152.207.126/24 │ 4.152.207.0/24
 4.152.207.238 │ 4.152.207.238/24 │ 4.152.207.0/24
 4.249.111.162 │ 4.249.111.162/24 │ 4.249.111.0/24
 12.1.223.132  │ 12.1.223.132/24  │ 12.1.223.0/24
 12.8.192.60   │ 12.8.192.60/24   │ 12.8.192.0/24
 12.33.114.7   │ 12.33.114.7/24   │ 12.33.114.0/24
 12.47.120.130 │ 12.47.120.130/24 │ 12.47.120.0/24
 12.172.137.4  │ 12.172.137.4/24  │ 12.172.137.0/24
 18.194.1.122  │ 18.194.1.122/24  │ 18.194.1.0/24
(10 rows)
Of course, you could analyze networks other than /24:
select distinct on (ip)
       ip,
       set_masklen(ip::cidr, 27) as cidr_27,
       set_masklen(ip::cidr, 28) as cidr_28
  from access_log
 limit 10;
This computes the proper starting IP address for each CIDR network for us, of course. After all, what's the point of using proper data types if not for advanced processing?
      ip       │     cidr_27      │     cidr_28
═══════════════╪══════════════════╪══════════════════
 4.35.221.243  │ 4.35.221.224/27  │ 4.35.221.240/28
 4.152.207.126 │ 4.152.207.96/27  │ 4.152.207.112/28
 4.152.207.238 │ 4.152.207.224/27 │ 4.152.207.224/28
 4.249.111.162 │ 4.249.111.160/27 │ 4.249.111.160/28
 12.1.223.132  │ 12.1.223.128/27  │ 12.1.223.128/28
 12.8.192.60   │ 12.8.192.32/27   │ 12.8.192.48/28
 12.33.114.7   │ 12.33.114.0/27   │ 12.33.114.0/28
 12.47.120.130 │ 12.47.120.128/27 │ 12.47.120.128/28
 12.172.137.4  │ 12.172.137.0/27  │ 12.172.137.0/28
 18.194.1.122  │ 18.194.1.96/27   │ 18.194.1.112/28
(10 rows)
Equipped with this set_masklen() function, it's now easy to analyze our access logs using arbitrary CIDR network definitions.
select set_masklen(ip::cidr, 24) as network,
       count(*) as requests,
       array_length(array_agg(distinct ip), 1) as ipcount
  from access_log
 group by network
having array_length(array_agg(distinct ip), 1) > 1
 order by requests desc, ipcount desc;
In our case, we get the following result:
     network      │ requests │ ipcount
══════════════════╪══════════╪═════════
 4.152.207.0/24   │      140 │       2
 222.95.35.0/24   │       59 │       2
 211.59.0.0/24    │       32 │       2
 61.10.7.0/24     │       25 │      25
 222.166.160.0/24 │       25 │      24
 219.153.10.0/24  │        7 │       3
 218.78.209.0/24  │        6 │       4
 193.109.122.0/24 │        5 │       5
 204.102.106.0/24 │        3 │       3
 66.134.74.0/24   │        2 │       2
 219.133.137.0/24 │        2 │       2
 61.180.25.0/24   │        2 │       2
(12 rows)

Ranges
Range types are a unique feature of PostgreSQL, managing two dimensions of data in a single column and allowing advanced processing. The main example is the daterange data type, which stores the lower and upper bounds of a range of dates as a single value. This allows PostgreSQL to implement a concurrency-safe check against overlapping ranges, as we're going to see in the next example.
As usual, read the PostgreSQL documentation chapters with the titles Range Types and Range Functions and Operators for complete information.
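As a first taste of the notation — with made-up literals, not the data set we're about to load — ranges default to the half-open [) form, && tests overlap, and @> tests containment:

  -- both columns return true: the ranges overlap, and the date
  -- falls inside the first range
  select daterange('2017-05-01', '2017-05-08')
      && daterange('2017-05-05', '2017-05-10') as overlaps,
         daterange('2017-05-01', '2017-05-08')
      @> date '2017-05-03'                     as contains;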
The International Monetary Fund publishes exchange rate archives by month for lots of currencies. An exchange rate is relevant from its publication until the next rate is published, which makes a very good use case for our PostgreSQL range types.
The following SQL script is the main part of the ELT script that has been used for this book. Only missing from this book's pages is the transformation script that pivots the available TSV file into the more interesting format we use here:
begin;

create schema if not exists raw;

-- Must be run as a Super User in your database instance
-- create extension if not exists btree_gist;

drop table if exists raw.rates, rates;

create table raw.rates
 (
   currency text,
   date     date,
   rate     numeric
 );

\copy raw.rates from 'rates.csv' with csv delimiter ';'

create table rates
 (
   currency text,
   validity daterange,
   rate     numeric,

   exclude using gist (currency with =,
                       validity with &&)
 );

insert into rates(currency, validity, rate)
     select currency,
            daterange(date,
                      lead(date) over(partition by currency
                                         order by date),
                      '[)'
                     )
              as validity,
            rate
       from raw.rates
      order by date;

commit;
In this SQL script, we first create a target table for loading the CSV file. The file contains lines with a currency name, a date of publication, and a rate as a numeric value. Once the data is loaded into this table, we can transform it into something more interesting to work with from an application: the rates table.
The rates table registers the rate value for a currency and a validity period, and uses an exclusion constraint that guarantees non-overlapping validity periods for any given currency:
exclude using gist (currency with =, validity with &&)
This expression reads: exclude any tuple where the currency is equal (=) to an existing currency in our table AND where the validity overlaps with (&&) any existing validity in our table. This exclusion constraint is implemented in PostgreSQL using a GiST index.
By default, GiST in PostgreSQL doesn't support the one-dimensional data types that are meant to be covered by B-tree indexes. With exclusion constraints though, it's very interesting to extend GiST support to one-dimensional data types, and so we install the btree_gist extension, provided in the PostgreSQL contrib package.
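To see the constraint at work, here's a hypothetical insert, assuming the Euro rates shown later in this section are already loaded. Its validity period overlaps an existing row, so PostgreSQL rejects it:

  -- overlaps the existing [2017-05-02,2017-05-03) Euro row, so it fails
  -- with: ERROR:  conflicting key value violates exclusion constraint
  -- (the reported constraint name depends on your schema)
  insert into rates(currency, validity, rate)
       values ('Euro', daterange('2017-05-02', '2017-05-04'), 1.25);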
The script then fills in the rates table from the raw.rates table we imported in the previous step. The query uses the lead() window function to implement the specification spelled out in English earlier: an exchange rate is relevant from its publication until the next rate is published.
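Note that lead() returns null on the last row of each partition, and daterange treats a null upper bound as unbounded, so each currency's most recent rate remains valid indefinitely. A minimal sketch of that behavior, with made-up dates:

  -- a null upper bound makes the range unbounded on that side,
  -- so the latest rate matches any future date (returns true)
  select daterange(date '2017-05-15', null, '[)') @> date '2042-01-01';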
Here’s how the data looks, with the following query targeting Euro rates:
select currency, validity, rate
  from rates
 where currency = 'Euro'
 order by validity
 limit 10;
We can see that the validity is a range of dates, and the standard output for this type is a half-open range, which includes the first entry and excludes the second one:
 currency │        validity         │   rate
══════════╪═════════════════════════╪══════════
 Euro     │ [2017-05-02,2017-05-03) │ 1.254600
 Euro     │ [2017-05-03,2017-05-04) │ 1.254030
 Euro     │ [2017-05-04,2017-05-05) │ 1.252780
 Euro     │ [2017-05-05,2017-05-08) │ 1.250510
 Euro     │ [2017-05-08,2017-05-09) │ 1.252880
 Euro     │ [2017-05-09,2017-05-10) │ 1.255280
 Euro     │ [2017-05-10,2017-05-11) │ 1.255300
 Euro     │ [2017-05-11,2017-05-12) │ 1.257320
 Euro     │ [2017-05-12,2017-05-15) │ 1.255530
 Euro     │ [2017-05-15,2017-05-16) │ 1.248960
(10 rows)
Having this data set with the exclusion constraint means that we know we have at most a single rate available at any point in time, which allows an application needing the rate for a specific time to write the following query:
select rate
  from rates
 where currency = 'Euro'
   and validity @> date '2017-05-18';
The operator @> reads contains, and PostgreSQL uses the exclusion constraint's index to solve that query efficiently:
   rate
══════════
 1.240740
(1 row)