

46
Denormalizing Tags with intarray
Handling user-defined tags can be challenging in SQL when it comes to allowing advanced user queries. To illustrate the point here, we’re going to index and search for Last.fm tracks that are tagged as blues and rhythm and blues.
The Last.fm dataset from the Million Song Dataset project gives us a reusable data set that is full of tracks and their user tags.
Advanced Tag Indexing
PostgreSQL comes with plenty of interesting data types, and one of them is the array type. PostgreSQL also provides a very rich set of extensions, some of them found under the contrib package; one of them is intarray. Let me quote the most interesting part of the documentation for that extension:
The @@ and ~~ operators test whether an array satisfies a query, which is expressed as a value of a specialized data type query_int. A query consists of integer values that are checked against the elements of the array, possibly combined using the operators & (AND), | (OR), and ! (NOT). Parentheses can be used as needed. For example, the query 1&(2|3) matches arrays that contain 1 and also contain either 2 or 3.
create extension intarray;
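To make the quoted matching rule concrete, here is a minimal sketch that runs the documentation's example query against two literal arrays. The array values and column aliases are made up for illustration; the only requirement is that the intarray extension is installed.

```sql
-- Requires the intarray extension created above.
-- The literal arrays and the column aliases are illustrative only.
select '{1,2,5}'::int[] @@ '1&(2|3)'::query_int as has_1_and_2_or_3,  -- true: contains 1 and 2
       '{1,4}'::int[]   @@ '1&(2|3)'::query_int as lacks_2_and_3;     -- false: contains neither 2 nor 3
```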

Chapter 46 Denormalizing Tags with intarray j 387
Given the way the intarray extension works, we need to build a new table that contains, for each track, the list of tags associated with it as an array of integers. We’re going to use our rowid identifier for that purpose, as in the following query:
select tt.tid, array_agg(tags.rowid) as tags
  from      tags
       join tid_tag tt
         on tags.rowid = tt.tag
group by tt.tid
   limit 3;
And here are our first three songs with tags as numbers rather than strings:
 tid │   tags
═════╪═══════════
   1 │ {1,2}
   2 │ {3,4}
   3 │ {5,6,7,8}
(3 rows)
We might not want to redo this conversion from tag text to an array of numbers for every title each time we search, so we can cache the result in a materialized view instead:
begin;

create view lastfm.v_track_tags as
   select tt.tid, array_agg(tags.rowid) as tags
     from tags join tid_tag tt on tags.rowid = tt.tag
 group by tt.tid;

create materialized view lastfm.track_tags as
   select tid, tags
     from v_track_tags;

create index on track_tags using gin(tags gin__int_ops);

commit;
Given this materialized view, we are going to be able to do advanced indexing and searching of the user-provided tags. As you can see in the previous SQL script, we have indexed our materialized view with a special index operator class, gin__int_ops, which lets us benefit from the intarray advanced querying.
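A materialized view is a cache, so it does not follow later changes to the tags on its own. A minimal sketch of how it could be kept current is shown here. The unique index is an assumption added for this example (it is not part of the script above); PostgreSQL requires one before it accepts the concurrently option.

```sql
-- Assumption: a unique index on tid, added here only because
-- "refresh ... concurrently" requires a unique index on the view.
-- tid is unique in the view since it is the group by key.
create unique index on lastfm.track_tags(tid);

-- Recompute the cached tag arrays without blocking concurrent readers.
refresh materialized view concurrently lastfm.track_tags;
```

A plain refresh materialized view lastfm.track_tags; also works and needs no unique index, at the cost of locking out readers while it runs.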

Searches
Now we are ready for the real magic. Let’s find all the tracks we have that have been tagged as both blues and rhythm and blues. We start by fetching the identifiers of those two tags:
select array_agg(rowid)
  from tags
 where tag = 'blues' or tag = 'rhythm and blues';
That query gives the following result, which might not seem very interesting at first:
 array_agg
═══════════
 {3,739}
(1 row)
The intarray PostgreSQL extension implements a special kind of query string, named query_int. It looks like '(1880&179879)' and it supports the three logic operators not, and, and or, that you can combine in your queries.
As we want our tag search queries to be dynamically provided by our users, we are going to build the query_int string from the tags table itself:
select format('(%s)',
         string_agg(rowid::text, '&')
       )::query_int as query
  from tags
 where tag = 'blues' or tag = 'rhythm and blues';
This query uses the format PostgreSQL function to build a string for us, here putting our intermediate result inside parentheses. The intermediate result is obtained with string_agg, which aggregates text values together, using a separator in between them. Usually the separator would be a comma or a semicolon. Here we are preparing a query_int string, and we’re going to search for all the tracks that have been tagged both blues and rhythm and blues, so we’re using the and operator, written &:
  query
═════════
 3 & 739
(1 row)
This query lets us inject as many tags as we want, so it’s easy to use as a template from within an application where the user provides the list of tags. The intarray extension’s query format also accepts other operators (or and not) as we saw before, so if you want to offer those to your users you would need to tweak the query_int building part of the SQL.
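As a sketch of what such a tweak could look like, the two queries here build an or query and an and-not query from the same tags table. They are illustrations rather than part of the original script. The second one assumes each tag text appears only once in tags, since a scalar subquery fails when it returns more than one row, and an unknown tag would produce an empty operand that the cast to query_int rejects.

```sql
-- Tracks tagged with either tag: same template, | (OR) as the separator.
select format('(%s)',
         string_agg(rowid::text, '|')
       )::query_int as query
  from tags
 where tag = 'blues' or tag = 'rhythm and blues';

-- Tracks tagged blues but not rhythm and blues, using ! (NOT).
-- Assumption: each tag text is unique in the tags table.
select format('(%s & !%s)',
         (select rowid from tags where tag = 'blues'),
         (select rowid from tags where tag = 'rhythm and blues')
       )::query_int as query;
```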
Now, how many tracks have been tagged with both the blues and the rhythm and blues tags, you might be asking:
with t(query) as (
   select format('(%s)',
            array_to_string(array_agg(rowid), '&')
          )::query_int as query
     from tags
    where tag = 'blues' or tag = 'rhythm and blues'
)
select count(*)
  from track_tags join t on tags @@ query;
As you can see, we use the query template from above in a common table expression and then inject it into the final SQL query as a join restriction over the track_tags table.
 count
═══════
  2278
(1 row)
We have 2278 tracks tagged with both the blues and rhythm and blues tags.
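To check whether such a search can use the GIN index built on the materialized view, one option is to look at the query plan. This sketch hard-codes the two tag identifiers found earlier (3 and 739) instead of using the common table expression. The plan is not reproduced here, because the planner's choice depends on the table statistics and the PostgreSQL version in use.

```sql
-- Show the plan for a tag search against the materialized view.
-- The identifiers 3 and 739 are the rowids returned for our two tags above.
explain (costs off)
 select count(*)
   from track_tags
  where tags @@ '3&739'::query_int;
```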
Now of course you might want to fetch some track meta-data, but here the only one we have is the track hash id:
with t(query) as (
   select format('(%s)',
            array_to_string(array_agg(rowid), '&')
          )::query_int as query
     from tags
    where tag = 'blues' or tag = 'rhythm and blues'
)
   select track.tid,
          left(track.artist, 26)
          || case when length(track.artist) > 26 then '…' else '' end
          as artist,
          left(track.title, 26)
          || case when length(track.title) > 26 then '…' else '' end
          as title
     from track_tags tt
          join tids on tt.tid = tids.rowid
          join t on tt.tags @@ t.query
          join lastfm.track on tids.tid = track.tid
 order by artist;
That gives us the following result: