- About…
- About the Book
- About the Author
- Acknowledgements
- About the organisation of the book
- Structured Query Language
- A First Use Case
- Loading the Data Set
- Application Code and SQL
- Back to Discovering SQL
- Computing Weekly Changes
- Software Architecture
- Why PostgreSQL?
- The PostgreSQL Documentation
- Getting Ready to read this Book
- Business Logic
- Every SQL query embeds some business logic
- Business Logic Applies to Use Cases
- Correctness
- Efficiency
- Stored Procedures — a Data Access API
- Procedural Code and Stored Procedures
- Where to Implement Business Logic?
- A Small Application
- Readme First Driven Development
- Chinook Database
- Top-N Artists by Genre
- Intro to psql
- The psqlrc Setup
- Transactions and psql Behavior
- Discovering a Schema
- Interactive Query Editor
- SQL is Code
- SQL style guidelines
- Comments
- Unit Tests
- Regression Tests
- A Closer Look
- Indexing Strategy
- Indexing for Queries
- Choosing Queries to Optimize
- PostgreSQL Index Access Methods
- Advanced Indexing
- Adding Indexes
- An Interview with Yohann Gabory
- Get Some Data
- Structured Query Language
- Queries, DML, DDL, TCL, DCL
- Select, From, Where
- Anatomy of a Select Statement
- Projection (output): Select
- Restrictions: Where
- Order By, Limit, No Offset
- Ordering with Order By
- kNN Ordering and GiST indexes
- Top-N sorts: Limit
- No Offset, and how to implement pagination
- Group By, Having, With, Union All
- Aggregates (aka Map/Reduce): Group By
- Aggregates Without a Group By
- Restrict Selected Groups: Having
- Grouping Sets
- Common Table Expressions: With
- Distinct On
- Result Sets Operations
- Understanding Nulls
- Three-Valued Logic
- Not Null Constraints
- Outer Joins Introducing Nulls
- Using Null in Applications
- Understanding Window Functions
- Windows and Frames
- Partitioning into Different Frames
- Available Window Functions
- When to Use Window Functions
- Relations
- SQL Join Types
- An Interview with Markus Winand
- Serialization and Deserialization
- Some Relational Theory
- Attribute Values, Data Domains and Data Types
- Consistency and Data Type Behavior
- PostgreSQL Data Types
- Boolean
- Character and Text
- Server Encoding and Client Encoding
- Numbers
- Floating Point Numbers
- Sequences and the Serial Pseudo Data Type
- Universally Unique Identifier: UUID
- Date/Time and Time Zones
- Time Intervals
- Date/Time Processing and Querying
- Network Address Types
- Denormalized Data Types
- Arrays
- Composite Types
- Enum
- PostgreSQL Extensions
- An Interview with Grégoire Hubert
- Object Relational Mapping
- Tooling for Database Modeling
- How to Write a Database Model
- Generating Random Data
- Modeling Example
- Normalization
- Data Structures and Algorithms
- Normal Forms
- Database Anomalies
- Modeling an Address Field
- Primary Keys
- Foreign Keys Constraints
- Not Null Constraints
- Check Constraints and Domains
- Exclusion Constraints
- Practical Use Case: Geonames
- Features
- Countries
- Modelization Anti-Patterns
- Entity Attribute Values
- Multiple Values per Column
- UUIDs
- Denormalization
- Premature Optimization
- Functional Dependency Trade-Offs
- Denormalization with PostgreSQL
- Materialized Views
- History Tables and Audit Trails
- Validity Period as a Range
- Pre-Computed Values
- Enumerated Types
- Multiple Values per Attribute
- The Sparse Matrix Model
- Denormalize with Care
- Not Only SQL
- Schemaless Design in PostgreSQL
- Durability Trade-Offs
- Another Small Application
- Insert, Update, Delete
- Insert Into
- Insert Into … Select
- Update
- Inserting Some Tweets
- Delete
- Tuples and Rows
- Deleting All the Rows: Truncate
- Isolation and Locking
- About SSI
- Putting Concurrency to the Test
- Computing and Caching in SQL
- Views
- Materialized Views
- Triggers
- Transactional Event Driven Processing
- Trigger and Counters Anti-Pattern
- Fixing the Behavior
- Event Triggers
- Listen and Notify
- PostgreSQL Notifications
- Notifications and Cache Maintenance
- Listen and Notify Support in Drivers
- Batch Update, MoMA Collection
- Updating the Data
- Concurrency Patterns
- On Conflict Do Nothing
- An Interview with Kris Jenkins
- Installing and Using PostgreSQL Extensions
- Finding PostgreSQL Extensions
- A Short List of Noteworthy Extensions
- Auditing Changes with hstore
- Introduction to hstore
- Comparing hstores
- Auditing Changes with a Trigger
- Testing the Audit Trigger
- From hstore Back to a Regular Record
- Last.fm Million Song Dataset
- Using Trigrams For Typos
- The pg_trgm PostgreSQL Extension
- Trigrams, Similarity and Searches
- Complete and Suggest Song Titles
- Trigram Indexing
- Denormalizing Tags with intarray
- Advanced Tag Indexing
- User-Defined Tags Made Easy
- The Most Popular Pub Names
- A Pub Names Database
- Normalizing the Data
- Geolocating the Nearest Pub (k-NN search)
- How far is the nearest pub?
- The earthdistance PostgreSQL contrib
- Pubs and Cities
- The Most Popular Pub Names by City
- Geolocation with PostgreSQL
- Geolocation Data Loading
- Geolocation Metadata
- Emergency Pub
- Counting Distinct Users with HyperLogLog
- HyperLogLog
- Installing postgresql-hll
- Counting Unique Tweet Visitors
- Lossy Unique Count with HLL
- Getting the Visits into Unique Counts
- Scheduling Estimates Computations
- Combining Unique Visitors
- An Interview with Craig Kerstiens
Chapter 34: Another Small Application
In a previous chapter, when introducing arrays, we used a dataset of 200,000 USA geolocated tweets with a very simple data model. The data model is a direct port of the Excel sheet format, allowing a straightforward loading process: we used the \copy command from psql.
```sql
begin;

create table tweet
(
  id        bigint primary key,
  date      date,
  hour      time,
  uname     text,
  nickname  text,
  bio       text,
  message   text,
  favs      bigint,
  rts       bigint,
  latitude  double precision,
  longitude double precision,
  country   text,
  place     text,
  picture   text,
  followers bigint,
  following bigint,
  listed    bigint,
  lang      text,
  url       text
);

\copy tweet from 'tweets.csv' with csv header delimiter ';'

commit;
```
This database model is all wrong with respect to the normal forms introduced earlier:

- There's neither a unique constraint nor a primary key, so nothing prevents the insertion of duplicate entries, violating 1NF.
- Some non-key attributes are not dependent on the key, because we mix data about the Twitter account posting the message with data about the message itself, violating 2NF. This is the case with all the user's attributes: nickname, bio, picture, followers, following, and listed.
- We have transitive dependencies in the model, which violates 3NF this time:
  - The country and place attributes depend on the location attribute, and as such should live in a separate table, such as the geonames data used in the Denormalized Data Types chapter.
  - The hour attribute depends on the date attribute, as the hour alone can't represent when the tweet was transmitted.
- The longitude and latitude should really be a single location column, given PostgreSQL's ability to deal with geometric data types, here a point.
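As a sketch of that last point, here's one possible migration collapsing the two coordinate columns into a single point column against the tweet table above. Note that PostgreSQL's point() constructor takes x then y, so longitude comes first:

```sql
-- a possible migration sketch, not from the book:
-- collapse (longitude, latitude) into a geometric point column
alter table tweet add column location point;

update tweet
   set location = point(longitude, latitude);

alter table tweet
  drop column longitude,
  drop column latitude;
```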
It is interesting to note that failing to respect the normal forms has a negative impact on the application's performance. Here, each time a user changes their bio, we have to edit the bio in every tweet they ever posted. Or we could decide to give only new tweets the new bio, but then at query time, when showing an old tweet, fetching the user's current bio gets costly.
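To make that cost concrete, here is a sketch of what a single bio change means in each design (the user name 'some_user' is of course hypothetical):

```sql
-- denormalized model: a bio change rewrites every tweet row
-- this user ever posted
update tweet
   set bio = 'New bio text'
 where uname = 'some_user';

-- normalized model: the same change is a single-row update
update tweet.users
   set bio = 'New bio text'
 where uname = 'some_user';
```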
From a concurrency standpoint, a normalized schema also helps to avoid the concurrent update activity on the same rows that is often seen in production.
It’s now time to rewrite our schema, and here’s a first step:
```sql
begin;

create schema if not exists tweet;

create table tweet.users
(
  userid    bigserial primary key,
  uname     text not null,
  nickname  text not null,
  bio       text,
  picture   text,
  followers bigint,
  following bigint,
  listed    bigint,

  unique(uname)
);

create table tweet.message
(
  id       bigint primary key,
  userid   bigint references tweet.users(userid),
  datetime timestamptz not null,
  message  text,
  favs     bigint,
  rts      bigint,
  location point,
  lang     text,
  url      text
);

commit;
```
This model cleanly separates users from their messages, and removes the country and place attributes, which we maintain separately in the geonames schema, as seen earlier.
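As a quick check that the separation works in practice, showing recent tweets along with each author's current profile is now a simple join; a sketch against the schema above:

```sql
-- latest tweets with the author's current name and bio
select users.uname,
       users.bio,
       message.datetime,
       message.message
  from tweet.message
       join tweet.users using (userid)
 order by message.datetime desc
 limit 10;
```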
That said, the followers, following, and listed fields are a summary of other information that we should have but don’t. The fact that the extract we worked with had a simpler, statistics-oriented schema shouldn’t blind us here. There’s a better way to register relationships between users, in terms of who follows whom and who lists whom, as in the following model:
```sql
begin;

create schema if not exists tweet;

create table tweet.users
(
  userid   bigserial primary key,
  uname    text not null,
  nickname text,
  bio      text,
  picture  text,

  unique(uname)
);

create table tweet.follower
(
  follower  bigint not null references tweet.users(userid),
  following bigint not null references tweet.users(userid),

  primary key(follower, following)
);

create table tweet.list
(
  listid bigserial primary key,
  owner  bigint not null references tweet.users(userid),
  name   text not null,

  unique(owner, name)
);

create table tweet.membership
(
  listid   bigint not null references tweet.list(listid),
  member   bigint not null references tweet.users(userid),
  datetime timestamptz not null,

  primary key(listid, member)
);

create table tweet.message
(
  messageid bigserial primary key,
  userid    bigint not null references tweet.users(userid),
  datetime  timestamptz not null default now(),
  message   text not null,
  favs      bigint,
  rts       bigint,
  location  point,
  lang      text,
  url       text
);

commit;
```
Now we can begin to work with this model.
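For instance, counting followers per user now only needs the tweet.follower table; here's a sketch of such a query against the final model:

```sql
-- top users by follower count; the left join keeps users
-- without any followers in the result
select users.uname,
       count(follower.follower) as followers
  from tweet.users
       left join tweet.follower
              on follower.following = users.userid
 group by users.uname
 order by followers desc
 limit 10;
```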
