
Chapter 22 PostgreSQL Data Types j 187

set, having this processing feature set right in PostgreSQL makes sense in several use cases:

• It makes sense when the SQL logic or filtering you want to implement depends on the result of the processing (e.g. grouping by week).

• When you have several applications using the same logic, it’s often easier to share a SQL query than to set up a distributed service API offering the same result in XML or JSON (a data format you then have to parse).

• When you want to reduce your run-time dependencies, it’s a good idea to understand how much each architecture layer is able to support in your implementation.

Network Address Types

PostgreSQL includes support for the cidr, inet, and macaddr data types. Again, those types are bundled with indexing support and with advanced functions and operators.

The PostgreSQL documentation chapters entitled Network Address Types and Network Address Functions and Operators cover network address types.

Web server logs are a classic source of data to process where we find network address types, and The Honeynet Project has some free samples for us to play with. This time we’re using the Scan 34 entry. Here’s how to load the sample data set, once cleaned into a proper CSV file:

begin;

drop table if exists access_log;

create table access_log
 (
   ip      inet,
   ts      timestamptz,
   request text,
   status  integer
 );

\copy access_log from 'access.csv' with csv delimiter ';'

commit;


The script used to cleanse the original data into a CSV that PostgreSQL is happy about implements a pretty simple transformation from

211.141.115.145 - - [13/Mar/2005:04:10:18 -0500] "GET / HTTP/1.1" 403 2898 "-" "Mozill

into

"211.141.115.145";"2005-03-13 04:10:18 -0500";"GET / HTTP/1.1";"403"

Since we are mostly interested in network address types, the transformation from the Apache access log format to CSV is lossy here: we keep only some of the fields we might be interested in.
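The cleansing script itself is not reproduced in these pages. As a rough idea of what such a transformation could look like, here is a minimal Python sketch. The regular expression, the function names, and the choice to skip lines that do not match are all assumptions made for illustration, not the script actually used for the book.

```python
import csv
import io
import re
from datetime import datetime

# Leading fields of an Apache access log line:
# ip, identd, user, [timestamp], "request", status
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'
)


def log_line_to_row(line):
    """Return [ip, ts, request, status] for one log line, or None if it does not parse."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    # Parse the Apache timestamp (day/month name/year:time offset) and
    # re-emit it in a form that timestamptz accepts.
    # Note: %b relies on English month names in the current locale.
    ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return [
        match["ip"],
        ts.strftime("%Y-%m-%d %H:%M:%S %z"),
        match["request"],
        match["status"],
    ]


def convert(lines, out):
    """Write the parsed lines to `out` as CSV with ';' as delimiter, quoting every field."""
    writer = csv.writer(out, delimiter=";", quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in lines:
        row = log_line_to_row(line)
        if row is not None:  # lossy on purpose: unparsable lines are dropped
            writer.writerow(row)


sample = '211.141.115.145 - - [13/Mar/2005:04:10:18 -0500] "GET / HTTP/1.1" 403 2898 "-" "-"'
buffer = io.StringIO()
convert([sample], buffer)
print(buffer.getvalue(), end="")
```

The sketch does not handle requests that contain escaped double quotes, so a real log may need a more careful parser.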

One of the things that’s possible to implement thanks to the PostgreSQL inet data type is an analysis of /24 networks that are to be found in the logs.

To enable that analysis, we can use the set_masklen() function, which allows us to transform an IP address into an arbitrary CIDR network address:

select distinct on (ip)
       ip,
       set_masklen(ip, 24) as inet_24,
       set_masklen(ip::cidr, 24) as cidr_24
  from access_log
 limit 10;

And we can see that if we keep the data type as inet, we still get the full IP address with the /24 network notation added. To have the .0/24 notation we need to be using cidr:

      ip       │     inet_24      │     cidr_24
═══════════════╪══════════════════╪═════════════════
 4.35.221.243  │ 4.35.221.243/24  │ 4.35.221.0/24
 4.152.207.126 │ 4.152.207.126/24 │ 4.152.207.0/24
 4.152.207.238 │ 4.152.207.238/24 │ 4.152.207.0/24
 4.249.111.162 │ 4.249.111.162/24 │ 4.249.111.0/24
 12.1.223.132  │ 12.1.223.132/24  │ 12.1.223.0/24
 12.8.192.60   │ 12.8.192.60/24   │ 12.8.192.0/24
 12.33.114.7   │ 12.33.114.7/24   │ 12.33.114.0/24
 12.47.120.130 │ 12.47.120.130/24 │ 12.47.120.0/24
 12.172.137.4  │ 12.172.137.4/24  │ 12.172.137.0/24
 18.194.1.122  │ 18.194.1.122/24  │ 18.194.1.0/24
(10 rows)

Of course, you could analyze networks other than /24:

select distinct on (ip)
       ip,
       set_masklen(ip::cidr, 27) as cidr_27,
       set_masklen(ip::cidr, 28) as cidr_28
  from access_log
 limit 10;

This computes the proper starting IP addresses for our CIDR notation, of course. After all, what’s the point of using proper data types if not for advanced processing?

      ip       │     cidr_27      │     cidr_28
═══════════════╪══════════════════╪══════════════════
 4.35.221.243  │ 4.35.221.224/27  │ 4.35.221.240/28
 4.152.207.126 │ 4.152.207.96/27  │ 4.152.207.112/28
 4.152.207.238 │ 4.152.207.224/27 │ 4.152.207.224/28
 4.249.111.162 │ 4.249.111.160/27 │ 4.249.111.160/28
 12.1.223.132  │ 12.1.223.128/27  │ 12.1.223.128/28
 12.8.192.60   │ 12.8.192.32/27   │ 12.8.192.48/28
 12.33.114.7   │ 12.33.114.0/27   │ 12.33.114.0/28
 12.47.120.130 │ 12.47.120.128/27 │ 12.47.120.128/28
 12.172.137.4  │ 12.172.137.0/27  │ 12.172.137.0/28
 18.194.1.122  │ 18.194.1.96/27   │ 18.194.1.112/28
(10 rows)
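To make the difference between the inet and cidr results concrete outside of the database, the same arithmetic can be sketched with Python's standard ipaddress module. This is only an analogy: the helper names inet_masklen and cidr_masklen are made up here, and PostgreSQL does not use this module.

```python
import ipaddress


def inet_masklen(ip, masklen):
    # Analogous to set_masklen(ip, n) on an inet value:
    # the host bits are kept, only the prefix length changes.
    return ipaddress.ip_interface(f"{ip}/{masklen}")


def cidr_masklen(ip, masklen):
    # Analogous to set_masklen(ip::cidr, n):
    # the bits to the right of the mask are zeroed, giving the network address.
    return ipaddress.ip_network(f"{ip}/{masklen}", strict=False)


for ip in ("4.35.221.243", "4.152.207.126"):
    print(ip, inet_masklen(ip, 24), cidr_masklen(ip, 24),
          cidr_masklen(ip, 27), cidr_masklen(ip, 28))
```

For 4.35.221.243, the last octet 243 masked with 224 (the /27 mask) gives 224, and masked with 240 (the /28 mask) gives 240, which matches the first row of the result above.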

Equipped with this set_masklen() function, it’s now easy to analyze our access logs using arbitrary CIDR network definitions.

  select set_masklen(ip::cidr, 24) as network,
         count(*) as requests,
         array_length(array_agg(distinct ip), 1) as ipcount
    from access_log
group by network
  having array_length(array_agg(distinct ip), 1) > 1
order by requests desc, ipcount desc;

In our case, we get the following result:

     network      │ requests │ ipcount
══════════════════╪══════════╪═════════
 4.152.207.0/24   │      140 │       2
 222.95.35.0/24   │       59 │       2
 211.59.0.0/24    │       32 │       2
 61.10.7.0/24     │       25 │      25
 222.166.160.0/24 │       25 │      24
 219.153.10.0/24  │        7 │       3
 218.78.209.0/24  │        6 │       4
 193.109.122.0/24 │        5 │       5
 204.102.106.0/24 │        3 │       3
 66.134.74.0/24   │        2 │       2
 219.133.137.0/24 │        2 │       2
 61.180.25.0/24   │        2 │       2
(12 rows)
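For readers who prefer to see the semantics of that query spelled out step by step, here is a rough Python equivalent of the group by, having, and order by clauses. The function name and the sample addresses are invented for illustration; with the data already in PostgreSQL, the SQL query remains the simpler route.

```python
import ipaddress
from collections import defaultdict


def networks_with_several_ips(ips, masklen=24):
    """Group request IPs by network; keep networks seen with more than one distinct IP."""
    requests = defaultdict(int)   # network -> count(*)
    distinct = defaultdict(set)   # network -> set of distinct ip values
    for ip in ips:
        network = ipaddress.ip_network(f"{ip}/{masklen}", strict=False)
        requests[network] += 1
        distinct[network].add(ip)
    rows = [
        (str(network), requests[network], len(distinct[network]))
        for network in requests
        if len(distinct[network]) > 1          # the "having" clause
    ]
    # order by requests desc, ipcount desc
    rows.sort(key=lambda row: (-row[1], -row[2]))
    return rows


# Hypothetical input: one entry per logged request.
sample_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.1.5",
              "192.168.1.1", "192.168.1.2"]
for row in networks_with_several_ips(sample_ips):
    print(row)
```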


Ranges

Range types are a unique feature of PostgreSQL, managing two dimensions of data in a single column, and allowing advanced processing. The main example is the daterange data type, which stores the lower and the upper bound of the range as a single value. This allows PostgreSQL to implement a concurrent-safe check against overlapping ranges, as we’re going to see in the next example.

As usual, read the PostgreSQL documentation chapters with the titles Range Types and Range Functions and Operators for complete information.

The International Monetary Fund publishes exchange rate archives by month for lots of currencies. An exchange rate is relevant from its publication until the next rate is published, which makes a very good use case for our PostgreSQL range types.

The following SQL script is the main part of the ELT script that has been used for this book. Only missing from this book’s pages is the transformation script that pivots the available tsv file into the more interesting format we use here:

begin;

create schema if not exists raw;

-- Must be run as a Super User in your database instance
-- create extension if not exists btree_gist;

drop table if exists raw.rates, rates;

create table raw.rates
 (
  currency text,
  date     date,
  rate     numeric
 );

\copy raw.rates from 'rates.csv' with csv delimiter ';'

create table rates
 (
  currency text,
  validity daterange,
  rate     numeric,

  exclude using gist (currency with =,
                      validity with &&)
 );

insert into rates(currency, validity, rate)
     select currency,
            daterange(date,
                      lead(date) over(partition by currency
                                          order by date),
                      '[)'
                     )
            as validity,
            rate
       from raw.rates
   order by date;

commit;

In this SQL script, we first create a target table for loading the CSV file. The file contains lines with a currency name, a date of publication, and a rate as a numeric value. Once the data is loaded into this table, we can transform it into something more interesting to work with from an application, the rates table.

The rates table registers the rate value for a currency and a validity period, and uses an exclusion constraint that guarantees non-overlapping validity periods for any given currency:

exclude using gist (currency with =, validity with &&)

This expression reads: exclude any tuple where the currency is = to an existing currency in our table AND where the validity is overlapping with (&&) any existing validity in our table. This exclusion constraint is implemented in PostgreSQL using a GiST index.
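The rule the constraint enforces can be illustrated with a small Python sketch. It assumes non-empty half-open ranges represented as (lower, upper) tuples, with None standing for an unbounded upper limit; the function names are made up for the example. Unlike the GiST-backed constraint, this in-memory check offers no protection against concurrent writers.

```python
from datetime import date


def overlaps(a, b):
    """True when two half-open ranges [lower, upper) share at least one day."""
    a_lower, a_upper = a
    b_lower, b_upper = b
    return ((b_upper is None or a_lower < b_upper)
            and (a_upper is None or b_lower < a_upper))


def violates_exclusion(existing, new_row):
    """existing: list of (currency, validity); new_row: (currency, validity)."""
    currency, validity = new_row
    return any(
        row_currency == currency and overlaps(row_validity, validity)
        for row_currency, row_validity in existing
    )


existing = [("Euro", (date(2017, 5, 2), date(2017, 5, 3)))]

# Adjacent ranges do not overlap, because the upper bound is excluded.
print(violates_exclusion(existing, ("Euro", (date(2017, 5, 3), date(2017, 5, 4)))))
# Same dates for the same currency: rejected.
print(violates_exclusion(existing, ("Euro", (date(2017, 5, 2), date(2017, 5, 4)))))
```

The adjacency case is the reason the script builds its ranges with the '[)' bounds: each rate ends exactly where the next one starts without the two ever overlapping.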

By default, GiST in PostgreSQL doesn’t support one-dimensional data types that are meant to be covered by B-tree indexes. With exclusion constraints though, it’s very interesting to extend GiST support to one-dimensional data types, and so we install the btree_gist extension, provided in the PostgreSQL contrib package.

The script then fills in the rates table from the raw.rates table we imported in the previous step. The query uses the lead() window function to implement the specification spelled out in English earlier: an exchange rate is relevant from its publication until the next rate is published.
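What lead() contributes here can be shown with a short Python sketch that pairs each publication date with the next one for the same currency. The function name and tuple layout are assumptions for the example. The last rate of each currency gets None as its upper bound, mirroring the NULL that lead() returns for the last row of a partition, which daterange() treats as unbounded.

```python
from datetime import date
from decimal import Decimal
from itertools import groupby
from operator import itemgetter


def validity_ranges(raw_rates):
    """raw_rates: iterable of (currency, date, rate).

    Yields (currency, lower, upper, rate), where [lower, upper) is the
    validity period and upper is None for the most recent rate.
    """
    # partition by currency, order by date
    ordered = sorted(raw_rates, key=itemgetter(0, 1))
    for currency, group in groupby(ordered, key=itemgetter(0)):
        rows = list(group)
        # The "lead": the date of the following row, or None at the end.
        next_dates = [published for _, published, _ in rows[1:]] + [None]
        for (_, published, rate), upper in zip(rows, next_dates):
            yield (currency, published, upper, rate)


raw = [
    ("Euro", date(2017, 5, 8), Decimal("1.252880")),
    ("Euro", date(2017, 5, 5), Decimal("1.250510")),
]
for row in validity_ranges(raw):
    print(row)
```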

Here’s how the data looks, with the following query targeting Euro rates:

  select currency, validity, rate
    from rates
   where currency = 'Euro'
order by validity
   limit 10;

We can see that the validity is a range of dates, and the standard output for this type is a half-open range, which includes the first entry and excludes the second one:

 currency │        validity         │   rate
══════════╪═════════════════════════╪══════════
 Euro     │ [2017-05-02,2017-05-03) │ 1.254600
 Euro     │ [2017-05-03,2017-05-04) │ 1.254030
 Euro     │ [2017-05-04,2017-05-05) │ 1.252780
 Euro     │ [2017-05-05,2017-05-08) │ 1.250510
 Euro     │ [2017-05-08,2017-05-09) │ 1.252880
 Euro     │ [2017-05-09,2017-05-10) │ 1.255280
 Euro     │ [2017-05-10,2017-05-11) │ 1.255300
 Euro     │ [2017-05-11,2017-05-12) │ 1.257320
 Euro     │ [2017-05-12,2017-05-15) │ 1.255530
 Euro     │ [2017-05-15,2017-05-16) │ 1.248960
(10 rows)

Having this data set with the exclusion constraint means that we know we have at most a single rate available at any point in time, which allows an application needing the rate for a specific time to write the following query:


select rate
  from rates
 where currency = 'Euro'
   and validity @> date '2017-05-18';

The operator @> reads “contains”, and PostgreSQL uses the exclusion constraint’s index to solve that query efficiently:

   rate
══════════
 1.240740
(1 row)
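The containment test behind @> is simple to state for half-open ranges: a date is contained when it is on or after the lower bound and strictly before the upper bound. The Python sketch below applies that test to three of the Euro rows listed earlier. The function name and the in-memory list are assumptions for illustration, and the linear scan stands in for what PostgreSQL does with the GiST index.

```python
from datetime import date
from decimal import Decimal

# (currency, lower, upper, rate), copied from the Euro rows shown earlier.
RATES = [
    ("Euro", date(2017, 5, 4), date(2017, 5, 5), Decimal("1.252780")),
    ("Euro", date(2017, 5, 5), date(2017, 5, 8), Decimal("1.250510")),
    ("Euro", date(2017, 5, 8), date(2017, 5, 9), Decimal("1.252880")),
]


def rate_at(rates, currency, day):
    """Return the rate whose validity [lower, upper) contains `day`, or None."""
    for row_currency, lower, upper, rate in rates:
        if (row_currency == currency
                and lower <= day
                and (upper is None or day < upper)):
            return rate  # at most one row can match when ranges never overlap
    return None


# 2017-05-06 is a Saturday with no published rate; the Friday rate still applies.
print(rate_at(RATES, "Euro", date(2017, 5, 6)))
```

Because the ranges are guaranteed not to overlap, the lookup can stop at the first match, which is the same property the application relies on when it expects a single row from the SQL query.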
