
Chapter 50 Counting Distinct Users with HyperLogLog

         3 │ 2018-08-15 │ 3308 │ 3227 │ 81
         3 │ 2018-08-16 │ 3261 │ 3184 │ 77
(10 rows)

Lossy Unique Count with HLL

We can rewrite the previous query using our HLL data type now, even though at this stage it's not going to be very useful: we still have the full logs of every visit, so we can afford to compute precise counts.

Nonetheless, our goal is to dispose of the daily entries, which we anticipate will be just too large a data set. So the hll-based query looks like this:

  select messageid,
         datetime::date as date,
         # hll_add_agg(hll_hash_text(ipaddr::text)) as hll
    from tweet.visitor
   where messageid = 3
group by grouping sets((messageid),
                       (messageid, date))
order by messageid, date nulls first
   limit 10;

In this query we use several new functions and operators related to the hll data type:

The # operator takes a single argument: it's a unary operator, like factorial (written !) for example. This unary operator, when applied to a value of type hll, computes the estimated number of distinct entries stored in the hyperloglog set.

The hll_add_agg() aggregate function accumulates new hashes into a given hyperloglog set.

The hll_hash_text function computes the hyperloglog hash of a text value, here used with the IP address in its text form. We could also use the IP address as a 32-bit integer with the hll_hash_integer function instead, but then this wouldn't support IPv6 addresses, which only fit in a 128-bit number.

The notation # hll shows the level of flexibility that PostgreSQL brings to the table with its extensibility support. Not only can you define new operators at runtime from an extension, but those operators can also be unary or binary.
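As a sketch of what such a declaration looks like, this is roughly how an extension like hll can register a unary prefix operator; hll_cardinality is the support function the hll extension documents for this purpose:

```sql
-- Declare # as a prefix (right-unary) operator on the hll type.
-- Omitting leftarg is what makes the operator prefix rather than binary.
create operator # (
    procedure = hll_cardinality,
    rightarg  = hll
);
```

Once declared, the parser accepts # expr anywhere an expression of the operator's result type is valid, exactly as with built-in operators.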


The lossy distinct count result looks like this:

 messageid │    date    │       hll
═══════════╪════════════╪══════════════════
         3 │ 2018-08-07 │ 739.920627061887
         3 │ 2018-08-08 │ 3284.16386418662
         3 │ 2018-08-09 │ 3196.58757626223
         3 │ 2018-08-10 │ 3036.32707701154
         3 │ 2018-08-11 │ 3140.21704515932
         3 │ 2018-08-12 │ 3191.83031512197
         3 │ 2018-08-13 │ 3045.15467688584
         3 │ 2018-08-14 │ 3031.92750496513
         3 │ 2018-08-15 │ 3135.58879460201
         3 │ 2018-08-16 │ 3230.20146096767
(10 rows)

When used that way, the hll feature set doesn’t make much sense. We still have to process as many rows as before, but we lose some precision in the result. The reason why we’ve done that query here is to show the following:

1. Demonstrate how to use the hll operators and functions in a query

2. Show that the estimates from the hll data structures are pretty good, even at this low cardinality
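One way to check point 2 yourself — a sketch against the same tweet.visitor table used throughout this chapter — is to put the exact count and the hll estimate side by side in a single query:

```sql
-- Compare count(distinct …) with the hll estimate, per day.
  select datetime::date as date,
         count(distinct ipaddr) as exact,
         # hll_add_agg(hll_hash_text(ipaddr::text)) as estimate
    from tweet.visitor
   where messageid = 3
group by date
order by date;
```

The relative error between the two columns is typically within a few percent, which is the trade-off hyperloglog is designed around.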

Getting the Visits into Unique Counts

In a production setup we would have the following context and constraints:

Tweets are published and users from the Internet are visiting our tweets.

Our application inserts a new row in tweet.visitor with the visitor's IP address each time there is a new visit to one of our tweets. It also registers the precise timestamp of the visit.

As we anticipate quite some success for our little application idea, we also anticipate not being able to keep all the visitor logs, nor being able to respect our quality of service terms when computing the unique visitors on the fly each time someone needs them.

Finally, as the numbers are used in a marketing context rather than in an invoicing context, we are in a position to lose some precision, and we would actually like to implement a system that is lossy if it allows us to relax our storage and processing requirements.


The previous sections present a great tool for achieving the last point above, and now is the time to put hll to good use. From the tweet.visitor table we are now going to compute a single hyperloglog value per message and per day:

begin;

with new_visitors as
 (
     delete from tweet.visitor
      where id = any (select id
                        from tweet.visitor
                    order by datetime, messageid
                         for update skip locked
                       limit 1000)
  returning messageid,
            cast(datetime as date) as date,
            hll_hash_text(ipaddr::text) as visitors
 ),
 new_visitor_groups as
 (
   select messageid, date, hll_add_agg(visitors) as visitors
     from new_visitors
  group by messageid, date
 )
insert into tweet.uniques
     select messageid, date, visitors
       from new_visitor_groups
on conflict (messageid, date)
  do update set visitors = hll_union(uniques.visitors, excluded.visitors)
          where uniques.messageid = excluded.messageid
            and uniques.date = excluded.date
  returning messageid, date, # visitors as uniques;

rollback;

 

 

 

This query is implemented in several stages thanks to the PostgreSQL support for writable common table expressions:

1. Compute new_visitors by deleting from the buffer table tweet.visitor a thousand rows at a time, and using the skip locked facility that is new in PostgreSQL 9.5.

By default, when attempting to delete a row that is already in use by another transaction doing either an update or a delete, PostgreSQL would have to block until the other transaction released its lock. With the skip locked clause, PostgreSQL can omit the row from the current transaction without incurring any locking or waiting.

Rows skipped that way may appear in the next batch, or they may already be concurrently processed in another batch.

This construct allows the query to be run in more than one transaction at the same time, which might in turn be useful if we ever have some serious lag in our processing.

2. This first CTE of our query then also computes the date from the timestamp with a CAST expression, and the hll hash from the IP address, preparing for the next stage of processing.

3. Compute the new_visitor_groups by aggregating the just-computed hll individual hashes into a single hll set per messageid and per date.

4. Finally, insert those messages' daily unique-visitor hll sets into our summary table tweet.uniques. Of course, if we did compute a set for the same message and the same day before, we then update and hll_union the existing and the new set together.

5. Because PostgreSQL is such a powerful system, of course we return the result of processing the given batch at the end of the query, using the returning clause of the insert command.
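The batching behavior in step 1 can be pictured with two concurrent sessions; this is a sketch assuming two psql connections running the same inner select at the same time:

```sql
-- Session A: claims and locks the first batch of rows.
begin;
select id
  from tweet.visitor
 order by datetime, messageid
   for update skip locked
 limit 1000;

-- Session B, started while A's transaction is still open, does not
-- block on A's row locks: skip locked steps over them, so B claims
-- the next available batch instead of waiting for A to commit.
begin;
select id
  from tweet.visitor
 order by datetime, messageid
   for update skip locked
 limit 1000;
```

This is what makes it safe to run several copies of the batch query in parallel when the buffer table falls behind.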

The do update set clause requires that any single row in the target table be updated only once per command, in order to ensure that the conflict handling mechanism is deterministic. That's the reason why we prepare the hll sets in the new_visitor_groups CTE part of the query.
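To see why that grouping matters, here is a hypothetical illustration (the literal values are made up): a single insert whose source rows target the same (messageid, date) row twice trips the deterministic-update rule:

```sql
-- Two source rows map to the same target row in tweet.uniques,
-- so the on conflict clause would have to update that row twice:
insert into tweet.uniques(messageid, date, visitors)
     values (3, date '2018-08-07', hll_empty() || hll_hash_text('1.2.3.4')),
            (3, date '2018-08-07', hll_empty() || hll_hash_text('5.6.7.8'))
on conflict (messageid, date)
  do update set visitors = hll_union(uniques.visitors, excluded.visitors);
-- ERROR:  ON CONFLICT DO UPDATE command cannot affect row a second time
```

Pre-aggregating into one hll set per key in the new_visitor_groups CTE guarantees each target row is hit at most once.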

When running this query, we obtain the following result:

BEGIN
 messageid │    date    │     uniques
═══════════╪════════════╪══════════════════
         3 │ 2018-08-07 │ 739.920627061887
         3 │ 2018-08-08 │ 257.534468469694
(2 rows)

INSERT 0 2
ROLLBACK

Notice that we finish our script with a rollback command. That allows us to debug and refine the query until we're happy. This 5-stage, 29-line SQL query isn't going to be too complex to maintain thanks to its actions being well separated using CTEs, but it still doesn't get written in a single session in a text file. It gets
