
Chapter 50 Counting Distinct Users with HyperLogLog

         3 │ 2018-08-15 │ 3308 │ 3227 │ 81
         3 │ 2018-08-16 │ 3261 │ 3184 │ 77
(10 rows)

Lossy Unique Count with HLL

We can rewrite the previous query using our HLL data type now, even though at this stage it's not going to be very useful: we still have the full logs of every visit, so we can afford to compute precise counts.

Nonetheless, our goal is to dispose of the daily entries, which we anticipate will be just too large a data set. So the hll-based query looks like this:

  select messageid,
         datetime::date as date,
         # hll_add_agg(hll_hash_text(ipaddr::text)) as hll
    from tweet.visitor
   where messageid = 3
group by grouping sets((messageid),
                       (messageid, date))
order by messageid, date nulls first
   limit 10;

In this query we use several new functions and operators related to the hll data type:

The # operator takes a single argument: it's a unary operator, like factorial (written !) for example. This unary operator, when applied to a value of type hll, computes the estimated number of distinct entries stored in the hyperloglog set.

The hll_add_agg() aggregate function accumulates new hashes into a given hyperloglog set.

The hll_hash_text function computes the hyperloglog hash of a text value, here used with the IP address in its text form. We could also use the IP address as a 32-bit integer with the hll_hash_integer function instead, but then this wouldn't support IPv6 addresses, which only fit in a 128-bit number.

The notation # hll shows the level of flexibility that PostgreSQL brings to the table with its extensibility support. Not only can you define new operators at runtime from an extension, but those operators can also be unary or binary.
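As a sketch of what such a declaration looks like, this is roughly how an extension like hll can register a unary prefix operator; hll_cardinality is the support function the hll extension documents for this purpose:

```sql
-- Declare # as a prefix (right-unary) operator on the hll type.
-- Omitting leftarg is what makes the operator prefix rather than binary.
create operator # (
    procedure = hll_cardinality,
    rightarg  = hll
);
```

Once declared, the parser accepts # expr anywhere an expression of the operator's result type is valid, exactly as with built-in operators.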


The lossy distinct count result looks like this:

 messageid │    date    │       hll
═══════════╪════════════╪══════════════════
         3 │ 2018-08-07 │ 739.920627061887
         3 │ 2018-08-08 │ 3284.16386418662
         3 │ 2018-08-09 │ 3196.58757626223
         3 │ 2018-08-10 │ 3036.32707701154
         3 │ 2018-08-11 │ 3140.21704515932
         3 │ 2018-08-12 │ 3191.83031512197
         3 │ 2018-08-13 │ 3045.15467688584
         3 │ 2018-08-14 │ 3031.92750496513
         3 │ 2018-08-15 │ 3135.58879460201
         3 │ 2018-08-16 │ 3230.20146096767
(10 rows)

When used that way, the hll feature set doesn’t make much sense. We still have to process as many rows as before, but we lose some precision in the result. The reason why we’ve done that query here is to show the following:

1. Demonstrate how to use the hll operators and functions in a query

2. Show that the estimates from the hll data structures are pretty good, even at this low cardinality
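One way to check point 2 yourself — a sketch against the same tweet.visitor table used throughout this chapter — is to put the exact count and the hll estimate side by side in a single query:

```sql
-- Compare count(distinct …) with the hll estimate, per day.
  select datetime::date as date,
         count(distinct ipaddr) as exact,
         # hll_add_agg(hll_hash_text(ipaddr::text)) as estimate
    from tweet.visitor
   where messageid = 3
group by date
order by date;
```

The relative error between the two columns is typically within a few percent, which is the trade-off hyperloglog is designed around.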

Getting the Visits into Unique Counts

In a production setup we would have the following context and constraints:

Tweets are published and users from the Internet are visiting our tweets.

Our application inserts a new row in tweet.visitor with the visitor's IP address each time there is a new visit to one of our tweets. It also registers the precise timestamp of the visit.

As we anticipate quite some success for our little application idea, we also anticipate not being able to keep all the visitor logs, nor being able to respect our quality of service terms when computing the unique visitors on the fly each time someone needs them.

Finally, as the numbers are used in a marketing context rather than in an invoicing context, we are in a position to lose some precision, and we would actually like to implement a system that is lossy if it allows us to relax our storage and processing requirements.


The previous sections present a great tool for achieving the last point above, and now is the time to put hll to good use. From the tweet.visitor table we are now going to compute a single hyperloglog value per message and per day:

begin;

with new_visitors as
 (
     delete from tweet.visitor
      where id = any (select id
                        from tweet.visitor
                    order by datetime, messageid
                         for update skip locked
                       limit 1000)
  returning messageid,
            cast(datetime as date) as date,
            hll_hash_text(ipaddr::text) as visitors
 ),
 new_visitor_groups as
 (
   select messageid, date, hll_add_agg(visitors) as visitors
     from new_visitors
  group by messageid, date
 )
insert into tweet.uniques
     select messageid, date, visitors
       from new_visitor_groups
on conflict (messageid, date)
  do update set visitors = hll_union(uniques.visitors, excluded.visitors)
          where uniques.messageid = excluded.messageid
            and uniques.date = excluded.date
  returning messageid, date, # visitors as uniques;

rollback;

 

 

 

This query is implemented in several stages thanks to the PostgreSQL support for writable common table expressions:

1. Compute new_visitors by deleting from the buffer table tweet.visitor a thousand rows at a time, and using the skip locked facility that is new in PostgreSQL 9.5.

By default, when attempting to delete a row that is already in use by another transaction doing either an update or a delete, PostgreSQL would have to block until the other transaction released its lock. With the skip locked clause, PostgreSQL can omit the row from the current transaction without incurring any locking or waiting.

Rows skipped that way may appear in the next batch, or they may already be concurrently processed in another batch.

This construct allows the query to be run in more than one transaction at the same time, which might in turn be useful if we ever have some serious lag in our processing.

2. This first CTE of our query then also computes the date from the timestamp with a CAST expression, and the hll hash from the IP address, preparing for the next stage of processing.

3. Compute the new_visitor_groups by aggregating the just-computed hll individual hashes into a single hll set per messageid and per date.

4. Finally, insert those messages' daily unique-visitor hll sets into our summary table tweet.uniques. Of course, if we did compute a set for the same message and the same day before, we then update and hll_union the existing and the new set together.

5. Because PostgreSQL is such a powerful system, of course we return the result of processing the given batch at the end of the query, using the returning clause of the insert command.
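The batching behavior in step 1 can be pictured with two concurrent sessions; this is a sketch assuming two psql connections running the same inner select at the same time:

```sql
-- Session A: claims and locks the first batch of rows.
begin;
select id
  from tweet.visitor
 order by datetime, messageid
   for update skip locked
 limit 1000;

-- Session B, started while A's transaction is still open, does not
-- block on A's row locks: skip locked steps over them, so B claims
-- the next available batch instead of waiting for A to commit.
begin;
select id
  from tweet.visitor
 order by datetime, messageid
   for update skip locked
 limit 1000;
```

This is what makes it safe to run several copies of the batch query in parallel when the buffer table falls behind.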

The do update set clause requires that any single row in the target table be updated only once per command, in order to ensure that the conflict handling mechanism is deterministic. That's the reason why we prepare the hll sets in the new_visitor_groups CTE part of the query.
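To see why that grouping matters, here is a hypothetical illustration (the literal values are made up): a single insert whose source rows target the same (messageid, date) row twice trips the deterministic-update rule:

```sql
-- Two source rows map to the same target row in tweet.uniques,
-- so the on conflict clause would have to update that row twice:
insert into tweet.uniques(messageid, date, visitors)
     values (3, date '2018-08-07', hll_empty() || hll_hash_text('1.2.3.4')),
            (3, date '2018-08-07', hll_empty() || hll_hash_text('5.6.7.8'))
on conflict (messageid, date)
  do update set visitors = hll_union(uniques.visitors, excluded.visitors);
-- ERROR:  ON CONFLICT DO UPDATE command cannot affect row a second time
```

Pre-aggregating into one hll set per key in the new_visitor_groups CTE guarantees each target row is hit at most once.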

When running this query, we obtain the following result:

BEGIN
 messageid │    date    │     uniques
═══════════╪════════════╪══════════════════
         3 │ 2018-08-07 │ 739.920627061887
         3 │ 2018-08-08 │ 257.534468469694
(2 rows)

INSERT 0 2
ROLLBACK

Notice that we finish our script with a rollback command. That allows us to debug and refine the query until we're happy. This 5-stage, 29-line SQL query isn't going to be too complex to maintain thanks to its actions being well separated using CTEs, but it still doesn't get written in a single session in a text file. It gets
