3.2 Data preparation and aggregation.

The dataset was aggregated into several views. All scripts that create these views can be found in APPENDIX A.

v_num_of_events_per_transaction - groups rows by the pair (cookie_id, session_id) and counts the rows in each group. It shows how many transactions we have and how many items each transaction contains.

v_number_of_events_per_cookie – aggregated from v_num_of_events_per_transaction by grouping on cookie_id, counting the rows in each group and summing the number of events. It contains the number of sessions and the number of events per visitor (cookie_id).

v_count_of_visitors_and_sessions_per_cookie - aggregated from v_number_of_events_per_cookie by grouping on the number of sessions and counting cookies (visitors). It contains the number of visitors that have the same number of sessions.

v_count_of_visitors_and_sessions_per_event - aggregated from v_num_of_events_per_transaction by grouping on the number of events and counting the rows in each group. It shows the distribution of the number of events per transaction.
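
The four counting views above can be illustrated with a minimal Python sketch. It is not the SQL from APPENDIX A: the sample events and all variable names are assumptions made for illustration, and only the grouping logic follows the descriptions above.

```python
from collections import Counter

# Hypothetical raw events: one (cookie_id, session_id) pair per logged row.
events = [
    ("c1", "s1"), ("c1", "s1"), ("c1", "s2"),
    ("c2", "s3"),
    ("c3", "s4"), ("c3", "s4"), ("c3", "s5"),
]

# Like v_num_of_events_per_transaction: rows per (cookie_id, session_id).
events_per_transaction = Counter(events)

# Like v_number_of_events_per_cookie: sessions and events per cookie_id.
sessions_per_cookie = Counter()
events_per_cookie = Counter()
for (cookie_id, _session_id), n_events in events_per_transaction.items():
    sessions_per_cookie[cookie_id] += 1
    events_per_cookie[cookie_id] += n_events

# Like v_count_of_visitors_and_sessions_per_cookie:
# how many visitors share each session count.
visitors_per_session_count = Counter(sessions_per_cookie.values())

# Like v_count_of_visitors_and_sessions_per_event:
# how many transactions share each event count.
transactions_per_event_count = Counter(events_per_transaction.values())
```

Each later aggregate is built only from an earlier one, which mirrors the way the views are stacked on top of v_num_of_events_per_transaction.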

v_visitors_with_orders – aggregated from the whole dataset by selecting distinct ordered products. It contains only data about products that were ordered or checked out.

v_commerce_without_noise_product – aggregated from the whole dataset by removing useless information stored in the product_gui attribute. It contains only products that were viewed by visitors, without orders and without pages that are not related to a product (such as “/sales/order/history/”, “/lon-about-us”, etc.).
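
A noise filter of this kind might look like the sketch below. The actual filtering rule is defined in APPENDIX A and is not reproduced here; the prefix list, the sample rows and the function name are assumptions, and only the two example paths come from the text above.

```python
# Hypothetical list of non-product page prefixes; the real list is longer.
NOISE_PREFIXES = ("/sales/order/history/", "/lon-about-us")


def is_product_page(product_gui):
    """Return True when product_gui does not start with a known noise prefix."""
    return not product_gui.startswith(NOISE_PREFIXES)


# Hypothetical (session_id, product_gui) rows.
rows = [
    ("s1", "/red-shoes"),
    ("s1", "/sales/order/history/"),
    ("s2", "/lon-about-us"),
    ("s2", "/blue-hat"),
]

# Keep only rows that point at a product page.
clean_rows = [(s, p) for s, p in rows if is_product_page(p)]
```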

Using these aggregated views we obtain more meaningful information. We now have:

1. 211553 events without noise instead of 415863.

2. 6996 different products instead of 15207.

3. 2283 different visitors who have ordered something.

4. 10200 different events from users with orders (from p. 5).

So we reduced the amount of data roughly by a factor of two: about 51% of the events (211553 of 415863) and about 46% of the products (6996 of 15207) remain.

To reduce the amount of data further, we created the v_relevant_subset_cookies view. It contains the distinct cookies from v_number_of_events_per_cookie whose number of sessions is greater than 5 but less than 30. We remove visitors that are too infrequent or too frequent from the subset to obtain more normalized data.
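
The session-count filter can be sketched as follows. The function name and the sample counts are hypothetical; the strict bounds (more than 5, less than 30) are taken from the description above.

```python
def relevant_cookies(sessions_per_cookie, low=5, high=30):
    """Return cookie_ids whose session count is strictly between low and high."""
    return sorted(
        cookie_id
        for cookie_id, n_sessions in sessions_per_cookie.items()
        if low < n_sessions < high
    )


# Hypothetical session counts per cookie_id.
sessions_per_cookie = {"c1": 2, "c2": 6, "c3": 29, "c4": 30, "c5": 5, "c6": 120}
subset = relevant_cookies(sessions_per_cookie)
```

With these sample counts only c2 and c3 survive, because both boundary values (5 and 30) are excluded.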

Finally, we split all visitors into three groups:

1. All users without orders and checkouts.

2. Already subscribed users without orders and checkouts.

3. Users who have already ordered something.

The relevant subsets were exported from the following views, respectively:

1. v_relevant_subset_without_orders_all.

2. v_relevant_subset_without_orders_subscribed_users.

3. v_relevant_subset_orders_only.

All subsets consist of two columns: session_id and product_gui.

Using these views we generate 3 baskets. After exporting to a file, we process each subset to obtain a table in *.basket format. This is implemented in the sparse.py script. Then we “normalize” the strings that describe each product by removing white space, removing accents and converting to lower case. This is implemented in the _goPy_transform.py script. Using the AssociationRulesSparseInducer module of Orange, we generate rules with a support threshold of 1-2%. We sort all the resulting rules by confidence, lift and leverage, so we have 3 reports for each of the subsets. This is implemented in the assoc1.py script.
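
The normalization and basket-building steps might be sketched as below. This is not the code of sparse.py or _goPy_transform.py: the function names and sample rows are hypothetical, and the sketch assumes that “removing white space” means dropping all whitespace characters and that a basket line is a comma-separated list of the items of one session.

```python
import unicodedata
from collections import defaultdict


def normalize(name):
    """Remove accents and whitespace from a product string and lowercase it."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # split() with no argument splits on any whitespace run.
    return "".join(no_accents.split()).lower()


def to_basket_lines(rows):
    """Group (session_id, product_gui) rows into one text line per session."""
    baskets = defaultdict(list)
    for session_id, product_gui in rows:
        item = normalize(product_gui)
        if item not in baskets[session_id]:  # keep each item once per basket
            baskets[session_id].append(item)
    return [", ".join(items) for items in baskets.values()]


# Hypothetical exported rows.
rows = [
    ("s1", "Café Crème"),
    ("s1", "Red Shoes"),
    ("s2", "red shoes"),
    ("s1", "CAFE CREME"),
]
lines = to_basket_lines(rows)
```

Normalizing before grouping means that spelling variants of the same product collapse into a single item, which keeps them from splitting the support of a rule.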

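The measures used for sorting the rules can be computed by hand for a single rule, as in the sketch below. It uses the standard definitions of support, confidence, lift and leverage in plain Python and does not call Orange; the function name and the sample baskets are hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Compute support, confidence, lift and leverage of antecedent -> consequent."""
    n = len(baskets)
    a, c = set(antecedent), set(consequent)
    # Fraction of baskets that contain each itemset.
    supp_a = sum(a <= b for b in baskets) / n
    supp_c = sum(c <= b for b in baskets) / n
    supp_ac = sum((a | c) <= b for b in baskets) / n
    confidence = supp_ac / supp_a if supp_a else 0.0
    lift = confidence / supp_c if supp_c else 0.0
    return {
        "support": supp_ac,
        "confidence": confidence,
        "lift": lift,
        "leverage": supp_ac - supp_a * supp_c,
    }


# Hypothetical baskets, one set of items per session.
baskets = [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}]
rules = [
    {"rule": "a -> b", **rule_metrics(baskets, ["a"], ["b"])},
    {"rule": "b -> a", **rule_metrics(baskets, ["b"], ["a"])},
]

# One report per measure, best rules first.
reports = {
    key: sorted(rules, key=lambda r: r[key], reverse=True)
    for key in ("confidence", "lift", "leverage")
}
```

A lift above 1 and a positive leverage both indicate that the two sides of a rule occur together more often than they would if they were independent.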