Project report on
“Final Project – B”
Submitted by Group 5:
Sanzhar Aubakirov, A40395
Under The Guidance Of:
Prof. Paulo Trigo Silva
Course:
Department of Engineering in Electronic and Telecommunications and Computer
Discipline:
Machine Learning and Data Mining
January, 2012
1. Introduction
Our company had a meeting with “We-Commerce”, a company that provides online shopping. Their innovation group asked us for technical advice on how to generate knowledge from the large volume of data that “We-Commerce” has been storing over time. After the meeting, “We-Commerce” sent us a report describing the process they use to collect data, together with a dataset containing some of the collected data.
The dataset contains a large amount of information about each visitor's actions in January 2012: the visitor's browser, the product visited, the timestamp of the event (visit), the IP address, and so on.
After analyzing the report and the dataset, we decided to treat this as a “market-basket analysis” problem, where sessions are considered transactions and products are considered items. We will generate association rules using the APRIORI algorithm.
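To make the approach concrete, the following is a minimal Python sketch of the Apriori idea. It is not the Orange implementation used in this project; the function names `apriori` and `rules`, the thresholds, and the sample baskets are illustrative assumptions. Support is the fraction of transactions that contain an itemset, and the confidence of a rule X → Y is support(X ∪ Y) / support(X).

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every single item that occurs in some basket.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    k = 1
    while candidates:
        # Support = fraction of baskets that contain the candidate itemset.
        level = {}
        for cand in candidates:
            support = sum(1 for b in baskets if cand <= b) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-itemsets.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) tuples derived from frequent itemsets."""
    found = []
    for itemset, support in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(itemset, size):
                lhs = frozenset(lhs)
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, support, confidence))
    return found


# Hypothetical sessions; each list holds the products visited in one session.
sessions = [["p1", "p2"], ["p1", "p2", "p3"], ["p1"], ["p2", "p3"]]
frequent = apriori(sessions, min_support=0.5)
strong = rules(frequent, min_confidence=0.6)
```

Raising `min_support` shrinks the set of frequent itemsets, while raising `min_confidence` keeps only the rules whose left-hand side most reliably implies the right-hand side.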
2. Project plan
1. Dataset analysis.
Analyze the data to understand how to prepare it and what knowledge can be obtained.
2. Data preparation and aggregation. Generate a subset of the most relevant data.
Process the recorded transactions to reduce noise and remove useless information. Aggregate the data to obtain new information. We also have to generate a subset of the data to reduce the complexity of the computation. The problem is to generate a relevant subset, that is, a subset with the same characteristics as the parent set, so that rules obtained from the subset can be applied to all of the data. Export the subsets to files. Convert the exported data into a format suitable for the Orange framework.
3. Generate rules using APRIORI algorithm.
Generate all possible association rules. Then tune the options of the algorithm in order to get the strongest rules. Generate reports.
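The export step of the plan can be sketched as follows, assuming the target is Orange's `.basket` text format, in which each line holds one transaction as a comma-separated list of items. The function name `write_basket` and the sample product identifiers are hypothetical.

```python
import io


def write_basket(transactions, out):
    """Write one transaction per line as a comma-separated list of items."""
    for items in transactions:
        # Drop duplicates and sort so that the output is deterministic.
        # Assumption: item identifiers contain no commas.
        unique = sorted(set(items))
        if unique:  # skip transactions that ended up empty
            out.write(", ".join(unique) + "\n")


# Hypothetical transactions; an in-memory buffer stands in for the export file.
buffer = io.StringIO()
write_basket([["p2", "p1", "p1"], [], ["p3"]], buffer)
exported = buffer.getvalue()
```

Writing to any file-like object keeps the sketch independent of where the exported subsets are stored.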
3.1 Dataset analysis.
After analyzing the dataset, we divided the attributes into two groups: useful and useless for our goal.
Useful attributes:
cookie_id - value of the visitor's cookie. Unique identifier of the visitor.
session_id – session identifier. A new session identifier is created whenever a visitor arrives at one of our pages. What we call an “event” is an occurrence (row) in the dataset with one cookie_id and one session_id. So each pair of cookie_id and session_id may be associated with a list of different events, and each cookie_id may be associated with a list of different session_id values. All events with the same cookie_id and session_id form a “transaction” (in terms of market-basket analysis).
user_gui - unique identifier of the user. This attribute is not empty if the user has already subscribed.
product_gui - unique identifier of the visited product. These are our items (in terms of market-basket analysis). This attribute contains a lot of noise (useless values), such as:
'open', 'home', '/onestepcheckout/', '/checkout/cart/index/', '/lon-about-us', '/lon-contacts', 'display.category*homepage', '/ljv-contacts', '/sales/order/history/', etc.
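The mapping from events to transactions described above can be sketched in Python as follows. This is an illustration under assumptions: events are represented as dictionaries keyed by attribute name, the name `build_transactions` is hypothetical, and `NOISE` contains only the example values listed above, so it is not the complete noise list.

```python
from collections import defaultdict

# Only the noise examples named in the report; the full list is longer.
NOISE = {
    "open", "home", "/onestepcheckout/", "/checkout/cart/index/",
    "/lon-about-us", "/lon-contacts", "display.category*homepage",
    "/ljv-contacts", "/sales/order/history/",
}


def build_transactions(events):
    """Group product_gui values by (cookie_id, session_id), skipping noise."""
    baskets = defaultdict(set)
    for event in events:
        product = event["product_gui"]
        if not product or product in NOISE:
            continue  # not a real product, so it cannot be an item
        baskets[(event["cookie_id"], event["session_id"])].add(product)
    return dict(baskets)


# Hypothetical events for illustration.
events = [
    {"cookie_id": "c1", "session_id": "s1", "product_gui": "p1"},
    {"cookie_id": "c1", "session_id": "s1", "product_gui": "home"},
    {"cookie_id": "c1", "session_id": "s1", "product_gui": "p2"},
    {"cookie_id": "c1", "session_id": "s2", "product_gui": "/lon-contacts"},
    {"cookie_id": "c2", "session_id": "s3", "product_gui": "p1"},
]
transactions = build_transactions(events)
```

A session that contains only noise values produces no transaction at all, which keeps empty baskets out of the rule-generation step.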
Useless attributes:
IP – IP address of the visitor.
tracking_record_id – unique identifier of the row. We say that each record represents an “event” (or a visit).
date_time - timestamp of the event.
campaign_id - the identification of a promotional campaign whenever a product belongs to a campaign.
company - the name of the company that provides the product (“product_gui”).
link and refer – URL of the visited web page.
browser – the visitor's browser.
tracking_id
meio
The file z_dataset_201201.csv was imported into the commerce table of a PostgreSQL database to make the data easier to manipulate.
The dataset consists of:
1. 415863 total events (rows).
2. 263137 different cookie_id values.
3. 292750 different session_id values.
4. 15207 different product_gui values (this value is not accurate because of noise).
5. 2283 different visitors who have ordered something.
6. 15508 different events from users with orders (from item 5).
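Counts of this kind can be reproduced with a few lines of Python. The sketch below is only an illustration: the figures above were obtained from the PostgreSQL table, the function name `summarize` is hypothetical, and the inline sample with a comma delimiter is an assumption about the file layout.

```python
import csv
import io

# Hypothetical sample with three of the dataset's columns.
SAMPLE = """cookie_id,session_id,product_gui
c1,s1,p1
c1,s1,p2
c1,s2,home
c2,s3,p1
"""


def summarize(rows):
    """Count the events and the distinct values of the key attributes."""
    rows = list(rows)
    return {
        "events": len(rows),
        "cookie_id": len({r["cookie_id"] for r in rows}),
        "session_id": len({r["session_id"] for r in rows}),
        # Noise values such as 'home' are still counted here, which is
        # why the raw product_gui count overstates the number of products.
        "product_gui": len({r["product_gui"] for r in rows}),
    }


summary = summarize(csv.DictReader(io.StringIO(SAMPLE)))
```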
We will use each fixed pair of cookie_id and session_id as a transaction and the product_gui values as the items of the transactions. user_gui will be used to generate relevant subsets of the dataset: a subset without non-subscribed users and a subset with all users.
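The two subsets can be sketched as a simple filter on user_gui. This is a minimal illustration, assuming events are dictionaries keyed by attribute name and that an empty user_gui marks a non-subscribed visitor; the function name `split_subsets` and the sample values are hypothetical.

```python
def split_subsets(events):
    """Return (subscribed_only, all_users) event lists."""
    all_users = list(events)
    # Keep only events whose user_gui is present and non-empty.
    subscribed_only = [e for e in all_users if e.get("user_gui")]
    return subscribed_only, all_users


# Hypothetical events for illustration.
sample_events = [
    {"cookie_id": "c1", "session_id": "s1", "user_gui": "u1", "product_gui": "p1"},
    {"cookie_id": "c2", "session_id": "s2", "user_gui": "", "product_gui": "p2"},
    {"cookie_id": "c1", "session_id": "s3", "user_gui": "u1", "product_gui": "p3"},
]
subscribed_only, all_users = split_subsets(sample_events)
```

Each subset can then be grouped into transactions by cookie_id and session_id in the same way as the full dataset.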
