- •LECTURE 8
- •What is Data Mining?
- •Typical Kinds of Patterns
- •Example: Clusters
- •Example: Frequent Itemsets
- •Applications (Among Many)
- •Cultures
- •Models vs. Analytic Processing
- •(Way too Simple) Example
- •Meaningfulness of Answers
- •Examples
- •Rhine Paradox --- (1)
- •Rhine Paradox --- (2)
- •Rhine Paradox --- (3)
- •What is Web Mining?
- •How does it differ from “classical” Data Mining?
- •The World-Wide Web
- •Size of the Web
- •Netcraft survey
- •The web as a graph
- •Power-law degree distribution
- •Power-laws galore
- •Searching the Web
- •Ads vs. search results
- •Ads vs. search results
- •Sidebar: What’s in a name?
- •The Long Tail
- •Web Mining topics
- •Web search basics
- •Search engine components
- •Knowledge Discovery in
- •Typical Tasks in Data Mining
- •Typical Tasks in Data Mining
- •Typical Tasks in Data Mining
- •Typical Tasks in Data Mining
- •Typical Tasks in Data Mining
- •Typical Tasks in Data Mining
- •Typical Tasks in Data Mining
- •What is Data Mining?
- •Data Mining Algorithms
- •Data Mining Algorithms
- •Data Mining Models
- •Data Mining Models
- •Data Mining Models
- •Data Mining Models
- •Data Mining Models
- •Searching the Model Space
- •Searching the Model Space
- •THANK YOU
Examples
A big objection to TIA (Total Information Awareness) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents’ privacy.
The Rhine Paradox: a great example of how not to conduct scientific research.
11
Rhine Paradox --- (1)
David Rhine was a parapsychologist in the 1950’s who hypothesized that some people had Extra-Sensory Perception.
He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.
He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!
12
Rhine Paradox --- (2)
He told these people they had ESP and called them in for another test of the same type.
Alas, he discovered that almost all of them had lost their ESP.
What did he conclude?Answer on next slide.
13
Rhine Paradox --- (3)
He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
14
What is Web Mining?
Discovering useful information from the World-Wide Web and its usage patterns
Applications
Web search e.g., Google, Yahoo,
…
Vertical Search e.g., FatLens, Become,…
Recommendations e.g., Amazon.com
Advertising e.g., Google, Yahoo
How does it differ from “classical” Data Mining?
The web is not a relation
Textual information and linkage structure
Usage data is huge and growing rapidly
Google’s usage logs are bigger than their web crawl
Data generated per day is comparable to largest conventional data warehouses
Ability to react in real-time to usage patterns
The World-Wide Web
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Huge |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Distributed content |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
creation, linking (no |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
coordination) |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Structured databases, |
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
||
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
unstructured text, |
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
semistructured |
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Content includes truth, lies, |
|
|
|
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
obsolete information, |
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
contradictions, … |
|
|
|
|
|
|
|
|
|
|
|
|
|
||
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Our modern-day Library of |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The Web |
|
|
|
|
|
||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Alexandria |
Size of the Web
Number of pagesTechnically, infinite
Because of dynamically generated contentLots of duplication (30-40%)
Best estimate of “unique” static HTML pages comes from search engine claims
Google = 8 billion, Yahoo = 20 billionLots of marketing hype
Number of unique web sites
Netcraft survey says 625 million sites
(http://news.netcraft.com/archives/web_server_survey.html)
Netcraft survey
http://news.netcraft.com/archives/web_server_survey.html
The web as a graph
Pages = nodes, hyperlinks = edges
Ignore contentDirected graph
High linkage
8-10 links/page on averagePower-law degree distribution