Добавил:
Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
Bioinformatics_lectures / lecture8.pptx
Скачиваний:
7
Добавлен:
21.02.2016
Размер:
369.61 Кб
Скачать

Examples

A big objection to TIA (Total Information Awareness) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents’ privacy.

The Rhine Paradox: a great example of how not to conduct scientific research.

11

Rhine Paradox --- (1)

David Rhine was a parapsychologist in the 1950’s who hypothesized that some people had Extra-Sensory Perception.

He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.

He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!

12

Rhine Paradox --- (2)

He told these people they had ESP and called them in for another test of the same type.

Alas, he discovered that almost all of them had lost their ESP.

What did he conclude?Answer on next slide.

13

Rhine Paradox --- (3)

He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.

14

What is Web Mining?

Discovering useful information from the World-Wide Web and its usage patterns

Applications

Web search e.g., Google, Yahoo,

Vertical Search e.g., FatLens, Become,…

Recommendations e.g., Amazon.com

Advertising e.g., Google, Yahoo

How does it differ from “classical” Data Mining?

The web is not a relation

Textual information and linkage structure

Usage data is huge and growing rapidly

Google’s usage logs are bigger than their web crawl

Data generated per day is comparable to largest conventional data warehouses

Ability to react in real-time to usage patterns

The World-Wide Web

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Huge

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Distributed content

 

 

 

 

 

 

 

 

 

 

 

 

 

 

creation, linking (no

 

 

 

 

 

 

 

 

 

 

 

 

 

 

coordination)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Structured databases,

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

unstructured text,

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

semistructured

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Content includes truth, lies,

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

obsolete information,

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

contradictions, …

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Our modern-day Library of

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

The Web

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Alexandria

Size of the Web

Number of pagesTechnically, infinite

Because of dynamically generated contentLots of duplication (30-40%)

Best estimate of “unique” static HTML pages comes from search engine claims

Google = 8 billion, Yahoo = 20 billionLots of marketing hype

Number of unique web sites

Netcraft survey says 625 million sites

(http://news.netcraft.com/archives/web_server_survey.html)

Netcraft survey

http://news.netcraft.com/archives/web_server_survey.html

The web as a graph

Pages = nodes, hyperlinks = edges

Ignore contentDirected graph

High linkage

8-10 links/page on averagePower-law degree distribution

Соседние файлы в папке Bioinformatics_lectures