20 Web crawling and indexes

20.1 Overview


Web crawling is the process by which we gather pages from the Web, in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. In Chapter 19 we studied the complexities of the Web stemming from its creation by millions of uncoordinated individuals. In this chapter we study the resulting difficulties for crawling the Web. The focus of this chapter is the component shown in Figure 19.7 as web crawler; it is sometimes referred to as a spider.

The goal of this chapter is not to describe how to build the crawler for a full-scale commercial web search engine. We focus instead on a range of issues that are generic to crawling from the student project scale to substantial research projects. We begin (Section 20.1.1) by listing desiderata for web crawlers, and then discuss in Section 20.2 how each of these issues is addressed. The remainder of this chapter describes the architecture and some implementation details for a distributed web crawler that satisfies these features. Section 20.3 discusses distributing indexes across many machines for a web-scale implementation.

20.1.1 Features a crawler must provide

We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide.

Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.
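One common way to obtain this resilience is to bound how much work any single host can demand of the crawler. The following Python sketch illustrates the idea; it is not a design prescribed by this chapter. The class name TrapGuard, its method allow, and all of the thresholds are hypothetical choices made for illustration, and a real crawler would tune or replace these heuristics.

```python
from collections import Counter
from urllib.parse import urlsplit


class TrapGuard:
    """Heuristic filter that refuses URLs likely to come from a spider trap.

    All limits are illustrative assumptions, not recommended values.
    """

    def __init__(self, max_pages_per_host=1000, max_path_depth=12,
                 max_url_length=512, max_segment_repeats=3):
        self.max_pages_per_host = max_pages_per_host
        self.max_path_depth = max_path_depth
        self.max_url_length = max_url_length
        self.max_segment_repeats = max_segment_repeats
        self.pages_per_host = Counter()  # URLs admitted so far, per host

    def allow(self, url):
        """Return True if the URL may be fetched, and count it against its host."""
        # Traps often generate ever-longer URLs.
        if len(url) > self.max_url_length:
            return False
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if not host:
            return False
        segments = [s for s in parts.path.split("/") if s]
        # Very deep paths are a typical symptom of generated pages.
        if len(segments) > self.max_path_depth:
            return False
        # A path such as /a/b/a/b/a/b/a/b suggests a looping link generator.
        if segments and max(Counter(segments).values()) > self.max_segment_repeats:
            return False
        # Cap the total number of pages taken from any one host.
        if self.pages_per_host[host] >= self.max_pages_per_host:
            return False
        self.pages_per_host[host] += 1
        return True
```

A crawler would call allow on each URL before adding it to the set of pages to fetch. The per-host budget is the backstop: even if a trap slips past the URL-shape checks, it can consume only a bounded number of fetches. The cost is that a legitimately large site is also cut off once it reaches the budget.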
