- •List of Tables
- •List of Figures
- •Table of Notation
- •Preface
- •Boolean retrieval
- •An example information retrieval problem
- •Processing Boolean queries
- •The extended Boolean model versus ranked retrieval
- •References and further reading
- •The term vocabulary and postings lists
- •Document delineation and character sequence decoding
- •Obtaining the character sequence in a document
- •Choosing a document unit
- •Determining the vocabulary of terms
- •Tokenization
- •Dropping common terms: stop words
- •Normalization (equivalence classing of terms)
- •Stemming and lemmatization
- •Faster postings list intersection via skip pointers
- •Positional postings and phrase queries
- •Biword indexes
- •Positional indexes
- •Combination schemes
- •References and further reading
- •Dictionaries and tolerant retrieval
- •Search structures for dictionaries
- •Wildcard queries
- •General wildcard queries
- •Spelling correction
- •Implementing spelling correction
- •Forms of spelling correction
- •Edit distance
- •Context sensitive spelling correction
- •Phonetic correction
- •References and further reading
- •Index construction
- •Hardware basics
- •Blocked sort-based indexing
- •Single-pass in-memory indexing
- •Distributed indexing
- •Dynamic indexing
- •Other types of indexes
- •References and further reading
- •Index compression
- •Statistical properties of terms in information retrieval
- •Dictionary compression
- •Dictionary as a string
- •Blocked storage
- •Variable byte codes
- •References and further reading
- •Scoring, term weighting and the vector space model
- •Parametric and zone indexes
- •Weighted zone scoring
- •Learning weights
- •The optimal weight g
- •Term frequency and weighting
- •Inverse document frequency
- •The vector space model for scoring
- •Dot products
- •Queries as vectors
- •Computing vector scores
- •Sublinear tf scaling
- •Maximum tf normalization
- •Document and query weighting schemes
- •Pivoted normalized document length
- •References and further reading
- •Computing scores in a complete search system
- •Index elimination
- •Champion lists
- •Static quality scores and ordering
- •Impact ordering
- •Cluster pruning
- •Components of an information retrieval system
- •Tiered indexes
- •Designing parsing and scoring functions
- •Putting it all together
- •Vector space scoring and query operator interaction
- •References and further reading
- •Evaluation in information retrieval
- •Information retrieval system evaluation
- •Standard test collections
- •Evaluation of unranked retrieval sets
- •Evaluation of ranked retrieval results
- •Assessing relevance
- •A broader perspective: System quality and user utility
- •System issues
- •User utility
- •Results snippets
- •References and further reading
- •Relevance feedback and query expansion
- •Relevance feedback and pseudo relevance feedback
- •The Rocchio algorithm for relevance feedback
- •Probabilistic relevance feedback
- •When does relevance feedback work?
- •Relevance feedback on the web
- •Evaluation of relevance feedback strategies
- •Pseudo relevance feedback
- •Indirect relevance feedback
- •Summary
- •Global methods for query reformulation
- •Vocabulary tools for query reformulation
- •Query expansion
- •Automatic thesaurus generation
- •References and further reading
- •XML retrieval
- •Basic XML concepts
- •Challenges in XML retrieval
- •A vector space model for XML retrieval
- •Evaluation of XML retrieval
- •References and further reading
- •Exercises
- •Probabilistic information retrieval
- •Review of basic probability theory
- •The Probability Ranking Principle
- •The 1/0 loss case
- •The PRP with retrieval costs
- •The Binary Independence Model
- •Deriving a ranking function for query terms
- •Probability estimates in theory
- •Probability estimates in practice
- •Probabilistic approaches to relevance feedback
- •An appraisal and some extensions
- •An appraisal of probabilistic models
- •Bayesian network approaches to IR
- •References and further reading
- •Language models for information retrieval
- •Language models
- •Finite automata and language models
- •Types of language models
- •Multinomial distributions over words
- •The query likelihood model
- •Using query likelihood language models in IR
- •Estimating the query generation probability
- •Language modeling versus other approaches in IR
- •Extended language modeling approaches
- •References and further reading
- •Relation to multinomial unigram language model
- •The Bernoulli model
- •Properties of Naive Bayes
- •A variant of the multinomial model
- •Feature selection
- •Mutual information
- •Comparison of feature selection methods
- •References and further reading
- •Document representations and measures of relatedness in vector spaces
- •k nearest neighbor
- •Time complexity and optimality of kNN
- •The bias-variance tradeoff
- •References and further reading
- •Exercises
- •Support vector machines and machine learning on documents
- •Support vector machines: The linearly separable case
- •Extensions to the SVM model
- •Multiclass SVMs
- •Nonlinear SVMs
- •Experimental results
- •Machine learning methods in ad hoc information retrieval
- •Result ranking by machine learning
- •References and further reading
- •Flat clustering
- •Clustering in information retrieval
- •Problem statement
- •Evaluation of clustering
- •Cluster cardinality in K-means
- •Model-based clustering
- •References and further reading
- •Exercises
- •Hierarchical clustering
- •Hierarchical agglomerative clustering
- •Time complexity of HAC
- •Group-average agglomerative clustering
- •Centroid clustering
- •Optimality of HAC
- •Divisive clustering
- •Cluster labeling
- •Implementation notes
- •References and further reading
- •Exercises
- •Matrix decompositions and latent semantic indexing
- •Linear algebra review
- •Matrix decompositions
- •Term-document matrices and singular value decompositions
- •Low-rank approximations
- •Latent semantic indexing
- •References and further reading
- •Web search basics
- •Background and history
- •Web characteristics
- •The web graph
- •Spam
- •Advertising as the economic model
- •The search user experience
- •User query needs
- •Index size and estimation
- •Near-duplicates and shingling
- •References and further reading
- •Web crawling and indexes
- •Overview
- •Crawling
- •Crawler architecture
- •DNS resolution
- •The URL frontier
- •Distributing indexes
- •Connectivity servers
- •References and further reading
- •Link analysis
- •The Web as a graph
- •Anchor text and the web graph
- •PageRank
- •Markov chains
- •The PageRank computation
- •Hubs and Authorities
- •Choosing the subset of the Web
- •References and further reading
- •Bibliography
- •Author Index
13.7 References and further reading
General introductions to statistical classification and machine learning can be found in (Hastie et al. 2001), (Mitchell 1997), and (Duda et al. 2000), including many important methods (e.g., decision trees and boosting) that we do not cover. A comprehensive review of text classification methods and results is (Sebastiani 2002). Manning and Schütze (1999, Chapter 16) give an accessible introduction to text classification with coverage of decision trees, perceptrons and maximum entropy models. More information on the superlinear time complexity of learning methods that are more accurate than Naive Bayes can be found in (Perkins et al. 2003) and (Joachims 2006a).
Maron and Kuhns (1960) described one of the first NB text classifiers. Lewis (1998) focuses on the history of NB classification. Bernoulli and multinomial models and their accuracy for different collections are discussed by McCallum and Nigam (1998). Eyheramendy et al. (2003) present additional NB models. Domingos and Pazzani (1997), Friedman (1997), and Hand and Yu (2001) analyze why NB performs well although its probability estimates are poor. The first paper also discusses NB’s optimality when the independence assumptions are true of the data. Pavlov et al. (2004) propose a modified document representation that partially addresses the inappropriateness of the independence assumptions. Bennett (2000) attributes the tendency of NB probability estimates to be close to either 0 or 1 to the effect of document length. Ng and Jordan (2001) show that NB is sometimes (although rarely) superior to discriminative methods because it more quickly reaches its optimal error rate. The basic NB model presented in this chapter can be tuned for better effectiveness (Rennie et al. 2003; Kołcz and Yih 2007). The problem of concept drift and other reasons why state-of-the-art classifiers do not always excel in practice are discussed by Forman (2006) and Hand (2006).
Early uses of mutual information and χ² for feature selection in text classification are Lewis and Ringuette (1994) and Schütze et al. (1995), respectively. Yang and Pedersen (1997) review feature selection methods and their impact on classification effectiveness. They find that pointwise mutual information is not competitive with other methods. Yang and Pedersen refer to expected mutual information (Equation (13.16)) as information gain (see Exercise 13.13, page 285). (Snedecor and Cochran 1989) is a good reference for the χ² test in statistics, including the Yates’ correction for continuity for 2 × 2 tables. Dunning (1993) discusses problems of the χ² test when counts are small. Nongreedy feature selection techniques are described by Hastie et al. (2001). Cohen (1995) discusses the pitfalls of using multiple significance tests and methods to avoid them. Forman (2004) evaluates different methods for feature selection for multiple classifiers.
David D. Lewis defines the ModApte split at www.daviddlewis.com/resources/testcollections/reuters21 based on Apté et al. (1994). Lewis (1995) describes utility measures for the evaluation of text classification systems. Yang and Liu (1999) employ significance tests in the evaluation of text classification methods.
Lewis et al. (2004) find that SVMs (Chapter 15) perform better on Reuters-RCV1 than kNN and Rocchio (Chapter 14).
14 Vector space classification
The document representation in Naive Bayes is a sequence of terms or a binary vector ⟨e_1, . . . , e_|V|⟩ ∈ {0, 1}^|V|. In this chapter we adopt a different representation for text classification, the vector space model, developed in Chapter 6. It represents each document as a vector with one real-valued component, usually a tf-idf weight, for each term. Thus, the document space X, the domain of the classification function γ, is R^|V|. This chapter introduces a number of classification methods that operate on real-valued vectors.
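To make this concrete, here is a minimal Python sketch that maps a small tokenized collection to sparse tf-idf vectors, i.e., points in R^|V| with one component per term. The raw-tf × log10(N/df) weighting, the toy documents, and the function name are illustrative assumptions of this sketch, not the precise weighting schemes of Chapter 6.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse vector (dict of term -> tf-idf weight).
    Sketch only: raw term frequency times log10(N/df) inverse document frequency."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log10(N / df[t]) for t in tf})
    return vectors

docs = [["chinese", "beijing", "chinese"],
        ["london", "british", "queen"],
        ["chinese", "mao"]]
print(tfidf_vectors(docs)[0])  # non-zero components of the first document vector
```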
The basic hypothesis in using the vector space model for classification is the contiguity hypothesis.
Contiguity hypothesis. Documents in the same class form a contiguous region and regions of different classes do not overlap.
There are many classification tasks, in particular the type of text classification that we encountered in Chapter 13, where classes can be distinguished by word patterns. For example, documents in the class China tend to have high values on dimensions like Chinese, Beijing, and Mao whereas documents in the class UK tend to have high values for London, British and Queen. Documents of the two classes therefore form distinct contiguous regions as shown in Figure 14.1 and we can draw boundaries that separate them and classify new documents. How exactly this is done is the topic of this chapter.
Whether or not a set of documents is mapped into a contiguous region depends on the particular choices we make for the document representation: type of weighting, stop list etc. To see that the document representation is crucial, consider the two classes written by a group vs. written by a single person. Frequent occurrence of the first person pronoun I is evidence for the single-person class. But that information is likely deleted from the document representation if we use a stop list. If the document representation chosen is unfavorable, the contiguity hypothesis will not hold and successful vector space classification is not possible.
Figure 14.1 Vector space classification into three classes (UK, China, Kenya).

The same considerations that led us to prefer weighted representations, in particular length-normalized tf-idf representations, in Chapters 6 and 7 also apply here. For example, a term with 5 occurrences in a document should get a higher weight than a term with one occurrence, but a weight 5 times larger would give too much emphasis to the term. Unweighted and unnormalized counts should not be used in vector space classification.
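One standard way to damp raw counts is the sublinear tf scaling discussed in Chapter 6, which replaces tf by 1 + log tf for tf > 0. The snippet below (log base 10 is an assumption of this sketch) shows that five occurrences then receive roughly 1.7 times the weight of a single occurrence rather than five times.

```python
import math

def wf(tf):
    """Sublinear tf scaling: 1 + log(tf) for tf > 0, and 0 otherwise."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

print(wf(1), wf(5))  # 1.0 vs. ~1.70: five occurrences weigh more, but not five times more
```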
We introduce two vector space classification methods in this chapter, Rocchio and kNN. Rocchio classification (Section 14.2) divides the vector space into regions centered on centroids or prototypes, one for each class, computed as the center of mass of all documents in the class. Rocchio classification is simple and efficient, but inaccurate if classes are not approximately spheres with similar radii.
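As a rough illustration of this centroid rule, the sketch below trains a Rocchio classifier on toy two-dimensional document vectors and assigns a test vector to the class with the nearest centroid. The Euclidean distance, the toy data, and the helper names are assumptions of the sketch; Section 14.2 develops the method properly.

```python
import math
from collections import defaultdict

def train_rocchio(labeled_vectors):
    """Compute one centroid (prototype) per class: the mean of its document vectors."""
    by_class = defaultdict(list)
    for label, vec in labeled_vectors:
        by_class[label].append(vec)
    return {label: [sum(xs) / len(vecs) for xs in zip(*vecs)]
            for label, vecs in by_class.items()}

def classify_rocchio(centroids, vec):
    """Assign vec to the class whose centroid is closest (Euclidean distance here)."""
    return min(centroids, key=lambda c: math.dist(centroids[c], vec))

# Toy vectors: weights on the two terms "chinese" and "london".
train = [("China", [0.9, 0.1]), ("China", [0.8, 0.0]),
         ("UK",    [0.1, 0.9]), ("UK",    [0.0, 0.7])]
centroids = train_rocchio(train)
print(classify_rocchio(centroids, [0.7, 0.2]))  # -> "China"
```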
kNN or k nearest neighbor classification (Section 14.3) assigns the majority class of the k nearest neighbors to a test document. kNN requires no explicit training and can use the unprocessed training set directly in classification. It is less efficient than other classification methods in classifying documents. If the training set is large, then kNN can handle non-spherical and other complex classes better than Rocchio.
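A correspondingly small sketch of the kNN rule is given below; cosine similarity as the proximity measure, the choice k = 3, and the tie handling of Counter.most_common are choices of this sketch rather than prescriptions of Section 14.3.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def classify_knn(train, vec, k=3):
    """Return the majority class among the k training vectors most similar to vec."""
    neighbors = sorted(train, key=lambda item: cosine(item[1], vec), reverse=True)[:k]
    return Counter(label for label, _ in neighbors).most_common(1)[0][0]

train = [("China", [0.9, 0.1]), ("China", [0.8, 0.0]),
         ("UK",    [0.1, 0.9]), ("UK",    [0.0, 0.7])]
print(classify_knn(train, [0.7, 0.2], k=3))  # -> "China"
```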
A large number of text classifiers can be viewed as linear classifiers – classifiers that classify based on a simple linear combination of the features (Section 14.4). Such classifiers partition the space of features into regions separated by linear decision hyperplanes, in a manner to be detailed below. Because of the bias-variance tradeoff (Section 14.6), more complex nonlinear models are not systematically better than linear models.
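The decision rule of a two-class linear classifier is itself tiny, as the sketch below shows; the weight vector, threshold, and 0/1 class encoding are purely illustrative, and how such weights are learned is the subject of Section 14.4 and Chapter 15.

```python
def linear_classify(w, b, x):
    """Two-class linear classifier: return 1 if the linear combination w . x
    exceeds the threshold b, i.e., x lies on one side of the hyperplane w . x = b."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > b else 0

# Toy weights: positive weight on "chinese", negative weight on "london", threshold 0.
print(linear_classify([1.0, -1.0], 0.0, [0.9, 0.1]))  # -> 1
print(linear_classify([1.0, -1.0], 0.0, [0.1, 0.9]))  # -> 0
```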
