Corpus annotation

UCREL has expertise in applying the following kinds of annotation to corpora:

  • Part-of-speech (POS) tagging

  • Grammatical parsing

  • Semantic tagging

  • Anaphoric annotation

  • Prosodic annotation

Part-of-speech (POS) tagging

Part-of-speech (POS) tagging, also called grammatical tagging, is the commonest form of corpus annotation, and was the first form of annotation to be developed at Lancaster. Our POS tagging software, CLAWS (the Constituent Likelihood Automatic Word-tagging System), has been continuously developed since the early 1980s.

Although the structure of CLAWS has seen some changes since the first version was produced, it still consists of three stages: pre-edit, automatic tag assignment, and manual post-edit. In the pre-edit stage, the machine-readable text is automatically converted to a format suitable for the tagging program. The text is then passed to the tagging program, which assigns a part-of-speech tag to each word or word combination in the text.

To assist it in this task, CLAWS has a lexicon of words with their possible parts of speech, and a further list of multi-word syntactic idioms (e.g. the subordinator "in that"). These databases are constantly updated as new texts are analyzed. To deal with words that are not in these databases, CLAWS uses various heuristics, including a list of common word suffixes with their possible parts of speech.

One orthographic form may have several possible parts of speech (e.g. "love" can be a verb or a noun). After the initial assignment of possible parts of speech to the words in the text, CLAWS therefore disambiguates them using a probability matrix derived from large bodies of tagged and manually corrected texts. The matrix specifies transition probabilities between adjacent tags: for example, given that x is an adjective, what is the probability that the item to its immediate right is a noun? Again, these probabilities are constantly updated from new data. CLAWS tracks through each sentence in turn, applying these probabilities.

Finally, if desired, the machine output may be fully corrected by manual post-editing with a special tag editor. The CLAWS system enjoys a success rate in the region of 96-97% on written texts, and is also successful, though to a slightly lesser degree, on spoken texts.
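The automatic tag assignment stage described above can be illustrated with a small Python sketch. It is not the CLAWS implementation: the lexicon, suffix list, tag names and transition probabilities are all invented for illustration, the multi-word idiom list is omitted, and the search over tag sequences uses the standard Viterbi algorithm as a stand-in for however CLAWS itself tracks through a sentence.

```python
from math import log

# Toy lexicon: word -> possible tags (hypothetical, not the CLAWS lexicon).
LEXICON = {
    "the": ["DET"],
    "they": ["PRON"],
    "old": ["ADJ"],
    "songs": ["NOUN"],
    "love": ["NOUN", "VERB"],  # ambiguous orthographic form
}

# Toy suffix list for words missing from the lexicon (hypothetical).
# Longer suffixes come first so "ness" is tried before "s".
SUFFIXES = [
    ("ness", ["NOUN"]),
    ("ing", ["VERB", "NOUN", "ADJ"]),
    ("ly", ["ADV"]),
    ("s", ["NOUN", "VERB"]),
]
DEFAULT_TAGS = ["NOUN", "VERB", "ADJ"]  # last-resort guess

# Invented transition probabilities P(next tag | previous tag).
TRANS = {
    ("START", "DET"): 0.4, ("START", "PRON"): 0.3,
    ("START", "NOUN"): 0.2, ("START", "VERB"): 0.05,
    ("DET", "NOUN"): 0.6, ("DET", "ADJ"): 0.3, ("DET", "VERB"): 0.01,
    ("ADJ", "NOUN"): 0.7, ("ADJ", "ADJ"): 0.1,
    ("PRON", "VERB"): 0.7, ("PRON", "NOUN"): 0.02,
    ("VERB", "DET"): 0.4, ("VERB", "NOUN"): 0.2, ("VERB", "ADJ"): 0.1,
    ("NOUN", "VERB"): 0.3, ("NOUN", "NOUN"): 0.1, ("NOUN", "DET"): 0.05,
}
FLOOR = 1e-4  # probability assumed for transitions missing from the matrix


def candidate_tags(word):
    """Return the possible tags for a word: lexicon first, then suffixes."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    for suffix, tags in SUFFIXES:
        if w.endswith(suffix) and len(w) > len(suffix):
            return tags
    return DEFAULT_TAGS


def disambiguate(words):
    """Pick the tag sequence with the highest product of transition
    probabilities, using Viterbi search over the candidate tags."""
    # best maps the tag of the previous word to (log probability, path so far).
    best = {"START": (0.0, [])}
    for word in words:
        new_best = {}
        for tag in candidate_tags(word):
            score, path = max(
                ((lp + log(TRANS.get((prev, tag), FLOOR)), p)
                 for prev, (lp, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


print(disambiguate("they love the old songs".split()))
# ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']
print(disambiguate("the love".split()))
# ['DET', 'NOUN']
```

With these invented numbers, "love" is resolved as a verb after a pronoun and as a noun after a determiner, which is the kind of decision the transition matrix is meant to support. A fuller model would typically also weight each candidate tag by how likely it is for the word itself; the sketch uses transitions only because those are what the description above mentions.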

Several tagsets have been used in CLAWS over the years. The CLAWS1 tagset has 132 basic word tags, many of them identical in form and application to Brown Corpus tags. A revision of CLAWS at Lancaster in 1983-86 resulted in a new, much revised tagset of 166 word tags, known as the "CLAWS2 tagset". The tagset for the British National Corpus (the C5 tagset) has just over 60 tags. This tagset was kept small because it was designed for handling much larger quantities of data than had been dealt with up to that point (see Leech, Garside and Bryant, 1994). For the BNC Sampler corpus, the enriched C6 tagset, which has over 160 tags, was used. The following is an example of CLAWS analysis using the CLAWS1 tagset:
