
Example of part-of-speech tagging, from the LOB Corpus:

hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_, but_CC

not_XNOT when_WRB the_ATI guests_NNS have_HV to_TO sleep_VB

in_IN rows_NNS in_IN the_ATI cellar_NN !_!

the_ATI lovers_NNS ,_, whose_WP$ chief_JJB scene_NN was_BEDZ

cut_VBN at_IN the_ATI last_AP moment_NN ,_, had_HVD

comparatively_RB little_AP to_TO sing_VB

'_' he_PP3A stole_VBD my_PP$ wallet_NN !_! '_' roared_VBD

Rollinson_NP ._.

Key to tags

[Note that in this scheme, as in others for written text, punctuation marks count as `words', and are themselves given tags.]
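The word_TAG format shown above is simple to process mechanically. The following sketch (illustrative only, not part of the CLAWS software) reads a tagged line back into (word, tag) pairs; splitting on the last underscore keeps punctuation tokens such as ,_, intact:

```python
def read_tagged(text):
    """Split a line of word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits at the final underscore, so a token like
        # ",_," yields the word "," and the tag ","
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

line = "hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_, but_CC"
print(read_tagged(line))
# → [('hospitality', 'NN'), ('is', 'BEZ'), ('an', 'AT'),
#    ('excellent', 'JJ'), ('virtue', 'NN'), (',', ','), ('but', 'CC')]
```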

For more information on the CLAWS tagger, see Garside (1987) and Leech, Garside and Bryant (1994) (html version).

Grammatical parsing

Part-of-speech tagging is often seen as the first stage of a more comprehensive syntactic annotation, which assigns a phrase marker, or labelled bracketing, to each sentence of the corpus, in the manner of a phrase structure grammar. The resulting parsed corpora are known, for obvious reasons, as `treebanks'.

Grammatical annotation at Lancaster originated with Geoffrey Sampson's manual parsing of the Lancaster-Leeds Treebank. This scheme used a detailed system of labelled brackets, which distinguished between, for example, singular and plural noun phrases (Sampson, 1987a). The second phase was the scheme adopted for the Lancaster Parsed Corpus (initially tagged by the probabilistic parser developed in 1983-86 and subsequently corrected manually), which used a reduced set of constituents (Garside, Leech and Váradi, 1992). Currently, UCREL employs a technique known as skeleton parsing. This simplified grammatical analysis uses an even smaller set of grammatical categories. Texts are parsed by hand using a program called EPICS written by Roger Garside. EPICS speeds up the manual parsing process by storing the set of constituents which are open at a particular point in the text, and the human operator then closes these constituents or opens additional ones at appropriate points. EPICS aims at parsing with a minimum of key presses: at `full stretch' operators can parse sentences averaging more than 20 words in length at a rate of less than a minute per sentence (Leech and Garside, 1991).

Example of skeleton parsing, from the Spoken English Corpus

[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V] ._. S]

Key to Parsing Symbols. Square brackets enclose constituents above word level.

Wordtags are linked to their words by `_'. The tagset used here is the revised version of the earlier CLAWS tagset, known as the `CLAWS2 tagset'.

Unlabelled brackets indicate a constituent for which no label is provided by the annotation scheme. (Annotators are expressly allowed to identify constituents without having to identify the category to which they belong.)
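The labelled bracketing above can be read back into a tree with a short script. This is a rough sketch (not the EPICS program itself), assuming the three token shapes visible in the example: `[S' opens a constituent, `S]' closes one, a bare `[' or `]' marks an unlabelled constituent, and everything else is a word_TAG pair:

```python
import re

# Opens ("[S", "["), closes ("S]", "]"), and word_TAG tokens, in that
# order of preference at each position.
TOKEN = re.compile(r"\[[A-Za-z]*|[A-Za-z]*\]|[^\s\[\]]+")

def parse_skeleton(text):
    """Build a nested (label, children) tree from skeleton-parse notation."""
    root = ("ROOT", [])
    stack = [root]
    for tok in TOKEN.findall(text):
        if tok.startswith("["):
            node = (tok[1:], [])           # label may be "" (unlabelled bracket)
            stack[-1][1].append(node)
            stack.append(node)
        elif tok.endswith("]"):
            stack.pop()                    # close the innermost open constituent
        else:
            word, _, tag = tok.rpartition("_")
            stack[-1][1].append((word, tag))
    return root[1]

tree = parse_skeleton("[S[N Nemo_NP1 N][V has_VHZ arrived_VVN V] ._. S]")
print(tree)
# → [('S', [('N', [('Nemo', 'NP1')]),
#           ('V', [('has', 'VHZ'), ('arrived', 'VVN')]),
#           ('.', '.')])]
```

The stack mirrors what EPICS is described as storing: the set of constituents still open at the current point in the text.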

Semantic tagging

Beyond grammatical annotation, semantic annotation is an obvious next step. For example, semantic word-tagging can be designed with the limited (though ambitious enough) goal of distinguishing the lexicographic senses of the same word: a procedure also known as `sense resolution'.

The ACASD semantic tagging system (Wilson and Rayson, 1993) accepts as input text which has been tagged for part of speech using the CLAWS POS tagging system. The tagged text is fed into the main semantic analysis program (SEMTAG), which assigns semantic tags representing the general sense field of words from a lexicon of single words and an idiom list of multi-word combinations (e.g. as a rule), both of which are updated as new texts are analyzed. (Items not contained in the lexicon or idiom list are assigned a special tag, Z99, to assist in updating and manual postediting.)

The tags for each entry in the lexicon and idiom list are arranged in general rank-frequency order for the language. The text is manually pre-scanned to determine which semantic domains are dominant; the codes for these major domains are entered into a file called the `disam' file and are promoted to maximum frequency in the tag lists for each word where present. This combination of general frequency data and promotion by domain, together with heuristics for identifying auxiliary verbs, considerably reduces the mistagging of ambiguous words. (Further work will attempt to develop more sophisticated probabilistic methods for disambiguation.)

After automatic tag assignment has been carried out, manual postediting takes place, if desired, to ensure that each word and idiom carries the correct semantic classification (SEMEDIT). A program (MATRIX) then marks key lexical relations (e.g. negation, modifier + adjective, and adjective + noun combinations). The following is an example of semantic word-tagging, taken from the automatic content analysis project at Lancaster:
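The tag-assignment logic described above can be sketched in a few lines. The lexicon entries and domain codes below are hypothetical stand-ins (only the Z99 fallback tag is taken from the description; the real ACASD lexicon and tagset are far larger):

```python
# Hypothetical lexicon: each word maps to candidate semantic tags
# listed in general rank-frequency order for the language.
LEXICON = {
    "bank": ["I1", "W3"],   # illustrative codes, most frequent sense first
    "rule": ["S7", "Q2"],
}

def assign_tag(word, disam_domains):
    """Pick a semantic tag, promoting tags from the text's dominant domains.

    disam_domains plays the role of the `disam' file: a set of domain
    prefixes judged dominant for this text during manual pre-scanning.
    """
    tags = LEXICON.get(word.lower())
    if tags is None:
        return "Z99"        # not in lexicon: special tag for postediting
    # Stable sort moves domain-promoted tags to the front while keeping
    # the general frequency order among the rest.
    promoted = sorted(tags, key=lambda t: 0 if t[0] in disam_domains else 1)
    return promoted[0]

print(assign_tag("bank", set()))     # → I1  (general frequency wins)
print(assign_tag("bank", {"W"}))     # → W3  (domain promotion)
print(assign_tag("xyzzy", set()))    # → Z99 (unknown word)
```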
