7. Computer Techniques

Alongside the "routine" use of computers in such areas as numerical counting, statistical analysis and pattern matching, Linguistics provides a range of opportunities for the manipulation of non-numerical data, using natural language texts. Such tasks include indexing and concordancing, speech recognition and synthesis, machine translation, and language learning.
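One of these tasks, concordancing, can be sketched in a few lines of Python. The function below is a hypothetical, minimal helper (not taken from any particular toolkit): it produces keyword-in-context (KWIC) lines, the standard display format of concordancers, for every occurrence of a search word.

```python
def concordance(text, keyword, width=30):
    """Return KWIC (keyword-in-context) lines for each occurrence of keyword.

    Matching is case-insensitive and ignores trailing punctuation, so
    "Corpus" and "corpus," both count as hits for the keyword "corpus".
    """
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip('.,;:!?') == keyword.lower():
            left = ' '.join(tokens[max(0, i - 5):i])    # up to 5 words of left context
            right = ' '.join(tokens[i + 1:i + 6])       # up to 5 words of right context
            lines.append(f"{left[-width:]:>{width}} [{tok}] {right[:width]}")
    return lines

# Illustrative sample text, not drawn from any real corpus.
sample = ("Corpus linguistics studies language through corpora. "
          "A corpus is a sample of real language.")
for line in concordance(sample, "corpus"):
    print(line)
```

A real concordancer would work over a tokenised, possibly annotated corpus of millions of words and use an index rather than a linear scan, but the output format is the same: the keyword aligned in a fixed column with its immediate context on either side.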

Since the 1980s the chief focus of computational linguistic research has been in the area known as natural language processing (NLP). Here the aim is to devise techniques which will automatically analyze large quantities of spoken (transcribed) or written text in ways broadly parallel to what happens when humans carry out this task. NLP deals with the computational processing of text - both its understanding and its generation in natural human languages. It thus forms a major part of the domain of Computational Linguistics; but it is not to be identified with it, as computers can also be used for many other purposes in Linguistics, such as the processing of statistical data in authorship studies.

The field of NLP emerged out of machine translation in the 1950s and was later much influenced by work on artificial intelligence. There was a focus on devising "intelligent programs" (or "expert systems") which aimed to simulate aspects of human behaviour, such as the way people can infer meaning from what has been said, or use their knowledge of the world to reach a conclusion.

Most recently, particular attention has been paid to the nature of discourse (in the sense of text beyond the sentence), and researchers have begun to confront the vast size of the lexicon, using the large amounts of lexical data now available in machine-readable form from commercial dictionary projects.

Progress has been considerable, but successful programs are still experimental in character, largely dealing with restricted tasks in well-defined settings. There is still a long way to go before computer systems can get anywhere near the flexible and creative world of real conversation, with its often figurative expression and ill-formed constructions.

8. Corpora. The method of corpus analysis is the most modern method of obtaining linguistic data.

A corpus – a large collection of written and spoken texts – is a representative sample compiled for the purpose of linguistic analysis. A corpus enables the linguist to make objective statements about frequency of usage, and it provides accessible data for use by different researchers. Its range and size are variable. Some corpora attempt to cover the language as a whole, taking extracts from many kinds of text; others are extremely selective, providing a collection of material that deals only with a particular linguistic feature. The size of a corpus depends on practical factors such as the time available to collect, process, and store the data: it can take up to several hours to produce an accurate transcription of a few minutes of speech. Sometimes a small sample of data will be enough to test a linguistic hypothesis; corpora in major research projects can total millions of running words.
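The objective frequency statements mentioned above reduce, at bottom, to counting word forms in a machine-readable text. A minimal sketch in Python follows; the sample sentence and the function name are illustrative assumptions, not part of any actual corpus project.

```python
import re
from collections import Counter

def word_frequencies(text, top=5):
    """Count word-form frequencies in a (tiny) corpus sample, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenisation
    return Counter(words).most_common(top)

# Illustrative mini-corpus.
sample = ("The corpus enables the linguist to make objective statements "
          "about frequency of usage in the corpus.")
print(word_frequencies(sample))
```

On a real corpus of millions of running words the same counting logic applies, though serious projects also lemmatise (grouping "corpus" and "corpora" under one headword) and normalise frequencies per million words so that corpora of different sizes can be compared.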

A standard corpus is a large collection of data available for use by many researchers. In English Linguistics all computer corpora are in machine-readable form and are thus available anywhere in the world.

Theme 2

LECTURE # 2. The method of Corpus Analysis. The British National Corpus

1. General notes: benefits, purpose, definitions.

2. Design of the corpus.

3. Selection features.

4. Design of the spoken component.

5. The searching procedure.

Quantitative methods have long been applied in linguistic research and are now in wide use. The statistical data obtained enable the researcher to draw solid conclusions. Nowadays, computer access to large amounts of linguistic evidence helps to avoid complicated, time- and effort-consuming formula-based calculations. The corpus method is the case in point.

A corpus is a large collection of computer-readable texts.

Corpus Linguistics is the study of language that includes all processes related to the processing, use and analysis of written or spoken machine-readable corpora. Corpus linguistics is a relatively modern term referring to a methodology based on examples of 'real-life' language use. At present, the effectiveness and usefulness of corpus linguistics are closely tied to the development of computer science. Major corpora include:

The Bank of English – 524 million words (COBUILD dictionaries are based on it).

The Corpus of Contemporary American English (COCA) – 450 million words (1990-2012).

The Longman Written American Corpus – a dynamic corpus of 100 million words comprising running text from newspapers, journals, magazines, best-selling novels, technical and scientific writing, and coffee-table books.

The Longman Spoken American Corpus – a unique resource of 5 million words of everyday American speech.

The British National Corpus – 100 million words.

The Czech Corpus – focuses mainly on written Czech; over 100 million words.

The International Netherlands Language Corpus – 38 million words.

The International Netherlands Language Newspaper Corpus – 27 million words.

The Portuguese Corpus – 45 million words.

The Oslo Corpus of Bosnian Texts – 1.5 million words.
