2. Design of the corpus

There is broad consensus among the participants in the project and among corpus linguists that a general-purpose corpus of the English language would ideally contain a high proportion of spoken language in relation to written texts. However, it is significantly more expensive to record and transcribe natural speech than to acquire written text in computer-readable form. Consequently, the spoken component of the BNC constitutes approximately 10 per cent (10 million words) of the total and the written component 90 per cent (90 million words). These were agreed to be realistic targets, given the constraints of time and budget, yet large enough to yield valuable empirical statistical data about spoken English.

The BNC World Edition contains 4,054 texts and occupies 1,508,392 Kbytes, or about 1.5 GB. In total, it comprises just over 100 million orthographic words (100,467,090), but the number of w-units is slightly lower: 97,619,934. The total number of s-units is just over 6 million (6,053,093).

• S-units (segment units): the number of <s> elements, more or less equivalent to sentences.

• W-units: the number of <w> elements, more or less equivalent to words.
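Because both measures are simply element counts, they can be reproduced on any marked-up text. The following is a minimal sketch, assuming a well-formed XML fragment; the sample markup is hand-written for illustration and is not taken from the corpus, and the `<c>` element used here for punctuation, the `count_units` helper and the `SAMPLE` string are likewise assumptions of this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written fragment in the spirit of the markup
# described above; it is not copied from the BNC.
SAMPLE = """
<text>
  <s n="1"><w>It</w> <w>is</w> <w>raining</w><c>.</c></s>
  <s n="2"><w>We</w> <w>stayed</w> <w>in</w><c>.</c></s>
</text>
"""


def count_units(xml_text):
    """Return (number of <s> elements, number of <w> elements)."""
    root = ET.fromstring(xml_text)
    s_units = sum(1 for _ in root.iter("s"))
    w_units = sum(1 for _ in root.iter("w"))
    return s_units, w_units


s_units, w_units = count_units(SAMPLE)
print(f"s-units: {s_units}, w-units: {w_units}")
```

Punctuation is kept outside the `<w>` elements in this sketch, so it does not contribute to the w-unit count.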

Percentages are calculated with reference to the relevant portion of the corpus; in the table for "written text domain", for example, they are calculated against the total number of written texts. These reference totals are given in the first table below.

Table 1. Composition of the BNC World Edition

| Text type | Texts | W-units | % | S-units | % |
|---|---|---|---|---|---|
| Spoken demographic | 153 | 4206058 | 4.30 | 610563 | 10.08 |
| Spoken context-governed | 757 | 6135671 | 6.28 | 428558 | 7.07 |
| All Spoken | 910 | 10341729 | 10.58 | 1039121 | 17.78 |
| Written books and periodicals | 2688 | 78580018 | 80.49 | 4403803 | 72.75 |
| Written-to-be-spoken | 35 | 1324480 | 1.35 | 120153 | 1.98 |
| Written miscellaneous | 421 | 7373707 | 7.55 | 490016 | 8.09 |
| All Written | 3144 | 87278205 | 89.39 | 5013972 | 82.82 |
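The percentage columns in Table 1 can be recomputed from the raw counts, which is a useful consistency check when copying figures from the table. The sketch below assumes that each percentage is the row count divided by the corpus-wide total of w-units or s-units quoted earlier; the names `ROWS` and `share` are introduced here purely for illustration.

```python
# Corpus-wide totals as quoted in the text above.
TOTAL_W_UNITS = 97_619_934
TOTAL_S_UNITS = 6_053_093

# (w-units, s-units) for each component row of Table 1.
ROWS = {
    "Spoken demographic": (4_206_058, 610_563),
    "Spoken context-governed": (6_135_671, 428_558),
    "Written books and periodicals": (78_580_018, 4_403_803),
    "Written-to-be-spoken": (1_324_480, 120_153),
    "Written miscellaneous": (7_373_707, 490_016),
}


def share(count, total):
    """Return count as a percentage of total (unrounded)."""
    return 100.0 * count / total


for name, (w, s) in ROWS.items():
    print(f"{name:30s} {share(w, TOTAL_W_UNITS):6.2f} {share(s, TOTAL_S_UNITS):6.2f}")

# The component rows should add up to the corpus totals.
w_sum = sum(w for w, _ in ROWS.values())
s_sum = sum(s for _, s in ROWS.values())
print(w_sum == TOTAL_W_UNITS, s_sum == TOTAL_S_UNITS)

# Spoken share of s-units, from the two spoken rows.
spoken_s = ROWS["Spoken demographic"][1] + ROWS["Spoken context-governed"][1]
print(f"All Spoken s-unit share: {share(spoken_s, TOTAL_S_UNITS):.2f}")
```

Differences in the last digit are to be expected, since the printed values appear to be truncated rather than rounded. Recomputed this way, the spoken share of s-units comes to about 17.17 per cent, which agrees with the two spoken rows (10.08 and 7.07) but not with the 17.78 printed for All Spoken, so that figure should be treated with caution.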

All texts are also classified according to their date of production. For spoken texts, this is the date of the recording. For written texts, it is for the most part the date of production of the material actually transcribed; for imaginative works, however, the date of first publication was used. Informative texts were selected only from 1975 onwards and imaginative ones from 1960, reflecting their longer “shelf-life”, though most (75 per cent) of the latter were published no earlier than 1975.
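The dating rule just described can be written down as a small decision function. This is only an illustrative sketch: the `TextRecord` fields and the helper `classification_year` are hypothetical names, not part of any corpus tool, and the rule is stated as holding "for the most part", so exceptions are not modelled.

```python
from dataclasses import dataclass
from typing import Optional

# Earliest selection years stated in the text above.
EARLIEST_INFORMATIVE = 1975
EARLIEST_IMAGINATIVE = 1960


@dataclass
class TextRecord:
    """Hypothetical metadata record; field names are assumptions."""
    mode: str                                   # "spoken" or "written"
    category: Optional[str] = None              # "imaginative" or "informative"
    recording_year: Optional[int] = None        # spoken texts
    edition_year: Optional[int] = None          # edition actually transcribed
    first_publication_year: Optional[int] = None


def classification_year(text: TextRecord) -> Optional[int]:
    """Pick the year used to classify a text, or None if unknown."""
    if text.mode == "spoken":
        return text.recording_year
    if text.category == "imaginative":
        return text.first_publication_year
    return text.edition_year


novel = TextRecord("written", "imaginative",
                   edition_year=1988, first_publication_year=1962)
print(classification_year(novel))
```

For the hypothetical novel above, the function returns the year of first publication rather than the year of the later edition that was transcribed.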

Table 2. Date of production

| Creation date | Texts | W-units | % | S-units | % |
|---|---|---|---|---|---|
| Unknown | 162 | 1814051 | 1.85 | 127132 | 2.10 |
| Before 1974 | 47 | 1741624 | 1.78 | 121323 | 2.00 |
| 1974 to 1983 | 156 | 4621950 | 4.73 | 255057 | 4.21 |
| 1984 to 1994 | 3689 | 89442309 | 91.62 | 5549581 | 91.68 |
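As a rough illustration of how the date bands in Table 2 partition the corpus, the sketch below assigns a year to a band and checks that the four rows add up to the corpus totals given earlier. The function `date_band` is a hypothetical helper, and treating the band limits as inclusive is an assumption based only on the row labels.

```python
from typing import Optional


def date_band(year: Optional[int]) -> str:
    """Map a classification year to a Table 2 row label."""
    if year is None:
        return "Unknown"
    if year < 1974:
        return "Before 1974"
    if year <= 1983:
        return "1974 to 1983"
    if year <= 1994:
        return "1984 to 1994"
    raise ValueError(f"year {year} falls outside the bands in Table 2")


# (texts, w-units, s-units) per band, copied from Table 2.
BANDS = {
    "Unknown": (162, 1_814_051, 127_132),
    "Before 1974": (47, 1_741_624, 121_323),
    "1974 to 1983": (156, 4_621_950, 255_057),
    "1984 to 1994": (3_689, 89_442_309, 5_549_581),
}

# Sum each column across the four bands.
totals = tuple(sum(column) for column in zip(*BANDS.values()))
print(totals)
print(date_band(1962), "|", date_band(1991), "|", date_band(None))
```

The column sums match the 4,054 texts, 97,619,934 w-units and 6,053,093 s-units reported for the corpus as a whole, so every text falls into exactly one band.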
