- 1. Informants – an empirical, active method
- 2. Recording – an empirical, active, instrumental method
- 4. Experiments
- 5. The Comparative Method. The reconstruction technique.
- 7. Computer Techniques
- 2. The British National Corpus: World Edition October 2000
- 1. General Notes
- 2. Design of the corpus
- 3. Selection features
- 3.1. Sample size and method
- 3.2. Domain
- 3. Selection procedures employed – Books
- 4. Design of the spoken component
- 5. How to search
- Simple Search from the British Library
- 1. Structural grammatical theories.
- The distributional analysis;
- The Immediate Constituent (IC) analysis (phrase-structure grammar).
2. Design of the corpus
There is a broad consensus among the participants in the project and among corpus linguists that a general-purpose corpus of the English language would ideally contain a high proportion of spoken language in relation to written texts. However, it is significantly more expensive to record and transcribe natural speech than to acquire written text in computer-readable form. Consequently the spoken component of the BNC constitutes approximately 10 per cent (10 million words) of the total and the written component 90 per cent (90 million words). These were agreed to be realistic targets, given the constraints of time and budget, yet large enough to yield valuable empirical statistical data about spoken English.
The BNC World Edition contains 4054 texts and occupies 1,508,392 Kbytes, or about 1.5 GB. In total, it comprises just over 100 million orthographic words (specifically, 100,467,090), but the number of w-units is slightly lower: 97,619,934. The total number of s-units is just over 6 million (6,053,093).
• S-units (segment-units): number of <s> elements – more or less equivalent to sentences
• W-units: number of <w> elements – more or less equivalent to words.
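The two unit counts can be illustrated with a minimal Python sketch that tallies opening <s> and <w> tags in a marked-up fragment. The sample fragment, its `c5` and `n` attributes, and the function name `count_units` are assumptions made for illustration; real BNC markup is richer, and this is not the procedure used to produce the published figures.

```python
import re

# Hypothetical, simplified fragment; the attribute names are illustrative only.
SAMPLE = (
    '<s n="1"><w c5="AT0">The </w><w c5="NN1">corpus </w><w c5="VBZ">is </w>'
    '<w c5="AJ0">large</w><c c5="PUN">.</c></s>'
    '<s n="2"><w c5="PNP">It </w><w c5="VVZ">grows</w><c c5="PUN">.</c></s>'
)


def count_units(markup: str) -> dict[str, int]:
    """Count opening <s> and <w> tags (closing tags and other elements are ignored)."""
    return {
        # "<s" followed by whitespace or ">" excludes "</s>" and names such as "<seg>".
        "s_units": len(re.findall(r"<s[\s>]", markup)),
        "w_units": len(re.findall(r"<w[\s>]", markup)),
    }


counts = count_units(SAMPLE)
print(counts)
```

Counting opening tags rather than parsing the markup keeps the sketch usable on fragments that are not well-formed documents; punctuation marked with a separate element (here the assumed `<c>`) is not counted as a w-unit.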
Percentages are calculated with reference to the relevant portion of the corpus; for example, in the table for "written text domain", they refer to the total number of written texts. These reference totals are given in the first table below.
Table 1. Composition of the BNC World Edition
Text type | Texts | W-units | % | S-units | %
Spoken demographic | 153 | 4206058 | 4.30 | 610563 | 10.08
Spoken context-governed | 757 | 6135671 | 6.28 | 428558 | 7.07
All Spoken | 910 | 10341729 | 10.58 | 1039121 | 17.78
Written books and periodicals | 2688 | 78580018 | 80.49 | 4403803 | 72.75
Written-to-be-spoken | 35 | 1324480 | 1.35 | 120153 | 1.98
Written miscellaneous | 421 | 7373707 | 7.55 | 490016 | 8.09
All Written | 3144 | 87278205 | 89.39 | 5013972 | 82.82
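As a sanity check on Table 1, the shares can be recomputed from the raw counts against the corpus totals quoted earlier (97,619,934 w-units and 6,053,093 s-units). The sketch below is only an illustration; the names `ROWS` and `share` are assumptions, and the recomputed values may differ slightly from the published percentages depending on how those were rounded.

```python
# Corpus-wide totals quoted in the text.
TOTAL_TEXTS = 4054
TOTAL_W_UNITS = 97_619_934
TOTAL_S_UNITS = 6_053_093

# (texts, w-units, s-units) for the five component rows of Table 1.
ROWS = {
    "Spoken demographic": (153, 4_206_058, 610_563),
    "Spoken context-governed": (757, 6_135_671, 428_558),
    "Written books and periodicals": (2688, 78_580_018, 4_403_803),
    "Written-to-be-spoken": (35, 1_324_480, 120_153),
    "Written miscellaneous": (421, 7_373_707, 490_016),
}


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


for name, (texts, w_units, s_units) in ROWS.items():
    print(
        f"{name:30s} "
        f"{share(w_units, TOTAL_W_UNITS):6.2f} "
        f"{share(s_units, TOTAL_S_UNITS):6.2f}"
    )

# The component rows should add up to the corpus totals.
sum_texts = sum(t for t, _, _ in ROWS.values())
sum_w = sum(w for _, w, _ in ROWS.values())
sum_s = sum(s for _, _, s in ROWS.values())
print(sum_texts == TOTAL_TEXTS, sum_w == TOTAL_W_UNITS, sum_s == TOTAL_S_UNITS)
```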
All texts are also classified according to their date of production. For spoken texts, the date was that of the recording. For written texts, the date used for classification was the date of production of the material actually transcribed, for the most part; in the case of imaginative works, however, the date of first publication was used. Informative texts were selected only from 1975 onwards, imaginative ones from 1960, reflecting their longer “shelf-life”, though most (75 per cent) of the latter were published no earlier than 1975.
Table 2. Date of production
Creation date | Texts | W-units | % | S-units | %
Unknown | 162 | 1814051 | 1.85 | 127132 | 2.10
Before 1974 | 47 | 1741624 | 1.78 | 121323 | 2.00
1974 to 1983 | 156 | 4621950 | 4.73 | 255057 | 4.21
1984 to 1994 | 3689 | 89442309 | 91.62 | 5549581 | 91.68
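The dating rules described above (recording date for spoken texts, first publication for imaginative written texts, production date otherwise) and the bands of Table 2 can be sketched in Python as follows. The record fields (`mode`, `imaginative`, `recorded`, `first_published`, `produced`), the sample records, and the decision to reject years after 1994 are all assumptions made for illustration, not part of the corpus documentation.

```python
from collections import Counter
from typing import Optional


def classification_year(text: dict) -> Optional[int]:
    """Pick the year used for classification (field names are hypothetical)."""
    if text["mode"] == "spoken":
        return text.get("recorded")          # date of the recording
    if text.get("imaginative"):
        return text.get("first_published")   # imaginative works: first publication
    return text.get("produced")              # other written texts: production date


def date_band(year: Optional[int]) -> str:
    """Map a year to one of the bands used in Table 2."""
    if year is None:
        return "Unknown"
    if year < 1974:
        return "Before 1974"
    if year <= 1983:
        return "1974 to 1983"
    if year <= 1994:
        return "1984 to 1994"
    # Assumption: later years fall outside the table and are treated as errors.
    raise ValueError(f"year {year} is outside the bands of Table 2")


# Invented records, for illustration only.
TEXTS = [
    {"mode": "spoken", "recorded": 1992},
    {"mode": "written", "imaginative": True, "first_published": 1968, "produced": 1989},
    {"mode": "written", "imaginative": False, "produced": 1979},
    {"mode": "written", "imaginative": False},
]

tally = Counter(date_band(classification_year(t)) for t in TEXTS)
print(tally)
```

Note that the second record lands in "Before 1974" even though the edition transcribed dates from 1989, because imaginative works are classified by first publication.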
