- 1. Informants – an empirical, active method
- 2. Recording – an empirical, active, instrumental method
- 4. Experiments
- 5. The Comparative Method. The reconstruction technique.
- 7. Computer Techniques
- 2. The British National Corpus: World Edition October 2000
- 1. General Notes
- 2. Design of the corpus
- 3. Selection features
- 3.1. Sample size and method
- 3.2. Domain
- 3. Selection procedures employed – Books
- 4. Design of the spoken component
- 5. How to search
- Simple Search from the British Library
- 1. Structural grammatical theories.
- The distributional analysis;
- The Immediate Constituent (ic) analysis (phrase-structure grammar).
3. Selection procedures employed – Books
Roughly half the titles were randomly selected from available candidates identified in Whitaker’s Books in Print (BIP), 1992, by students of Library and Information Studies at Leeds City University. Each text randomly chosen was accepted only if it fulfilled certain criteria: it had to be published by a British publisher, contain sufficient pages of text to make its incorporation worthwhile, consist mainly of written text, fall within the designated time limits, and cost less than a set price. The final selection weeded out texts by non-UK authors. Half of the books having been selected by this method, the remaining half were selected systematically.
4. Design of the spoken component
The British National Corpus project undertook to produce five to ten million words of orthographically transcribed speech, covering a wide range of speech variation. A large proportion of the spoken part of the corpus (over four million words) comprises spontaneous conversational English. The importance of conversational dialogue to linguistic study is unquestionable: it is the dominant component of general language both in terms of language reception and language production.
The demographic sampling approach was adopted for approximately half of the spoken part of the corpus. The demographic component of the corpus was complemented with a separate text typology intended to cover the full range of linguistic variation found in spoken language; this is termed the context-governed part of the corpus. The sampling frame was defined in terms of the language production of the population of British English speakers in the United Kingdom. Representativeness was achieved by sampling a spread of language producers in terms of age, gender, social group, and region, and recording their language output over a set period of time.
A total of 757 texts (6,135,671 words) make up the context-governed part of the corpus. The following contexts are distinguished:
Table 5. Context in which spoken text was captured
| Context | Texts | W-units | % of w-units | S-units | % of s-units |
| --- | --- | --- | --- | --- | --- |
| Educational/Informative | 169 | 1633303 | 26.61 | 119252 | 27.82 |
| Business | 131 | 1285938 | 20.95 | 108101 | 25.22 |
| Public/Institutional | 262 | 1655263 | 26.97 | 96504 | 22.51 |
| Leisure | 195 | 1561167 | 25.44 | 104701 | 24.43 |
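The figures in Table 5 can be cross-checked with a few lines of Python. This is a minimal sketch, not part of the BNC documentation: the names `TABLE_5`, `column_totals`, and `shares` are illustrative, and the rows are copied from the table above. The published percentages seem to be truncated rather than rounded to two decimals, so differences of about 0.01 are to be expected.

```python
# Rows copied from Table 5: (context, texts, w-units, s-units).
TABLE_5 = [
    ("Educational/Informative", 169, 1633303, 119252),
    ("Business", 131, 1285938, 108101),
    ("Public/Institutional", 262, 1655263, 96504),
    ("Leisure", 195, 1561167, 104701),
]


def column_totals(rows):
    """Return the column sums as (texts, w-units, s-units)."""
    texts = sum(row[1] for row in rows)
    w_units = sum(row[2] for row in rows)
    s_units = sum(row[3] for row in rows)
    return texts, w_units, s_units


def shares(rows):
    """Map each context to its percentage share of w-units and s-units."""
    _, w_total, s_total = column_totals(rows)
    return {
        name: (100 * w / w_total, 100 * s / s_total)
        for name, _, w, s in rows
    }


texts, w_units, s_units = column_totals(TABLE_5)
print(f"texts={texts} w-units={w_units} s-units={s_units}")
for name, (w_pct, s_pct) in shares(TABLE_5).items():
    print(f"{name:25s} {w_pct:6.2f}% of w-units  {s_pct:6.2f}% of s-units")
```

Computing the shares from the raw counts, rather than trusting the percentage columns, makes it easy to spot a transcription error in any single cell.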
Sampling procedure
124 adults (aged 15+) were recruited from across the United Kingdom. Recruits were of both sexes and from all age groups and social classes. The intention was, as far as possible, to recruit equal numbers of men and women, equal numbers from each of the six age groups, and equal numbers from each of four social classes.
Recording procedure
All conversations were recorded as unobtrusively as possible, so that the material gathered approximated closely to natural, spontaneous speech. In many cases the only person aware that the conversation was being taped was the person carrying the recorder.
The context-governed part of the corpus
As mentioned above, the spoken texts in the demographic part of the corpus consist mainly of conversational English. A complementary approach was developed to create what is termed the context-governed part of the corpus. As in other spoken corpora, the range of text types was selected according to a priori linguistically motivated categories. At the top layer of the typology is a division into four equal-sized contextually based categories: educational, business, public/institutional, and leisure. Each is divided into the subcategories monologue (40 percent) and dialogue (60 percent). Each monologue subcategory therefore totals 10 percent of the context-governed part of the corpus, and each dialogue subcategory 15 percent.
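The arithmetic behind the 10 percent and 15 percent figures can be sketched as follows. This is an illustrative calculation only: the names `CATEGORIES`, `SPLIT`, and `target_shares` are assumptions introduced for the example, and the shares are design targets, not the proportions actually achieved in the corpus.

```python
from fractions import Fraction

# Four equal-sized top-level categories of the typology.
CATEGORIES = ["educational", "business", "public/institutional", "leisure"]

# Split applied inside every category.
SPLIT = {"monologue": Fraction(40, 100), "dialogue": Fraction(60, 100)}


def target_shares():
    """Return the target share of each (category, mode) cell."""
    per_category = Fraction(1, len(CATEGORIES))  # 25 percent each
    return {
        (category, mode): per_category * part
        for category in CATEGORIES
        for mode, part in SPLIT.items()
    }


cell_shares = target_shares()
for (category, mode), share in cell_shares.items():
    print(f"{category:22s} {mode:10s} {float(share):.0%}")
```

Exact fractions are used instead of floats so that the eight cells add up to exactly 1, which confirms that the typology partitions the whole context-governed part.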
Sampling procedure
For the most part, a variety of text types were sampled within three geographic regions. However, some text types, such as parliamentary proceedings and most broadcast categories, apply to the country as a whole and were not regionally sampled. Each text type required its own sampling strategy, as follows:
Educational and informative:
Lectures, talks, educational demonstrations: Within each sampling area a university (or college of further education) and a school were selected. A range of lectures and talks was recorded, varying the topic, level, and speaker gender.
News commentaries: Regional sampling was not applied, but both national and regional broadcasting companies were sampled. The topic, level, and gender of the commentator were varied.
Classroom interaction: Schools were regionally sampled and the level (generally based on student age) and topic were varied. Home tutorials were also included.
Business:
Company talks and interviews: Sampling took into account company size, areas of activity, and gender of speakers.
Trade union talks: Talks to union members, branch meetings and annual conferences were all sampled.
Sales demonstrations: A range of topics was included.
Business meetings: Companies were selected according to size, area of activity, and purpose of meeting.
Consultations: These included medical, legal, business and professional consultations. All categories under this heading were regionally sampled.
Public/institutional:
Political speeches: Regional sampling of local politics, plus speeches in both the House of Commons and the House of Lords.
Sermons: Different denominations were sampled.
Public/government talks: Regional sampling of local inquiries and meetings, plus national issues at different levels.
Council meetings: Regionally sampled, covering parish, town, district, and county councils.
Religious meetings: Includes church meetings, group discussions, and so on.
Parliamentary proceedings: Sampling of main sessions and committees, House of Commons and House of Lords.
Legal proceedings: The Royal Courts of Justice and local magistrates' and similar courts were sampled.
Leisure:
Speeches: Regionally sampled, covering a variety of occasions and speakers.
Sports commentaries: Exclusively broadcast, sampling a variety of sports, commentators, and TV/radio channels.
Talks to clubs: Regionally sampled, covering a range of topics and speakers.
Broadcast chat shows and phone-ins: Only those that include a significant amount of unscripted speech were selected from both television and radio.
Club meetings: Regionally sampled, covering a wide range of clubs.
