



Peter Trudgill

A Glossary of Sociolinguistics

0 7486 1623 3

Jean Aitchison

A Glossary of Language and Mind

0 7486 1824 4

Laurie Bauer

A Glossary of Morphology

0 7486 1853 8

Alan Davies

A Glossary of Applied Linguistics

0 7486 1854 6

Geoffrey Leech

A Glossary of English Grammar

0 7486 1729 9

Alan Cruse

A Glossary of Semantics and Pragmatics

0 7486 2111 3

Philip Carr

A Glossary of Phonology

0 7486 2234 9

Vyvyan Evans

A Glossary of Cognitive Linguistics

0 7486 2280 2

Mauricio J. Mixco and Lyle Campbell

A Glossary of Historical Linguistics

0 7486 2379 5

A Glossary of Corpus Linguistics

Paul Baker, Andrew Hardie and Tony McEnery

Edinburgh University Press

© Paul Baker, Andrew Hardie and Tony McEnery, 2006

Edinburgh University Press Ltd

22 George Square, Edinburgh

Typeset in Sabon by Norman Tilley Graphics, Northampton, and printed and bound in Finland by WS Bookwell

A CIP record for this book is available from the British Library

ISBN-10 0 7486 2403 1 (hardback)

ISBN-13 978 0 7486 2403 4

ISBN-10 0 7486 2018 4 (paperback)

ISBN-13 978 0 7486 2018 0

The right of Paul Baker, Andrew Hardie and Tony McEnery to be identified as authors of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

Published with the support of the Edinburgh University Scholarly Publishing Initiatives Fund

Introductory Notes

Website Addresses

We have tried to avoid referring to website addresses where possible, as we found that some of the websites we included at the start of writing this book were no longer in existence when we reached the final stages. We have included websites of some organisations, groups, corpora or software where we feel that the site is unlikely to close down or move. However, we cannot vouch for the longevity of all of the websites given here. If readers wish to follow up specific terms on the internet and are taken to a dead link, we suggest that they accept our apologies and then try a reputable search engine like www.google.com (assuming that Google still exists!).

List of Acronyms

Corpus linguistics is a discipline that has yielded a prolific number of acronyms. This presents a problem in terms of consistency: some terms are best known by their acronym, others are best known by their full name. We want to make the ordering of dictionary entries consistent, yet we also want them to be easy to find. So, in ordering dictionary entries we have made the decision to spell out all acronyms as full words, while including a list of all of the acronyms at the beginning of the dictionary along with their full titles. Therefore, readers who want to use this dictionary to find out about the BNC can look up its full title in the acronym list at the beginning of the book, and then go to the dictionary entry under British National Corpus.

ACASD (Automatic Content Analysis of Spoken Discourse) word sense tagging system

ACE (Australian Corpus of English)

ACH (Association for Computers and the Humanities)

ACL (Association for Computational Linguistics)

ACLDCI (Association for Computational Linguistics Data Collection Initiative)

AGTK (Annotation Graph Toolkit)

AHI (American Heritage Intermediate) Corpus

ALLC (Association for Literary and Linguistic Computing)

AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models) Tagger

ANC (American National Corpus)

ANLT (Alvey Natural Language Tools)

AP (Associated Press) Treebank

APHB (American Printing House for the Blind) Treebank

ARCHER (Representative Corpus of Historical English Registers) Corpus

ASCII (American Standard Code for Information Interchange)

ATC (Air Traffic Control) Corpus

AUTASYS (Automatic Text Annotation System) Tagger

BAS (Bavarian Archive for Speech Signals)

BASE (British Academic Spoken English) Corpus

BNC (British National Corpus)

BoE (Bank of English)

CALL (Computer Assisted Language Learning)

CAMET (Computer Archive of Modern English Texts)

CANCODE (Cambridge and Nottingham Corpus of Discourse in English)

CEEC (Corpus of Early English Correspondence)

CEG (Cronfa Electroneg o Gymraeg)

CELEX (Centre for Lexical Information)



CES (Corpus Encoding Standard)

CETH (Centre for Electronic Texts in the Humanities)

CHAT (Codes for the Human Analysis of Transcripts)

CHILDES (Child Language Data Exchange System)

CIDE (Collaborative International Dictionary of English)

CLAN (Computerized Language Analysis) System

CLAWS (Constituent Likelihood Automatic Word-tagging System)

CLEC (Chinese Learner English Corpus)

CLR (Consortium for Lexical Research)

CMU SLM (Carnegie Mellon University–Cambridge Statistical Language Modeling) Toolkit

Coconut (Cooperative, Coordinated Natural Language Utterances) Corpus

COLT (Bergen Corpus of London Teenage English)

CRATER (Corpus Resources and Terminology Extraction)

CSAE (Corpus of Spoken American English)

CSLU (Centre for Spoken Language Understanding) Speech Corpora

CSTR (Centre for Speech Technology Research)

CWBC (Corpus of Written British Creole)

DAT (Dialogue Annotation Tool)

DCPSE (Diachronic Corpus of Present-day Spoken English)

DTD (document type definition)

EACL (European Chapter of the Association for Computational Linguistics)

EAGLES (Expert Advisory Group on Language Engineering Standards)

ECI (European Corpus Initiative)

ELAN (Eudico Linguistic Annotator)

ELAN (European Language Activity Network)

ELDA (Evaluations and Language Resources Distribution Agency)

ELRA (European Language Resources Association)


ELSNET (European Network of Excellence in Human Language Technologies)

EMILLE (Enabling Minority Language Engineering) Corpus

ENGCG (Constraint Grammar Parser of English)

ESFSLD (European Science Foundation Second Language Databank)

FLOB (Freiburg–LOB Corpus of British English)

FRIDA (French Interlanguage Database)

FROWN (Freiburg–Brown Corpus of American English)

FTF (Fuzzy Tree Fragments)

GATE (General Architecture for Text Engineering)

GPEC (Guangzhou Petroleum English Corpus)

HCRC (Human Communication Research Centre)

HKUST (Hong Kong University of Science and Technology) Corpus

HLT (human language technology)

HTML (Hypertext Markup Language)

ICAME (International Computer Archive of Modern and Medieval English)

ICE (International Corpus of English)

ICECUP (International Corpus of English Corpus Utility Program)

ICLE (International Corpus of Learner English)

IMS (Institut für Maschinelle Sprachverarbeitung)

ISLE (Interactive Spoken Language Education) Corpus

IviE (Intonational Variation in English) Corpus

KWIC (key word in context)

LCMC (Lancaster Corpus of Mandarin Chinese)

LCPW (Lancaster Corpus of Children’s Project Writing)

LDB (Linguistic DataBase)

LDC (Linguistic Data Consortium)

LeaP (Learning the Prosody of a Foreign Language) Corpus

Lindsei (Louvain International Database of Spoken English Interlanguage)

LLC (London–Lund Corpus)



LOB (Lancaster–Oslo/Bergen) Corpus

MARSEC (Machine-Readable Spoken English Corpus)

MBT (Memory Based Tagger)

METER (Measuring Text Reuse) Corpus

MICASE (Michigan Corpus of Academic Spoken English)

MTP (Münster Tagging Project)

MXPOST (Maximum Entropy Part-of-Speech Tagger)

NECTE (Newcastle Electronic Corpus of Tyneside English)

NEET (Network of Early Eighteenth Century English Texts)

NITCS (Northern Ireland Transcribed Corpus of Speech)

NLP (natural language processing)

OCP (Oxford Concordance Programme)

OCR (optical character recognition)

OLAC (Open Language Archives Community)

OTA (Oxford Text Archive)

POS (part-of-speech) tagging

POW (Polytechnic of Wales) corpus

SARA (SGML-Aware Retrieval Application)

ScoSE (Saarbrücken Corpus of Spoken English)

SEC (Lancaster/IBM Spoken English Corpus)

SEU (Survey of English Usage) Corpus

SGML (Standard Generalised Markup Language)

SPAAC (Speech Act Annotated Corpus for Dialogue Systems)

SUSANNE (Surface and Underlying Structural Analyses of Naturalistic English) Corpus

TEI (Text Encoding Initiative)

TELC (Thai English Learner Corpus)

TESS (Text Segmentation for Speech) Project

TLFi (Trésor de la Langue Française Informatisé)

TLG (Thesaurus Linguae Graecae)

TnT (Trigrams‘n’Tags)

TOSCA (Tools for Syntactic Corpus Analysis) Corpus

T2K-SWAL (TOEFL 2000 Spoken and Written Academic Language Corpus)