A GLOSSARY OF
CORPUS LINGUISTICS
TITLES IN THE SERIES INCLUDE
Peter Trudgill
A Glossary of Sociolinguistics
0 7486 1623 3
Jean Aitchison
A Glossary of Language and Mind
0 7486 1824 4
Laurie Bauer
A Glossary of Morphology
0 7486 1853 8
Alan Davies
A Glossary of Applied Linguistics
0 7486 1854 6
Geoffrey Leech
A Glossary of English Grammar
0 7486 1729 9
Alan Cruse
A Glossary of Semantics and Pragmatics
0 7486 2111 3
Philip Carr
A Glossary of Phonology
0 7486 2234 9
Vyvyan Evans
A Glossary of Cognitive Linguistics
0 7486 2280 2
Mauricio J. Mixco and Lyle Campbell
A Glossary of Historical Linguistics
0 7486 2379 5
A Glossary of
Corpus Linguistics
Paul Baker, Andrew Hardie
and Tony McEnery
Edinburgh University Press
© Paul Baker, Andrew Hardie and Tony McEnery, 2006
Edinburgh University Press Ltd
22 George Square, Edinburgh
Typeset in Sabon by Norman Tilley Graphics, Northampton, and printed and bound in Finland by WS Bookwell
A CIP record for this book is available from the British Library
ISBN-10 0 7486 2403 1 (hardback)
ISBN-13 978 0 7486 2403 4
ISBN-10 0 7486 2018 4 (paperback)
ISBN-13 978 0 7486 2018 0
The right of Paul Baker, Andrew Hardie and Tony McEnery to be identified as authors of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
Published with the support of the Edinburgh University Scholarly Publishing Initiatives Fund
Introductory Notes
Website Addresses
We have tried to avoid referring to website addresses where possible, as we found that some of the websites we included at the start of writing this book were no longer in existence when we reached the final stages. We have included websites of some organisations, groups, corpora or software where we feel that the site is unlikely to close down or move. However, we cannot vouch for the longevity of all of the websites given here. If readers wish to follow up specific terms on the internet and are taken to a dead link, we suggest that they accept our apologies and then try a reputable search engine like www.google.com (assuming that Google still exists!).
List of Acronyms
Corpus linguistics is a discipline that has produced a large number of acronyms. This presents a problem in terms of consistency: some terms are best known by their acronym, others are best known by their full name. We want to make the ordering of dictionary entries consistent, yet we also want them to be easy to find. So, in ordering dictionary entries we have made the decision to spell out all acronyms as full words, while including a list of all of the acronyms at the beginning of the dictionary along with their full titles. Therefore, readers who want to use this dictionary to find out about the BNC can look up its full title in the acronym list at the beginning of the book, and then go to the dictionary entry under British National Corpus.
ACASD (Automatic Content Analysis of Spoken Discourse) word sense tagging system
ACE (Australian Corpus of English)
ACH (Association for Computers and the Humanities)
ACL (Association for Computational Linguistics)
ACLDCI (Association for Computational Linguistics Data Collection Initiative)
AGTK (Annotation Graph Toolkit)
AHI (American Heritage Intermediate) Corpus
ALLC (Association for Literary and Linguistic Computing)
AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models) Tagger
ANC (American National Corpus)
ANLT (Alvey Natural Language Tools)
AP (Associated Press) Treebank
APHB (American Printing House for the Blind) Treebank
ARCHER (Representative Corpus of Historical English Registers) Corpus
ASCII (American Standard Code for Information Interchange)
ATC (Air Traffic Control) Corpus
AUTASYS (Automatic Text Annotation System) Tagger
BAS (Bavarian Archive for Speech Signals)
BASE (British Academic Spoken English) Corpus
BNC (British National Corpus)
BoE (Bank of English)
CALL (Computer Assisted Language Learning)
CAMET (Computer Archive of Modern English Texts)
CANCODE (Cambridge and Nottingham Corpus of Discourse in English)
CEEC (Corpus of Early English Correspondence)
CEG (Cronfa Electroneg o Gymraeg)
CELEX (Centre for Lexical Information)
CES (Corpus Encoding Standard)
CETH (Centre for Electronic Texts in the Humanities)
CHAT (Codes for the Human Analysis of Transcripts) System
CHILDES (Child Language Data Exchange System)
CIDE (Collaborative International Dictionary of English)
CLAN (Computerized Language Analysis) System
CLAWS (Constituent Likelihood Automatic Word-tagging System)
CLEC (Chinese Learner English Corpus)
CLR (Consortium for Lexical Research)
CMU SLM (Carnegie Mellon University–Cambridge Statistical Language Modeling) Toolkit
Coconut (Cooperative, Coordinated Natural Language Utterances) Corpus
COLT (Bergen Corpus of London Teenage English)
CRATER (Corpus Resources and Terminology Extraction)
CSAE (Corpus of Spoken American English)
CSLU (Centre for Spoken Language Understanding) Speech Corpora
CSTR (Centre for Speech Technology Research)
CWBC (Corpus of Written British Creole)
DAT (Dialogue Annotation Tool)
DCPSE (Diachronic Corpus of Present-day Spoken English)
DTD (document type definition)
EACL (European Chapter of the Association for Computational Linguistics)
EAGLES (Expert Advisory Group on Language Engineering Standards)
ECI (European Corpus Initiative)
ELAN (Eudico Linguistic Annotator)
ELAN (European Language Activity Network)
ELDA (Evaluations and Language Resources Distribution Agency)
ELRA (European Language Resources Association)
ELSNET (European Network of Excellence in Human Language Technologies)
EMILLE (Enabling Minority Language Engineering) Corpus
ENGCG (Constraint Grammar Parser of English)
ESFSLD (European Science Foundation Second Language Databank)
FLOB (Freiburg–LOB Corpus of British English)
FRIDA (French Interlanguage Database)
FROWN (Freiburg–Brown Corpus of American English)
FTF (Fuzzy Tree Fragments)
GATE (General Architecture for Text Engineering)
GPEC (Guangzhou Petroleum English Corpus)
HCRC (Human Communication Research Centre)
HKUST (Hong Kong University of Science and Technology) Corpus
HLT (human language technology)
HTML (Hypertext Markup Language)
ICAME (International Computer Archive of Modern and Medieval English)
ICE (International Corpus of English)
ICECUP (International Corpus of English Corpus Utility Program)
ICLE (International Corpus of Learner English)
IMS (Institut für Maschinelle Sprachverarbeitung)
ISLE (Interactive Spoken Language Education) Corpus
IviE (Intonational Variation in English) Corpus
KWIC (key word in context)
LCMC (Lancaster Corpus of Mandarin Chinese)
LCPW (Lancaster Corpus of Children’s Project Writing)
LDB (Linguistic DataBase)
LDC (Linguistic Data Consortium)
LeaP (Learning the Prosody of a Foreign Language) Corpus
Lindsei (Louvain International Database of Spoken English Interlanguage)
LLC (London–Lund Corpus)
LOB (Lancaster–Oslo/Bergen) Corpus
MARSEC (Machine-Readable Spoken English Corpus)
MBT (Memory Based Tagger)
METER (Measuring Text Reuse) Corpus
MICASE (Michigan Corpus of Academic Spoken English)
MTP (Münster Tagging Project)
MXPOST (Maximum Entropy Part-of-Speech Tagger)
NECTE (Newcastle Electronic Corpus of Tyneside English)
NEET (Network of Early Eighteenth Century English Texts)
NITCS (Northern Ireland Transcribed Corpus of Speech)
NLP (natural language processing)
OCP (Oxford Concordance Programme)
OCR (optical character recognition)
OLAC (Open Language Archives Community)
OTA (Oxford Text Archive)
POS (part-of-speech) tagging
POW (Polytechnic of Wales) corpus
SARA (SGML-Aware Retrieval Application)
ScoSE (Saarbrücken Corpus of Spoken English)
SEC (Lancaster/IBM Spoken English Corpus)
SEU (Survey of English Usage) Corpus
SGML (Standard Generalised Markup Language)
SPAAC (Speech Act Annotated Corpus for Dialogue Systems)
SUSANNE (Surface and Underlying Structural Analyses of Naturalistic English) Corpus
TEI (Text Encoding Initiative)
TELC (Thai English Learner Corpus)
TESS (Text Segmentation for Speech) Project
TLFi (Trésor de la Langue Française Informatisé)
TLG (Thesaurus Linguae Graecae)
TnT (Trigrams‘n’Tags)
TOSCA (Tools for Syntactic Corpus Analysis) Corpus
T2K-SWAL (TOEFL 2000 Spoken and Written Academic Language Corpus)