Guy Aston and Lou Burnard, The BNC Handbook, Edinburgh: Edinburgh U.P., 1998, pp. 4-27

1.1 What is a corpus?

We shall discuss what a corpus is by looking at how the word is used, in particular by linguists. What kind of an object is a corpus, and what is it likely to be useful for?
        We learn the sense of a newly-encountered word in different ways. Young children experimentally combine or mutate words to see which uses meet with approval; older ones do the same in the process of defining peer groups based on a shared exotic vocabulary. In both cases, meaning is exemplified or confirmed by repeated, socially sanctioned, usage. One of the objectives of traditional linguistics was to overcome this requirement of exposure to "language in use" - an impractical option for those wishing to learn a new language in a short time, or to understand a language no longer spoken anywhere - by defining powerful general principles which would enable one to derive the sense of any newly-encountered word simply by applying etymological or morphological rules. Knowles (1996), arguing that linguistic theory is above all a matter of organizing linguistic knowledge in this way, points for example to the success with which such models have been used in training generations of schoolchildren to understand Latin or Greek unseens.
        While only experience can tell us what a word "is understood to mean", such analytic methods tell us what a word "ought to mean". A modern dictionary combines the strengths of both methods, by organizing evidence of usage into an analytic framework of senses.
        What, then, does the word 'corpus' actually mean? We might do worse than consider the five distinct senses listed in the second edition of the Oxford English Dictionary as a starting point (see figure on preceding page). Of these, two particularly refer to language. The first is that of "A body or collection of writings or the like; the whole body of literature on any subject". Thus we may speak of the 'Shakespearean corpus', meaning the entire collection of texts by Shakespeare. The second is that of "the body of written or spoken material upon which a linguistic analysis is based". This is the sense of the word from which the phrase 'corpus linguistics' derives, and in which we use it throughout this book. The two senses can, of course, overlap - as when, for example, the entire collection of a particular author's work is subjected to linguistic analysis. But a key distinction remains. In the words of John Sinclair, the linguist's corpus is "a collection of pieces of language, selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language" (Sinclair 1996). It is an object designed for the purpose of linguistic analysis, rather than an object defined by accidents of authorship or history.

As such, corpora can be contrasted with archives or collections whose components are unlikely to have been assembled with such goals in mind (see further Atkins et al 1992). Given this emphasis on intended function, the composition of a corpus will depend on the scope of the investigation. It may be chosen to characterize a particular historical state or a particular variety of a particular language, or it may be selected to enable comparison of a number of historical states, varieties or languages. Varieties may be selected on geographical (for example, British, American, or Indian English), sociological (for example, by gender, social class, or age group), or generic bases (for example written vs. spoken; legal or medical; technical or popular; private or public correspondence). Generally the texts to be included in a corpus are defined according to criteria which are external to the texts themselves, relating to the situation of their production or reception rather than any intrinsic property they may have. Discovery of such intrinsic properties (if any) may, indeed, be the purpose of the exercise.
        Corpora stored and processed by computer, once the exception, are now the norm. It is worth noting, however, that there is very little in the practice of corpus linguistics which could not in principle be done equally well by non-automatic means. In general, though, corpora are now understood to be computer-processable corpora.
        The British National Corpus (BNC) consists of a sample collection which aims to represent the universe of contemporary British English. Insofar as it attempts to capture the full range of varieties of language use, it is a balanced corpus rather than a register-specific or dialect-specific one; it is also a mixed corpus, containing both written texts and spoken ones - transcriptions of naturally occurring speech.

1.2 What can you get out of a corpus?

A corpus can enable grammarians, lexicographers, and other interested parties to provide better descriptions of a language by embodying a view of it which is beyond any one individual's experience. The authoritative Comprehensive Grammar of the English Language (Quirk et al 1985) was derived in part from evidence provided by one of the first modern English corpora, the Survey of English Usage. Svartvik and Quirk (1980: 9) observe that: "Since native speakers include lawyers, journalists, gynaecologists, school teachers, engineers, and a host of other specialists, it follows: (a) that no individual can be expected to have an adequate command of the whole 'repertoire': who, for example, could equally well draft a legal statute and broadcast a commentary on a football game? (b) that no grammarian can describe adequately the grammatical and stylistic properties of the whole repertoire from his own unsupplemented resources: 'introspection' as the sole guiding star is clearly ruled out."

        A corpus which is designed to sample the entire 'repertoire' offers a tool for the description of properties with which even the grammarian may not be personally familiar. Corpus-based descriptions have produced a few surprises, sometimes contradicting the received wisdom. Sampson (1996) describes how he became a corpus linguist as a result of his experience with theories of recursive 'central embedding' in sentences such as 'the mouse the cat the dog chased caught squeaked', where component clauses nest within each other like Russian dolls. Most discussions of this phenomenon had used linguistic intuition to analyze entirely imaginary sentences, claiming that such constructions were in some sense 'unnatural', though syntactically feasible. However, when Sampson turned to look at corpus data, he found that such centrally embedded structures were actually far from rare, and used in ways which appeared entirely 'natural'. While it does not eliminate linguistic intuition in classifying and evaluating instances, the use of corpora can remove much of the need to invent imaginary data, and can provide relatively objective evidence of frequency.

        The utility of a corpus to the lexicographer is even more striking: careful study of a very large quantity and wide range of texts is required to capture and exemplify anything like all the half-million or more words used in contemporary British English. It is no coincidence that dictionary publishers have played major roles in setting up the two largest current corpora of British English: the Bank of English (HarperCollins) and the BNC (Oxford University Press, Longman, Chambers); or that, in the increasingly competitive market for English language learners' dictionaries, four new editions published in 1995 (the Collins Cobuild Dictionary, the Cambridge Dictionary of International English, the Longman Dictionary of Contemporary English, the Oxford Advanced Learner's Dictionary) should all have made the fact of their being 'corpus-based' a selling point.

        Linguists have always made use of collections of textual data to produce grammars and dictionaries, but these have traditionally been analyzed in a relatively ad hoc manner, on the basis of individual salience, with a consequent tendency to privilege rare and striking phenomena at the expense of mundane or very high frequency items. Corpora, in particular computer-processable corpora, have instead allowed linguists to adopt a principle of 'total accountability', retrieving all the occurrences of a particular word or structure in the corpus for inspection, or (where this would be infeasible) randomly selected samples. This generally involves the use of specialized software to search for occurrences (or co-occurrences) of specified strings or patterns within the corpus. Other software may be used to calculate frequencies, or statistics derived from them, for example to produce word lists ordered by frequency of occurrence or to identify co-occurrences which are significantly more (or less) frequent than chance.

        Concordances are listings of the occurrences of a particular feature or combination of features in a corpus. Each occurrence found, or hit, is displayed with a certain amount of context - the text immediately preceding and following it. The most commonly used concordance type (known as KWIC for 'Key Word In Context') shows one hit per line of the screen or printout, with the principal search feature, or focus, highlighted in the centre. Concordances also generally give a reference for each hit, showing which source text in the corpus it is taken from and the line or sentence number. It is then up to the user to inspect and interpret the output. The amount of text visible in a KWIC display is generally enough to make some sense of the hit, though for some purposes, such as the interpretation of pronominal reference, a larger context may have to be specified. Most concordancing software allows hits to be formatted, sorted, edited, saved and printed in a variety of manners.
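
The KWIC layout described above can be illustrated with a minimal Python sketch. It is not the behaviour of any particular concordancing package: the function name `kwic`, the `width` parameter and the sample text are all assumptions made for illustration, and the reference to source text and line number that a real concordance would give is omitted.

```python
import re

def kwic(text, keyword, width=30):
    """Return one formatted line per hit of `keyword`, with the focus in a fixed column."""
    lines = []
    # \b restricts matches to whole words; the search ignores case.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that every focus starts in the same column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text, standing in for a corpus.
sample = ("The cat sat on the mat. A dog chased the cat. "
          "Nobody saw the cat again.")
for line in kwic(sample, "cat", width=20):
    print(line)
```

Widening `width` corresponds to asking for a larger context, as might be needed to resolve a pronoun.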

        Frequency counts are implicit in concordancing, since finding all the occurrences of a particular feature in the corpus makes counting the hits a trivial task. Software will generally allow numbers to be calculated without actually displaying the relevant concordance - an important feature where thousands or even millions of occurrences are involved. Frequency counts can be elaborated statistically, in many cases automatically by the concordancing software, but should be interpreted with care (see further 2.2.4 on page 40).
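
As a rough illustration of counting hits without displaying them, the following Python sketch builds a frequency table for a short invented text. The tokenization rule (lower-cased alphabetic strings, with internal apostrophes and hyphens kept) is an assumption chosen for simplicity, not the scheme used for any actual corpus, and `word_frequencies` is a hypothetical name.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lower-cased word-forms; apostrophes and hyphens stay inside a form."""
    tokens = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    return Counter(tokens)

# Invented sample text, standing in for a corpus.
sample = "The cat sat on the mat. The dog sat too."
freq = word_frequencies(sample)

print(freq["the"])          # number of hits for one form, no concordance shown
print(freq.most_common(3))  # start of a word list ordered by frequency
```
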
        Concordances and frequency counts can provide a wide variety of linguistic information. We list some of the kinds of questions which may be asked, relating to lexis, morphosyntax, and semantics or pragmatics.
        A corpus can be analyzed to provide the following kinds of lexical information:

How often does a particular word-form, or group of forms (such as the various forms of the verb 'start': 'start', 'starts', 'starting', 'started') appear in the corpus? Is 'start' more or less common than 'begin'? The relative frequency of any word-form can be expressed as a z-score, that is, as the number of standard deviations from the mean frequency of word-forms in the corpus. The number of occurrences of a word-form in the entire BNC ranges from over 6 million for the most frequent word, 'the', to 1 for 'aaarrrrrrrrggggggghhhhhh' or 'about-to-be-murdered'. The mean frequency is approximately 150, but the standard deviation of the mean is very high (over 11,000), indicating that there are very many words with frequencies far removed from the mean.

With what meanings is a particular word-form, or group of forms, used? Is 'back' more frequently used with reference to a part of the body or a direction? Do we 'start' and 'begin' the same sorts of things?

How often does a particular word-form, or group of forms, appear near to other particular word forms, which collocate with it within a given distance? Does 'immemorial' always have 'time' as a collocate? Is it more common for 'prices' to 'rise' or to 'increase'? Do different senses of the same word have different collocates?

How often does a particular word-form, or group of forms, appear in particular grammatical structures, which colligate with it? Is it more common to 'start to do something', or to 'start doing it'? Do different senses of the same word have different colligates?

How often does a particular word-form, or group of forms, appear in a certain semantic environment, showing a tendency to have positive or negative connotations? Does the intensifier 'totally' always modify verbs and adjectives with a negative meaning, such as 'fail' and 'ridiculous'?

How often does a particular word-form, or group of forms, appear in a particular type of text, or in a particular type of speaker or author's language? Is 'little' or 'small' more common in conversation? Do women say 'sort of' more than men? Does the word 'wicked' always have positive connotations for the young? Is the word 'predecease' found outside legal texts and obituaries? Do lower-class speakers use more (or different) expletives?

Whereabouts in texts does a particular word form, or group of forms, tend to occur? Does its meaning vary according to its position? How often does it occur within notes or headings, following a pause, near the end of a text, or at the beginning of a sentence, paragraph or utterance? And is it in fact true that 'and' never begins a sentence?
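
The z-score mentioned in the first of these questions can be sketched in a few lines of Python. The counts below are invented, the name `z_score` is hypothetical, and the use of the population standard deviation is an assumption; the text does not say exactly how the BNC figures were computed.

```python
from statistics import mean, pstdev

def z_score(count, all_counts):
    """Number of standard deviations between `count` and the mean of `all_counts`."""
    mu = mean(all_counts)
    sigma = pstdev(all_counts)  # population SD, since every word-form is counted
    return (count - mu) / sigma

# Invented frequency table for a toy corpus.
counts = {"the": 8, "cat": 4, "sat": 4, "mat": 2, "aardvark": 2}
z = {w: z_score(c, list(counts.values())) for w, c in counts.items()}

for word, score in z.items():
    print(f"{word:10s} {score:+.2f}")
```

Taking the approximate figures quoted above at face value (a mean of about 150 and a standard deviation of about 11,000), a form occurring 6 million times would have a z-score in the region of (6,000,000 - 150) / 11,000, or roughly 545.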

A corpus can also be analyzed to provide the following kinds of morphosyntactic information:

How frequent is a particular morphological form or grammatical structure? How much more common are clauses with active than with passive main verbs? What proportion of passive forms have the agent specified in a following 'by' phrase?

With what meanings is a particular structure used? Is there a difference between 'I hope that' and 'I hope to'?

How often does a particular structure occur with particular collocates or colligates? Is 'if I was you' or 'if I were you' more common?

How often does a particular structure appear in a particular type of text, or in a particular type of speaker or author's language? Are passives more common in scientific texts? Is the subjunctive used less by younger speakers?

Whereabouts in texts does a particular structure tend to occur? Do writers and speakers tend to switch from the past tense to the 'historic present' at particular points in narratives?

And, finally, a corpus can be analyzed to provide semantic or pragmatic information. Rather than examining the meanings and uses of particular forms, we can use it to identify the forms associated with particular meanings and uses:

What tools are most frequently referred to in texts talking about gardening?

What fields of metaphor are employed in economic discourse?

Do the upper-middle classes talk differently about universities from the working classes?

How do speakers close conversations, or open lectures? How do chairpersons switch from one point to another in meetings?

Are pauses in conversation more common between utterances than within them?

What happens when conversationalists stop laughing?

        Not all of these types of information are equally easy to obtain. In using concordancing software, specific strings of characters have to be searched for. In order to disambiguate homographs or to identify particular uses of words or structures, it may be necessary to inspect the lines in the output, classifying them individually. Thus while it is relatively easy to calculate the frequency of a word-form and of its collocates, it may be more difficult to calculate its frequency of use as a particular part of speech, with a particular sense, or in a particular position or particular kind of text.
        To help in such tasks, computer corpora are increasingly marked up with a detailed encoding which encompasses both external characteristics of each text and its production, and internal characteristics such as its formal structure. Such information will typically include details of what kind of text it is and where it comes from, details relating to the structure of the text and the status of particular components - division into chapters, paragraphs, spoken utterances, headings, notes, references, editorial comments etc., as well as any linguistic annotation, indicating for instance the part-of-speech value or the root form of each word. Such encoding permits the user to search for strings or patterns in particular kinds, parts or positions of texts, or with particular types of linguistic annotation.
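
A small Python sketch may help to show why such annotation matters for searching. The data structure, the tag names and the function names here are all assumptions made for illustration; they do not reproduce the encoding scheme of the BNC or of any other corpus.

```python
from collections import Counter

# Each token carries (word-form, part-of-speech tag, root form).
# The tags are illustrative labels, not those of a real tagset.
tokens = [
    ("She", "PRON", "she"), ("hurt", "VERB", "hurt"), ("her", "DET", "her"),
    ("back", "NOUN", "back"), ("and", "CONJ", "and"), ("went", "VERB", "go"),
    ("back", "ADV", "back"), ("home", "ADV", "home"),
]

def count_by_pos(tokens, form):
    """Count occurrences of one word-form separately for each part-of-speech tag."""
    return Counter(pos for word, pos, _ in tokens if word.lower() == form)

def forms_of(tokens, lemma):
    """List the word-forms whose annotated root form is `lemma`."""
    return [word for word, _, root in tokens if root == lemma]

print(count_by_pos(tokens, "back"))  # separates the body part from the direction
print(forms_of(tokens, "go"))        # finds 'went' through its root form
```

Without the annotation, a plain string search for 'back' would return both uses undifferentiated, and a search for 'go' would miss 'went' altogether.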
        It can be equally difficult to find instances of particular syntactic, semantic or pragmatic categories unless these happen to have clear lexical correlates, or the corpus markup clearly distinguishes them. For instance, the markup of the BNC might be used to find occurrences of highlighting (typically through italics or underlining in the original), to investigate headings and captions, to generate a list of the publishers responsible for the texts in the corpus, or to identify those texts published by specific publishers.
        While the examples just cited have all concerned analyses within a particular corpus, it is evident that all these areas can also be examined contrastively, comparing data from corpora of different languages, historical periods, dialects or geographical varieties, modes (spoken or written), or registers. By comparing one of the standard corpora collected twenty years ago with an analogous corpus of today, it is possible to investigate recent changes in English. By comparing corpora collected in different parts of the world, it is possible to investigate differences between, for instance, British and Australian English. By comparing a corpus of translated texts with one of texts originally created in the target language, it is possible to identify linguistic properties peculiar to translation. By comparing a small homogeneous corpus of some particular kind of material with a large balanced corpus (such as the BNC), it is possible to identify the distinctive linguistic characteristics of the former.

1.3 How have corpora been used?

This section describes a few major corpora which have previously been created and discusses some of the work done with them, to illustrate current concerns in the field.

1.3.1 What kinds of corpora exist?
We begin by listing some of the main corpora developed for English in the past, grouped according to the main areas of language use they sample. For a fuller annotated list, see Edwards (1993) or Wichmann et al (in press).

geographical varieties The earliest corpus in electronic form, compiled at Brown University in 1964, contained 1 million words of written American English published in 1961 (Kucera and Francis 1967). The Brown corpus has since been widely imitated, with similarly-designed corpora being compiled for British (the Lancaster-Oslo-Bergen corpus or LOB: Johansson 1980), Indian (the Kolhapur Corpus of Indian English: Shastri 1988), Australian (the Macquarie Corpus of Australian English: Collins and Peters 1988) and New Zealand varieties (the Wellington Corpus: Bauer 1993). The International Corpus of English project (ICE) is currently creating a corpus with similarly-designed components representing each of the major international varieties of contemporary English (Greenbaum 1992).
spoken language corpora The earliest computer corpora, such as Brown and LOB, were collections of written data. A number of corpora consisting of transcripts of spoken English have since been developed. These vary enormously both in the types of speech they include and in the form and detail of transcription employed (see 1.4.2 on page 26). The best-known is probably the London-Lund Corpus, a computerised version of just under half-a-million words of the Survey of English Usage conversational data (Svartvik 1990; Svartvik and Quirk 1980), which has been widely used in comparisons with the LOB corpus of written English. The Corpus of Spoken American English under development at Santa Barbara (Chafe et al 1991) is collecting a similar quantity of American conversational data.
mixed corpora The major large mixed corpus to precede the BNC was the Birmingham collection of English texts, developed at the University of Birmingham with the dictionary publishers Collins during the 1980s as a basis for the production of dictionaries and grammars (see e.g. Sinclair 1987). This originally contained 7.5 million words, growing eventually to nearly 20 million, of which approximately 1.3 million were transcripts of speech. The collection has continued to grow since, having now been incorporated into the 300 million word Bank of English (see 1.4.1 on page 21).
historical varieties The most extensive corpus of historical English is probably the Helsinki corpus of English texts: diachronic and dialectal (Kytö 1993). The corpus has three parts, corresponding with three historical periods (Old, Middle, and Early Modern English); within each period, there are samples of different dialects, permitting not only diachronic comparisons but also synchronic comparisons of different geographical varieties.
child and learner varieties A number of corpora have been compiled relating to particular categories of language users, in particular children who are acquiring English as their first language, and foreign learners of English. They are sometimes termed special corpora (Sinclair 1996), because they document uses of language which are seen as deviant with respect to a general norm. Instances include the Polytechnic of Wales corpus of child language (O'Donoghue 1991), and the International Corpus of Learner English (ICLE) being created at Louvain (Granger 1993).
genre- and topic-specific corpora Other corpora have been designed to include only samples of language of a particular type, for example dealing with a particular topic, or belonging to a particular genre or register. There are many examples, ranging from psycho-linguistically motivated experiments such as the HCRC map task corpus (Anderson et al 1991), consisting of 128 transcribed performances of map-reading tasks, to corpora created for other purposes, such as the Hong Kong corpus of computer science texts, designed to support analysis of technical vocabulary (Davison 1992). In the USA, the Linguistic Data Consortium has produced a large number of corpora of specific genres of speech and writing on CD-ROM, ranging from telephone conversations to stock-exchange reports.
multilingual corpora Monolingual corpora of languages other than English are not mentioned here for reasons of space, but a number of multilingual corpora containing texts in both English and one or more other languages have been developed. Some are fairly heterogeneous collections, while others are carefully constructed ensembles of texts selected on the basis of similar criteria in each language. In the former category, the European Corpus Initiative (ECI) has produced a multilingual corpus of over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more (Armstrong-Warwick et al 1994). In the latter category, an EU-funded project called PAROLE is currently building directly comparable corpora for each major European language.

1.3.2 Some application areas
The range of corpus-based descriptive work is well documented by Altenberg's bibliographies of corpus linguistics (Altenberg 1990, 1995), and is also covered in a number of introductory textbooks on the field. Recent examples include Sinclair (1991), Stubbs (1996) and McEnery and Wilson (1996); Leech and Fligelstone (1992) and Biber et al (1996) provide accessible short introductions. In this section, we review a handful of studies in order to illustrate some of the areas in which corpus-based work has been carried out, and to raise some of the key methodological issues. No claim to completeness of coverage is intended, as the field is both very varied and rapidly expanding. For up-to-date information, and for a wider (more corpus-like!) perspective, the reader could do a lot worse than to search the World Wide Web for pages on which the phrase 'corpus linguistics' appears.
        Corpus-based research naturally grounds its theorizing in empirical observation rather than in appeals to linguistic intuition or expert knowledge. It thus emphatically rejects one of the major tenets of Chomskian linguistics, namely that the linguist's introspection provides the only appropriate basis for describing language, insofar as "information about the speaker-hearer's competence ... is neither presented for direct observation nor extractable from data by inductive procedures of any known sort" (Chomsky 1965: 18). Corpus users have taken varying positions on these issues, ranging from the 'weak' view, that sees corpus data as complementing the 'armchair' linguist's intuitive insights by providing real-life examples and a reliable testbed for hypotheses (see 8.1 on page 143), to the 'strong' view, according to which corpus data should always override intuition, and discussion should be confined solely to naturally-occurring examples. In either case, corpus-based work has wider affinities than many other branches of linguistics, since the study of language-in-use has something to offer historical, political, literary, sociological, or cultural studies, and has profited from the resulting synergy.
        Our discussion focuses on four application areas: the emergence of collocation as a key component in linguistic description; the opportunities afforded by corpus-based methods for contrastive studies of different languages, varieties and registers; the use of corpora in natural language processing (NLP); and finally, their use in foreign language teaching.

1.3.3 Collocation
One of the forefathers of contemporary corpus linguistics, J.R. Firth, observed that part of the meaning of the word 'ass' consists in its habitual collocation with an immediately preceding 'you silly' (Firth 1957: 11). (Whether this use is still current some fifty years later is a question the BNC can answer: there are in fact only 8 occurrences of 'silly ass' in the corpus, none of them preceded by 'you'.) There are a great many cases in English where the occurrence of one word predicts the occurrence of another, either following or preceding it. Kjellmer (1991) notes such examples as 'billy', which predicts 'goat' or 'can' following, and 'bail', which predicts 'jump' or 'stand' preceding. Such collocational patterns tend to be highlighted by KWIC concordances, since these show just the few words which precede and follow the keyword or focus, and can typically be sorted according to these words. It is also relatively easy to calculate the frequency with which a particular collocate appears within a certain range of the focus - its collocation frequency within a given span - and to compare such frequencies to find the most common collocates occurring, say, up to two words before 'ass'.
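
Collocation frequency within a span can be sketched in Python as follows. The function name `collocates`, its `before` and `after` parameters, the crude tokenization and the sample text are all assumptions made for illustration; the sample is invented and is not BNC data.

```python
import re
from collections import Counter

def collocates(text, focus, before=2, after=0):
    """Count word-forms found up to `before` words before and `after` words after `focus`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == focus:
            counts.update(tokens[max(0, i - before):i])   # left-hand span
            counts.update(tokens[i + 1:i + 1 + after])    # right-hand span
    return counts

# Invented sample text, standing in for a corpus.
sample = "You silly ass! What a silly ass he is. The ass carried the load."
left_span = collocates(sample, "ass", before=2)
print(left_span.most_common())
```

Note that this sketch lets the span run across sentence boundaries; whether it should is a design decision that a real study would need to make explicitly.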
        Jones and Sinclair (1974) claim that the probabilities of lexical items occurring in English are generally affected by collocational norms within a span of up to four words. Co-occurrence of two or more words within a short space can be important insofar as that co-occurrence is expected and typical (whether in the language in general, in a particular text-type, or in the style of a particular speaker or author), or insofar as it is unexpected and atypical. Sinclair (1991) argues that recurrent collocational patterns effectively distinguish different senses of the same word - a 'silly ass', while potentially a quadruped, is statistically a biped - and that consequently collocational frequencies can be used to disambiguate word senses. In this he builds on Firth's view that for the lexicographer, "each set of grouped collocations may suggest an arbitrary definition of the word, compound or phrase which is being studied" (Firth 1957: 196). From a converse perspective, deviation from a collocational norm - 'since breakfasts immemorial', say - can be a means of generating particular effects, such as irony (Louw 1993).

        The tendency for one word to occur with another has both grammatical and semantic implications. The collocation of a word with a particular grammatical class of words has been termed colligation. For instance, unlike 'look at', the verb 'regard' appears always to colligate with adverbs of manner, as in 'She regarded him suspiciously' (Bolinger 1976). From a semantic perspective, the habitual collocations of some words mean that they tend to assume the positive or negative connotations of their typical environments - a particular semantic prosody. For example, Sinclair (1991) notes that the verb 'set in' has a negative prosody, because things which typically set in are 'rot', 'decline' etc., making it extremely difficult to use this verb with positive implications. In the same way, the typical collocations of many apparently neutral terms may reveal deep-seated cultural prejudices: Stubbs (1996: 186ff.) notes how the high-frequency collocates of terms such as 'Welsh' or 'Irish' tend to reinforce nationalistic stereotypes.

        Other than in set phrases, collocations and their frequencies are not generally accessible to intuition. They can however be easily identified and quantified by computational methods, in corpora which are sufficiently large for the purpose. Work based on the Birmingham collection of English texts, revealing the extent of collocational patterning in English, has contributed to changing current views of psycholinguistic organization, by providing important evidence that lexical items are to a large extent co-selected rather than combined individually, following what Sinclair terms an idiom principle rather than an open-choice one. A collection of concordances showing the most frequent collocates of some 10,000 words in the Bank of English has recently been published on CD-ROM (Cobuild 1995). Much discussion and research has also been dedicated to the development of appropriate measures of the strength of collocational links (Dunning 1993; Stubbs 1995), and to the automatic listing of significant collocations.
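
To give a concrete idea of what a measure of collocational strength might look like, the Python sketch below computes pointwise mutual information, which compares the observed number of co-occurrences with the number expected if the two words were distributed independently. This is offered only as one simple measure of this general kind; it is not claimed to be the measure preferred in the works cited above, it ignores the size of the span, and the function name and all counts are invented.

```python
import math

def mutual_information(pair_count, count_x, count_y, corpus_size):
    """log2 of observed co-occurrences over those expected under independence."""
    expected = count_x * count_y / corpus_size
    return math.log2(pair_count / expected)

# Invented counts: a corpus of 1,000 words in which x occurs 16 times and
# y 500 times, so that 8 co-occurrences would be expected by chance.
print(mutual_information(8, 16, 500, 1000))   # observed equals expected
print(mutual_information(16, 16, 500, 1000))  # observed is twice expected
```

A score of zero means the pair co-occurs exactly as often as chance would predict; positive scores indicate attraction and negative scores avoidance. Scores based on very small counts are unreliable, which is one reason why alternative measures have been proposed.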

1.3.4 Contrastive studies
The construction of the LOB corpus of British English, on closely parallel lines to the Brown corpus of American English, and their subsequent morphosyntactic annotation (see 1.4.2 on page 24), stimulated a variety of comparative studies, facilitated by the wide distribution of both corpora on a single CD-ROM by ICAME, a highly influential organization of European corpus linguists based at the University of Bergen in Norway. This section reviews some examples of contrastive studies, involving both different corpora and different components of a single corpus, with the purpose of illustrating some of the methodological issues involved.

Comparing geographical varieties and languages Hofland and Johansson (1982) and Johansson and Hofland (1989) report detailed studies of word frequencies in the Brown and LOB corpora, showing, for instance, that 49 of the 50 most frequent words in each corpus are the same. Contrasts concern not only such areas as spelling (e.g. 'colour' vs. 'color'), and different choices of synonyms (e.g. 'transport' vs. 'transportation', 'film' vs. 'movie'), but also different subject matter (e.g. 'tea' vs. 'coffee', 'London' vs. 'Chicago'). Leech and Fallon (1992) suggest that some of these differences in frequency may indicate cultural, rather than simply linguistic differences. Noting, for instance, the considerably more frequent use in Brown of military terms, such as 'armed', 'army', 'enemy', 'forces', 'missile(s)', 'warfare', they suggest that this may reflect a greater concern in the US with military matters (remembering that 1961, the year of the Brown texts, was also that of the Cuban missile crisis). And faced with the greater frequency in LOB of conditional and concessive conjunctions ('if', 'but', 'although', 'though') and words denoting possibility or uncertainty ('possible', 'perhaps', 'unlikely' etc.), they speculate that this may conform to the stereotype of the "wishy-washy Briton who lacks firmness and decisiveness" (Leech and Fallon 1992: 44).
        In reaching these (tentative) conclusions, they note that a relatively small number of words can be analyzed in this way. The LOB and Brown corpora each contain only 50,000 word types (less than the number of headwords in a single-volume dictionary), and among the less frequent words, relative frequency or infrequency may be due to sampling bias. Even at higher frequency levels, differences may be the product of a skewed distribution across texts. While the influence of these sources of error can be reduced by comparing groups of words identified by semantic or other criteria, careful examination of concordances remains necessary to check how often the word or words in question are in fact used with a particular sense - not all occurrences of the word 'film' refer to cinema, and 'tea' is a meal in some parts of Britain, as well as a drink.
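        The kind of comparison discussed above can be illustrated with a minimal sketch. It assumes two tiny invented "corpora" (the sentences below are made up for illustration and are not drawn from Brown or LOB), normalizes counts to occurrences per million tokens so that corpora of different sizes can be compared, and records in how many texts each word occurs, as a crude check against skewed distribution across texts. The function names are hypothetical.

```python
from collections import Counter
import re

def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def per_million(texts):
    """Return (frequency per million tokens, number of texts containing
    each word). The second value is a crude dispersion measure."""
    counts, spread, total = Counter(), Counter(), 0
    for text in texts:
        tokens = tokenize(text)
        total += len(tokens)
        counts.update(tokens)
        spread.update(set(tokens))  # count each word once per text
    freqs = {w: c * 1_000_000 / total for w, c in counts.items()}
    return freqs, spread

# Invented stand-ins for two corpora; real corpora hold hundreds of texts.
corpus_a = ["The army met the enemy.", "The enemy forces withdrew."]
corpus_b = ["Perhaps the tea is ready.", "It is possible, though unlikely."]

freq_a, spread_a = per_million(corpus_a)
freq_b, spread_b = per_million(corpus_b)

for word in ("enemy", "perhaps", "the"):
    print(word,
          round(freq_a.get(word, 0)), round(freq_b.get(word, 0)),
          spread_a[word], spread_b[word])
```

        A count of this kind says nothing about which sense of a word is involved, which is why inspection of concordances remains necessary.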

Multilingual, comparable and parallel corpora There is an increasing tendency to apply corpus techniques to the task of comparing different languages. Where a corpus consists of texts selected using similar criteria in two or more languages, comparisons can be made at many different levels, ranging from lexicogrammatical preferences to rhetorical organization. One particularly interesting type of multilingual corpus is the parallel corpus, consisting of texts that are actually translations of each other: prototypical instances are official documents produced in multilingual environments, such as the UN and EU, or the Canadian Hansard, which is published in both English and French. Such corpora have clear utility for the study of translation itself, as well as providing a useful focus for contrastive studies of the differences between particular languages.
        To facilitate comparison, the texts in parallel corpora may be aligned, identifying equivalences on a sentence-by-sentence, phrase-by-phrase, or word-by-word basis, and much effort has gone into the development of software to align parallel texts automatically. The major fields of application have so far been in developing and testing machine translation packages and producing computerized translation aids (such as bilingual dictionaries and terminology databanks), but such corpora also have much to teach about the universals and specifics of language, and the process of translation. For instance, the English-Norwegian parallel corpus project (Johansson and Hofland 1993; Johansson and Ebeling 1996; Aijmer et al 1996) lists among its fields of investigation not only the similarities and differences in the lexicogrammatical, rhetorical, and information structure of texts in the two languages, but also such questions as:

To what extent are there parallel differences in text genres across languages?

In what respects do translated texts differ from comparable original texts in the same language?

Are there any features in common among translated texts in different languages (and, if so, what are they)?
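
        The sentence-by-sentence alignment of parallel texts mentioned above can be sketched as a small dynamic-programming search. This is a deliberately simplified illustration, not the method of any project cited here: it assumes that a sentence and its translation have roughly the same length in characters, allows only one-to-one, one-to-two, and two-to-one pairings, and uses an arbitrary penalty (the hypothetical constant MERGE_PENALTY) to discourage merging sentences. The example sentences are invented.

```python
MERGE_PENALTY = 5  # arbitrary charge for 1-2 and 2-1 pairings (an assumption)

def align(src, tgt):
    """Pair up two lists of sentences so that the total difference in
    character length between paired segments is as small as possible."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue  # this state cannot be reached
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                penalty = 0 if (di, dj) == (1, 1) else MERGE_PENALTY
                c = cost[i][j] + abs(len_s - len_t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment possible with the allowed pairings")
    # Follow the back-pointers from the end to recover the pairings.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]

english = ["The session is open.",
           "The minister spoke briefly and then sat down."]
french = ["La séance est ouverte.",
          "Le ministre a parlé brièvement.",
          "Puis il s'est assis."]

for en, fr in align(english, french):
    print(en, "<->", fr)
```

        In this toy case the second English sentence is paired with two French sentences. Practical aligners use statistically motivated costs and often lexical evidence as well.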

Comparing diachronic varieties English-language corpus-building dates back at least thirty years. The continued availability of these pioneering corpora has made possible a range of contrastive studies investigating changes in the English language over time. There is also a growing interest in the construction of specifically designed diachronic corpora which sample language production over much longer time periods.
        Examples of the first kind include a 1991 version of LOB, using identical sampling criteria as far as possible, recently completed at Freiburg (Mair 1993). As well as facilitating diachronic comparison of particular linguistic features, such corpora may also provide a useful yardstick for comparing studies based on the larger corpora of recent years with ones based on the smaller corpora which preceded them.
        Notable examples of the second kind include the Archer corpus (Biber, Finegan and Atkinson 1994), which contains samples of eleven different registers from different historical periods, and the Helsinki corpus of English texts: diachronic and dialectal (Kytö 1993: see 1.3.1 on page 10). Kytö et al (1995) provides a useful checklist of new projects in this expanding field.
        Change over time can also be investigated by contrasting the usage of different age groups, as further discussed in the next section.

Comparing categories of users Where corpora provide information as to the social and linguistic provenance of speakers and hearers, or of writers and readers, they can be used to compare the language of different groups according to such variables as region, age, sex, and social class, provided that a demographically balanced sample of language users has been taken. For instance, Stenström (1991) used the London-Lund corpus to study the relationship between gender and the use of expletives relating to religion, sex and the body in speech. She found that while female speakers tended to use such words more often than male, they tended to choose expletives from a 'heaven' group, while male speakers used ones relating to 'hell' and 'sex'. Investigating the function of these words in the light of their position in the utterance, she also found that the male speakers used them to emphasize their own contributions, whereas female speakers used them to give responses and invitations to continue.

Comparing different uses of language The construction of the London-Lund corpus of spoken English spawned a large number of studies comparing speech with writing, generally using the LOB corpus as evidence of the latter. These have highlighted differences in the relative frequencies of words and structures in the two modes (in speech the most common word is 'I', while in writing it is 'the'), as well as facilitating the identification and description of features whose use appears to be specific to the spoken language, most notably discourse-structuring elements such as 'well', 'I mean', and 'you know'.
        Corpora have also been extensively used to investigate the ways in which genres differ linguistically, attempting to characterize genres by the relative frequency of particular features. Conversely, insofar as texts can be categorized statistically according to linguistic features, what correspondence is there between those categorizations and the lists of text-types employed, for instance, in corpus design? Biber (1988) compared the frequencies of a range of linguistic features which can be automatically counted (word length, type/token ratio, nouns, prepositions, first/second/third person pronouns, tense, voice, aspect, 'wh-' relative clauses, synthetic vs. analytic negation, 'private' vs. 'public' verbs, etc.) in samples from texts of different types in the LOB and London-Lund corpora, using cluster and factor analysis to identify eight basic classes of text, grouped by similar scores on five dimensions of variation. Biber and Finegan (1994) employ similar methods to investigate variation within texts, showing that in medical research articles, frequencies of a number of lexicogrammatical phenomena vary according to the section of the article sampled.
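
        A few of the automatically countable features listed above can be computed with very little code. The sketch below covers only the counting stage, for three features (mean word length, type/token ratio, and first person pronouns per thousand words). It does not attempt the cluster or factor analysis, and the two sample texts and the pronoun list are invented for illustration.

```python
import re

# A small, hand-picked list of first person pronouns (an assumption).
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}

def features(text):
    """Return a few simple, automatically countable features of a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    return {
        "mean_word_length": sum(len(t) for t in tokens) / n,
        "type_token_ratio": len(set(tokens)) / n,
        "first_person_per_1000":
            1000 * sum(t in FIRST_PERSON for t in tokens) / n,
    }

conversation = "Well I mean I think we should go you know"
article = "The results of the analysis indicate significant variation"

for name, sample in (("conversation", conversation), ("article", article)):
    print(name, features(sample))
```

        Note that the type/token ratio falls as texts get longer, so it is only comparable across samples of similar length.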

1.3.5 NLP applications
There has recently been an increased awareness of the potential which corpus methods offer for tackling a number of problems in the field of natural language processing (NLP), that is, the development of automatic or semi-automatic systems for analyzing, 'understanding', and producing natural language.
        Corpora are increasingly used in the development of NLP tools for applications such as spell-checking and grammar-checking, speech recognition, text-to-speech and speech-to-text synthesis, automatic abstraction and indexing, information retrieval, and machine translation. A major problem to be faced by all NLP systems is that of resolving ambiguity, be this selecting which of two or more possible orthographic transcriptions might match a given acoustic input ('whales', 'Wales', or 'wails'?), or deciding whether an instance of the word 'bank' refers to a financial institution or a landscape feature, and hence how it should be translated into, say, French, or how the text that contains it should be classified for retrieval purposes.
        The limited results achieved in such areas using traditional rule-based models of language have led to an increasing interest in probabilistic models, where probabilities are calculated on the basis of frequencies in corpus data (Church and Mercer 1993). Traditional spell-checkers, for instance, are based only on a dictionary of possible orthographic forms in the language, so that they fail to recognize errors which are nonetheless acceptable forms (such as 'form' for 'from'). Performance in such cases can be improved by considering the probability that the form typed by the user will occur after the previous word, where this probability has been calculated by analyzing a corpus for the language concerned. (For instance, it is highly unlikely that the word between 'the' and 'typed' in the previous sentence could be 'from'.)
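
        The idea of choosing between 'form' and 'from' on the basis of the preceding word can be sketched as a bigram model. This is a minimal illustration under stated assumptions: the four training sentences are invented, the function names are hypothetical, and add-one smoothing is used simply to avoid zero probabilities. A real system would estimate the probabilities from a large corpus.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count single words and adjacent word pairs in the training text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one smoothing."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def best_candidate(prev, candidates, unigrams, bigrams):
    """Return the candidate most likely to follow the previous word."""
    return max(candidates,
               key=lambda w: bigram_prob(prev, w, unigrams, bigrams))

# Invented training sentences standing in for a corpus.
corpus = [
    "the form typed by the user",
    "fill in the form",
    "a letter from the bank",
    "he came from london",
]
unigrams, bigrams = train_bigrams(corpus)

print(best_candidate("the", ["form", "from"], unigrams, bigrams))
print(best_candidate("came", ["form", "from"], unigrams, bigrams))
```

        With this toy data the model prefers 'form' after 'the' and 'from' after 'came', even though both are acceptable dictionary forms.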
        The analysis of frequencies of particular features in corpora underlies a wide variety of NLP applications based on probabilistic techniques, such as:

categorization of specific texts, for instance by identifying their type, semantic field, and keywords as a basis for automatic indexing and abstracting, or extracting terminology;

refinement of question-answering and information retrieval systems, enabling them to employ or suggest additional or alternative search terms to interrogate textual databases, on the basis of collocational regularities in corpus data, and to filter retrieved information by checking its conformity to the typical collocational patterns of the search terms proposed;

improvement of multilingual retrieval of texts, and identification of terminological equivalents in different languages, on the basis of lexical and collocational equivalences identified in parallel multilingual corpora (see 1.3.4 on page 15).

        A further use of corpora in NLP is as testbeds to evaluate applications, be these theoretically motivated or probabilistic. Probabilistic models of language can, to a certain extent, be self-organizing, and in this respect corpora can provide training instruments for software which learns probability through experience, or refines an initial model in a bootstrapping process (Atwell 1996). A system which needs to disambiguate the term 'bank', for instance, can analyze a corpus to learn that the landscape sense generally collocates with 'river', 'flower', etc., while the financial one collocates with 'merchant' and 'high street'. Such uses typically call for substantial annotation of corpora in order to reduce ambiguity in the training materials, and NLP applications in many cases overlap with applications designed to annotate corpora in various ways (see 1.4.2 on page 24).
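
        The way a system might learn the collocates of each sense of 'bank' from annotated training material can be sketched as follows. The sketch assumes a handful of invented sentences already labelled with a sense, a context window of three words on either side, and a small hand-picked list of function words to ignore. All names and values are illustrative assumptions rather than a description of any particular system.

```python
from collections import Counter

# Function words ignored when counting collocates (a hand-picked assumption).
STOPWORDS = {"the", "a", "on", "in", "for", "was", "from", "they", "she"}

def context(words, i, window=3):
    """Return the content words within `window` positions of index i."""
    nearby = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
    return [w for w in nearby if w not in STOPWORDS]

def learn_collocates(labelled_sentences, target):
    """Build a collocate count for each sense label of the target word."""
    profiles = {}
    for sense, sentence in labelled_sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            if w == target:
                profiles.setdefault(sense, Counter()).update(context(words, i))
    return profiles

def guess_sense(sentence, target, profiles):
    """Pick the sense whose known collocates best match the context."""
    words = sentence.lower().split()
    ctx = context(words, words.index(target))
    scores = {sense: sum(counts[w] for w in ctx)
              for sense, counts in profiles.items()}
    return max(scores, key=scores.get)

# Invented, sense-labelled training sentences.
training = [
    ("landscape", "they sat on the river bank among the reeds"),
    ("landscape", "the river bank was muddy"),
    ("financial", "she works for a merchant bank in london"),
    ("financial", "the high street bank closed early"),
]
profiles = learn_collocates(training, "bank")

print(guess_sense("a path along the river bank", "bank", profiles))
print(guess_sense("a loan from the merchant bank", "bank", profiles))
```

        Without the list of ignored function words, very frequent items such as 'the' would dominate the scores, which is one reason why training material of this kind usually needs careful preparation.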

1.3.6 Language teaching
The growing variety of corpus applications in the field of English language teaching is reviewed by Murison-Bowie (1996). Corpora have already had a considerable influence in the creation of new dictionaries and grammars for learners, where the use of corpus data has allowed:

more accurate selection of words and senses for inclusion, based on frequency of occurrence;

introduction of information concerning the relative frequency of each word and of the different senses of each, and their use in different genres and registers;

citation of actual rather than invented examples, selected to illustrate typical uses and collocations.

        Sinclair (1987) provides a detailed discussion of these issues in reference to the creation of the Collins Cobuild Dictionary.

        Kennedy (1992) reviews the long tradition of pre-electronic corpus work in language teaching. Many of the studies he discusses aimed to identify the most frequent words and grammatical structures in the language, with a view to optimizing the design of syllabuses and the grading of materials. Such goals have received new impetus from the availability of electronic corpora. Analysis of the Birmingham collection of English texts underlay the selection of the 'lexical syllabus' proposed by Willis (1990); Grabowski and Mindt (1995) used the Brown and LOB corpora to create a list of irregular verbs ordered according to frequency, arguing that by following this order in syllabus design, teaching should achieve maximum yield for the students' effort, irrespective of when the learning process is broken off.

        Corpus data have also provided a means of evaluating conventional syllabuses. Ljung (1991) compares the lexis of textbooks of English as a foreign language with that of a corpus of non-technical writing, while Mindt (1996) compares the treatment of future time reference in textbooks and learner reference grammars with corpus data. Such studies use corpora to highlight actual frequency of occurrence, which, while not the only criterion for deciding syllabus content or the form of materials (Widdowson 1991), can clearly provide teachers and textbook writers with an important tool to assess the pedagogic suitability and adequacy of particular choices (Biber et al 1994).

        There is also a growing interest in providing teachers and learners with direct access to corpora as resources for classroom or individual work. Fligelstone (1993) suggests that learners can use corpora to find out about the language for themselves and hence to question prescriptive specifications, for instance by exploring the nature of idioms and collocations, rhetorical questions, the use of sentence-initial 'and', etc. Similarly, Aston (in press) argues that with appropriate training, advanced learners can use large corpora as reference tools which overcome many of the limitations of existing dictionaries and grammars by providing a much larger number of more contextualized examples. Corpora may not only be a source of information about the language in question: Fligelstone notes that they can also provide encyclopedic knowledge, making them a useful tool to gather ideas about a subject in order to write or talk about it, while Aston (1995) suggests that concordancing software enables learners to browse the corpus texts in a serendipitous process where they not only analyze language but experience it as communicative use. In such ways the growing availability of corpora offers learners a new kind of resource which can complement the traditional dependency on teacher, textbook, and reference book.

1.4 How should a corpus be constructed?
We noted above that a corpus is not a random collection of text. In this section, we review some of the major issues relating to corpus construction. We discuss first some basic design principles, concerning size, sampling practice, and composition, and then consider the various kinds of encoding, annotation, and transcription policies which may be adopted.

1.4.1 Corpus design
In designing a corpus to address a particular purpose, two groups of criteria must be considered. On the one hand the size of the corpus and of its component parts, and on the other the material actually selected for inclusion, may each have crucial effects on its usability.

Corpus size and sample size The frequency of different word forms in a corpus generally follows a Zipfian distribution (Zipf 1935), whereby the second most frequent word occurs approximately half as often as the most frequent one, the third most frequent word approximately one third as often, and so on. All but the most frequent words are extremely rare. Corpora therefore need to be very large and heterogeneous if they are to document as wide as possible a range of uses of as many linguistic features as possible.
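
        The relationship just described can be checked against any word list by comparing each observed count with the count predicted from the most frequent word, that is, the top count divided by the rank. The sketch below does this for an invented token list constructed to fit the pattern exactly. Real corpus data follow it only approximately, and the function name is hypothetical.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, observed count, predicted count) for each word,
    where the prediction is the top count divided by the rank."""
    ranked = Counter(tokens).most_common()
    top_count = ranked[0][1]
    return [(rank, word, count, top_count / rank)
            for rank, (word, count) in enumerate(ranked, start=1)]

# Invented data built to fit the pattern exactly.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3

for rank, word, observed, predicted in rank_frequency(tokens):
    print(rank, word, observed, predicted)
```

        On this pattern the word at rank 1000 is expected only one thousandth as often as the most frequent word, which is why large corpora are needed to observe rarer items at all.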
        Even where they are relatively frequent, features which are unevenly distributed across different types of text in the corpus may not be adequately represented. Sinclair (1991: 24) notes of the Brown and LOB corpora that they only provide reliable sources of information concerning relatively frequent words that occur in a wide range of texts. They are much less reliable for words which occur only in certain text-types, because "the sub-categories necessary to balance the sample are not in themselves reasonable samples because they are too brief". Increasing corpus size can go some way to solving these problems by providing larger samples for each sub-category.
        While an increase in size provides more data, it also tends to entail less detailed analyses: it is striking how many descriptive studies have analyzed only small corpora (or small samples of larger ones), often because of the need to inspect and categorize data manually. The analysis of larger corpora is heavily dependent on the use of automatic or semi-automatic procedures able to identify particular linguistic phenomena (see 1.4.2 on page 24). The availability of such procedures is still limited in many areas.
        Sinclair (1991: 24ff.) has argued that the static sample corpus consisting of a fixed collection of data should ideally give way to the monitor corpus where information could be gleaned from a continuous stream of new text "as it passes through a set of filters which will be designed to reflect the concerns of researchers." In the lexicographical field, for instance, procedures might be designed to capture new word forms or usages, and shifts in frequency of use. Such a corpus would allow the user to detect phenomena which would be inadequately represented in even a very large sample corpus, and to monitor changes in the language as they took place. The Bank of English project at the University of Birmingham puts this idea into practice (Sinclair 1992). At the time of writing, this corpus contains over 300 million words, but is continually expanding and being monitored by a set of software tools which categorize incoming data automatically for particular purposes.
        In order to include both a wide range of text-types and a large number of different texts of each type, early corpora included relatively brief extracts from each text sampled: the one million-word Brown and LOB corpora each consisted of randomly-selected 2000-word samples from 500 texts, of 15 different types. A corpus composed of short samples of equal length is less likely to give skewed results due to the influence of particular source texts, but is of little use for the study of large textual features such as narrative organization, or of within-text variation (Biber and Finegan 1994). To permit the study of such phenomena, Sinclair (1996) has argued that large corpora should be composed of whole texts wherever possible.
        The continued growth in the size of corpora has generally implied an increase in sample sizes as well as in the number of samples. However, the inclusion of complete texts may not always be possible, either for copyright reasons, or because the notion of 'completeness' is inappropriate or problematic (Is a newspaper a complete text? Is each story in a newspaper a complete text?). Complete texts may also vary greatly in size, giving rise to problems of balance.

Corpus composition A corpus which claims to 'characterize the state of a language' must define both the linguistic universe which has been sampled and the sampling procedures followed. Is it intended to characterize only the speech and writing of competent native-speakers? If so, how are the latter to be defined? Is it to include as wide a variety of different types of language as possible? Should its composition reflect the relative frequencies with which these different types occur? And if so, should these frequencies be calculated on the basis of reception (the language people hear and read), production (the language people speak and write), or both? In the first case priority will be given to those text-types which are most widely and frequently experienced, such as casual conversation, everyday workplace and service encounters, television, radio and the popular press. In the second case, while much of the everyday dialogue content may be similar, the rest of the corpus may look very different, since most texts are produced for small audiences.
        Many of the criteria for the composition of a corpus are determined by its intended uses. The Survey of English Usage aimed to describe the grammatical repertoire of adult educated speakers of English with the aid of a corpus which was "reasonably representative of the repertoire of educated professional men and women in their activities, public and private, at work and at leisure, writing and speaking" (Quirk 1974: 167). Given these goals, it was designed to sample a wide range of text types, deciding the proportions of each largely on their assumed frequency of production in the language as a whole. Most large mixed corpora have tended to follow the Survey in aiming to cover users' production repertoires (though not merely the educated, professional and adult) by drawing up a list of text types to be included, deciding the proportions of the corpus to be constituted by each, and then selecting texts for each type using a combination of random and controlled sampling techniques.
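
        The procedure of drawing up a list of text types, deciding the proportion of the corpus each should constitute, and then selecting texts at random within each type can be sketched as follows. The text types, proportions, and catalogue of titles are invented for illustration, and the function names are hypothetical. The sketch covers only the random part of the selection and leaves out the controlled sampling which real projects combine with it.

```python
import random

def allocate(proportions, total_samples):
    """Split total_samples across text types in the given proportions,
    rounding by largest remainder so that the counts sum to the total.
    The proportions are assumed to sum to 1."""
    exact = {t: p * total_samples for t, p in proportions.items()}
    counts = {t: int(x) for t, x in exact.items()}
    leftover = total_samples - sum(counts.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t],
                          reverse=True)
    for t in by_remainder[:leftover]:
        counts[t] += 1
    return counts

def select_texts(catalogue, counts, seed=0):
    """Randomly choose the allocated number of texts for each type.
    A fixed seed makes the selection repeatable."""
    rng = random.Random(seed)
    return {t: rng.sample(catalogue[t], counts[t]) for t in counts}

# Invented text types and proportions.
proportions = {"press": 0.5, "fiction": 0.3, "academic": 0.2}
catalogue = {
    "press": [f"press-{i}" for i in range(20)],
    "fiction": [f"fiction-{i}" for i in range(20)],
    "academic": [f"academic-{i}" for i in range(20)],
}

counts = allocate(proportions, 7)
print(counts)
print(select_texts(catalogue, counts))
```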
        There are, of course, many different ways of characterizing texts, and hence many text typologies. One of the more fundamental distinctions is between spoken and written materials. The Survey of English Usage contained equal quantities of each, and subsequent compilers of mixed corpora have generally agreed that ideally this proportion should be respected - though there is no particular reason for thinking that writing and speech are equally present in either production or reception of the language as a whole. However, since it is for the moment much more expensive to obtain speech data, which has to be recorded and transcribed, than written texts, many of which are already available in machine-readable form, large mixed corpora generally contain much smaller proportions of speech.
        More complex text typologies have been based on such concepts as the field (that is, the topic and purpose) of the text; on sociolinguistic factors determining its tenor (for example, the context in which the text is produced or received, the participants and their inter-relationships, etc.); or on its mode (that is, whether speech is monologue or dialogue, face-to-face or broadcast, prepared or spontaneous; or whether writing is published or unpublished). Atkins et al (1992) list as many as 29 parameters to be considered in constructing a 'balanced' corpus. Within each category, reception criteria may be used to complement production ones, for instance by preferring bestsellers to remaindered novelettes when sampling published fiction.
        There is ample evidence that word frequencies and other linguistic features vary widely within different text-types, both with respect to each other, and with respect to the whole of a corpus, whatever typology is employed. For instance, in most forms of speech, 'sure' is more common than 'certain', while in written social science texts the opposite is the case (Biber and Finegan 1989). Biber (1993) demonstrates that analyses based on restricted samples cannot be generalized to language as a whole. It has consequently been argued that a balanced corpus is useful in a specific application only to the extent that it includes an adequate sample of the category in question, which can be separated out and treated as a corpus in its own right. However, balanced corpora can at least provide a baseline against which variation amongst pre-defined categories can be measured. Halliday (1992: 69) argues "if we recognize departure from a norm, then there has to be a norm to depart from. If we characterize register variation as variation in probabilities, as I think we must, it seems more realistic to measure it against observed global probabilities than against some arbitrary norm such as the assumption of equiprobability in every case."

1.4.2 Encoding, annotation, and transcription
Simple lexical analysis of a corpus of written texts requires only a computer-processable version of the text. However, for the full range of analytic possibilities sketched out in section 1.2 on page 5 above, some thought must also be given to the ways in which the text and its context are to be encoded, that is, the way in which particular features of them can be made explicit and hence processable. This is especially important for corpora derived from spoken language, where the process of transcription immediately confronts the analyst with many difficult theoretical and methodological issues.

Encoding and annotation A corpus may simply consist of sequences of orthographic words and punctuation, sometimes known as plain text. However, texts are not just sequences of words; they have many other features worthy of attention and analysis. At the very least, we want to distinguish and describe the different texts in the corpus, as well as their different components. Such text descriptions may include bibliographic details, details of the nature of the sample (complete or otherwise), or classification in terms of the parameters employed in designing the corpus.
        When describing the components of written texts (other than words), it is useful to indicate the boundaries of chapters, sections, paragraphs, sentences, etc., and the specialized roles of headings, lists, notes, citations, captions, references, etc. For spoken texts, indications of the beginnings and ends of individual utterances are essential, as is an indication of the speaker of each. It may also be desirable to encode paralinguistic phenomena such as pausing and overlap, and non-verbal activity such as laughter or applause. For either kind of text, it may be helpful to include editorial information about the status of the electronic text itself, for example to mark corrections or conjectures by the transcriber or editor.    
        A further type of information which may be provided is linguistic annotation of almost any kind, attached to components at any level from the whole text to individual words or morphemes. At its simplest, such annotation allows the analyst to distinguish between orthographically similar sequences (for example, whether the word 'Frank' at the beginning of a sentence is a proper name or an adjective), and to group orthographically dissimilar ones (such as the negatives 'not' and '-n't'). More complex annotation may aim to capture one or more syntactic or morphological analyses, or to represent such matters as the thematic or discourse structure of a text.
        Types of linguistic annotation that have been employed with corpora include the following:

part-of-speech or word-class Placing a tag alongside each word in the corpus to indicate its word-class can disambiguate different grammatical uses of a word such as 'works', which may be a plural noun or a singular verb. Such part-of-speech tagging can also help identify grammatical patterns. While many existing corpora have been manually or automatically tagged in this way, no standard set of part-of-speech tags has yet been defined.
lemmatization and morphological analysis Lemmatization involves the use of tags to indicate the relationship of each word-form to its root (e.g. that 'took' is the past tense form of 'take'). It enables derived and inflected forms of a word to be retrieved and counted along with its root. While lemmatization provides a useful way of grouping data for many descriptive purposes, particularly in languages with many declensions and conjugations, corpus-based research suggests that different forms of lemmas do not always share the same meaning, but tend to occur in distinctive contexts (Sinclair 1991).
word-sense Tags can also be used to distinguish different senses of the same word (e.g. 'table' meaning 'piece of furniture' as opposed to 'data in rows and columns'), on the basis of an existing dictionary or thesaurus. While word-sense annotation is quite extensively used in the fields of machine translation and information retrieval (Guthrie et al 1994), it clearly prejudges the issue for corpora whose goals include lexicographic description.
syntactic role The parsing of corpora involves the addition of annotation to indicate the grammatical function of each word, describing the structure of each sentence as a set of labelled bracketings or as a tree. A number of small parsed corpora have been derived from pre-existing larger corpora. These include the Gothenburg corpus (Ellegård 1978) derived from parts of Brown; the LOB corpus treebank (Leech and Garside 1991) derived from LOB; and the Susanne corpus (Sampson 1994), which combines results from both. While most parsing has been done by hand, considerable research effort has gone into the development of automatic parsers. Notable examples include the English Constraint Grammar Parser developed at Helsinki (Karlsson 1994) and the TOSCA system developed at Nijmegen (van Halteren and Oostdijk 1993). As syntactic analysis is often necessary to decide who did what to whom in English, parsed corpora have an important role in many NLP applications. Parsing schemes are, however, highly theory-dependent, and there is relatively little consensus in the field.
pragmatic annotation In a sentence like 'Not there they won't', pronouns, deixis, and ellipsis refer to concepts which are (probably) more fully expressed elsewhere in the text. Identifying those concepts is often important for natural language understanding systems and for machine translation. Substantial work has been carried out on procedures to insert pragmatic annotation linking such anaphoric features to their antecedents (Fligelstone 1992; Garside 1993). Annotation indicating other pragmatic features such as the function of individual speech acts and overall discourse structure (Coulthard 1994) has been so far limited to very small samples, given the lack of consensus as to relevant units and categories, and of explicit algorithms for their identification.
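
        The effect of the first two kinds of annotation listed above can be shown with a small hand-annotated example. The sketch assumes an invented sequence of words, each carrying a word-class tag and a lemma. The tag names (VERB, NOUN, etc.) are purely illustrative, since no standard tagset is implied. It shows that counting by word form alone conflates the two uses of 'works', while the tags separate them and the lemmas group 'took' with 'takes'.

```python
from collections import Counter, namedtuple

# Each token carries its word form, a word-class tag, and a lemma.
Token = namedtuple("Token", "form pos lemma")

# Hand-annotated toy data; the tag names are illustrative assumptions.
tokens = [
    Token("She", "PRON", "she"),
    Token("works", "VERB", "work"),
    Token("on", "PREP", "on"),
    Token("the", "DET", "the"),
    Token("works", "NOUN", "work"),
    Token("he", "PRON", "he"),
    Token("took", "VERB", "take"),
    Token("and", "CONJ", "and"),
    Token("takes", "VERB", "take"),
]

# Plain text: the two uses of 'works' cannot be told apart.
form_counts = Counter(t.form.lower() for t in tokens)

# Annotated text: tags separate the uses, lemmas group inflected forms.
lemma_pos_counts = Counter((t.lemma, t.pos) for t in tokens)

print(form_counts["works"])
print(lemma_pos_counts[("work", "VERB")], lemma_pos_counts[("work", "NOUN")])
print(lemma_pos_counts[("take", "VERB")])
```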

        Much of the concern with annotated corpora derives from the need to provide training materials and testbeds for software which will annotate text automatically for various NLP applications (see 1.3.5 on page 18). Some descriptive linguists have on the other hand argued that corpora should only be made available as plain text, given that annotation always involves introducing an interpretation of some kind (Sinclair 1991).

Transcribing speech No transcript, however detailed, is able to provide all the information on which hearers draw when understanding speech (Cook 1995). French (1992) proposes four levels of transcription, ranging from a broad orthographic representation of the words spoken, to a narrow phonetic transcription with detailed prosodic information. The act of transcribing spoken data is in itself a kind of encoding, making explicit an interpretation of the original sound wave. Even a broad transcription generally implies, for instance, distinguishing different speakers' utterances, dividing these into sentences or prosodic units, and dividing the latter into words, with a disambiguation of homophonous elements with distinct orthographic forms. Narrower transcription involves additional encoding of prosodic features, such as stress and intonation, as well as of paralinguistic ones such as changes in voice quality, pausing, and nonvocal events (phone rings, applause, shifts in position and eye-contact, etc.).
        For studies of phonology and dialectology, an orthographic transcription is clearly unlikely to be adequate unless supported by more detailed phonemic or phonetic data. Such coding systems are not discussed here. Their complexity, requiring highly-specialized expertise, has meant they are only currently available for relatively small corpora of speech. Given its costs, there has generally been a trade-off between size and detail in corpus transcription. Edwards and Lampert (1993) provide a detailed survey of different transcription systems, both in terms of the features encoded and their representation (for a summary, see Edwards 1995).