
© Copyright Michael Stubbs 2000.

Readers are welcome to print individual copies of this paper for private study. Reference should always be made to the original place of publication which is:

C Heffer & H Sauntson eds (2000) Words in Context: A Tribute to John Sinclair on his Retirement. English Language Research Discourse Analysis Monograph 18. University of Birmingham. [CD-ROM.]


Michael Stubbs

FB2 Anglistik, Universität Trier, D-54286 Trier, Germany


This paper illustrates two principles to bear in mind when drawing conclusions from corpus data. (1) It is unwise to rely on a single corpus, however large or well designed it might be: all corpora have in-built biases, and findings should therefore be checked in different independent corpora. A subsidiary point is that it is sometimes necessary to check findings in very large text collections. (2) A method of study should be appropriate to its object of study. Here, this means that methods should be sensitive to the main broad finding of corpus linguistics: that there is a layer of organization between lexis and syntax, which is variously described in terms of lexico-grammar, extended lexical units (Sinclair 1998), idiom schemas (Moon 1998), and semantic coagulations (Teubert 1999). A related finding is that these schemas often have evaluative connotations. Corpus methods must be sensitive to this pragmatic turn (Weigand 1998) in work on lexis.

I will illustrate these two principles with examples from the following data-base, corpora and text collections:

Several terms for different kinds of corpus have become fairly standard (Sinclair 1995). The only distinction which I need here is between a corpus and a text collection. A corpus has been designed on given criteria for linguistic research. Usually these criteria are external and sociolinguistic, and based on a theory of text-types. Examples of corpora are COBUILD and BNC. A text collection has not been designed for any linguistic purpose. It has been put together, or has simply accumulated, for some independent reason, and might be used, opportunistically, for linguistic purposes. An example is TIMES. (Newspapers on CD-ROM may constitute a very good, homogeneous sample of specific text-types.) The largest available text collection is the World Wide Web itself.

Here then are three examples which illustrate the two principles.


A major finding of corpus linguistics is that all words occur in predictable collocations. There are very few absolutely fixed phrases: words have typical uses, and occur in central semantic patterns, but these patterns almost always have considerable lexical variation. The hypothesis is that there are simple underlying semantic units (for which relatively small corpora may provide sufficient evidence), but that these units typically show considerable surface lexical variation (for which very large text collections may be required as evidence).

For example, as noted briefly by Fillmore (1997), the phrase 'ripe old age' usually has very positive connotations. It occurs in a longer schema: "people want to live to a ripe old age" and/or "people are admired for living to a ripe old age". Attested examples of such uses are:

                            hoping to live to a ripe old age
                      enjoying good health to a ripe old age
                     if you expect to live to a ripe old age
           stand a better chance of living to a ripe old age
                                  living to the ripe old age of 70 years
                                    reached the ripe old age of 75
                         until his death at the ripe old age of 87
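Concordance lines of this kind, with the search phrase aligned in a single column, can be produced by a few lines of code. The following is only a minimal sketch: the function name `kwic`, the context width, and the two-sentence sample text are assumptions made for illustration, and the sample is invented rather than taken from any of the corpora discussed here.

```python
import re

def kwic(text, phrase_pattern, width=40):
    """Return concordance lines with the matched phrase aligned in one column."""
    lines = []
    for m in re.finditer(phrase_pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Collapse line breaks and repeated spaces so each hit stays on one line.
        left = " ".join(left.split())
        right = " ".join(right.split())
        # Right-align the left context so that every match starts in the same column.
        lines.append(f"{left:>{width}} {m.group(0)} {right}")
    return lines

# Invented sample text (an assumption for illustration, not corpus data).
sample = (
    "She was hoping to live to a ripe old age. "
    "He worked until his death at the ripe old age of 87."
)
for line in kwic(sample, r"\b(?:a|the) ripe old age\b"):
    print(line)
```

Sorting such lines by the words to the left or right of the node phrase is then a simple way of making recurrent collocates visible.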

The admiration (and sometimes slight envy?) is signalled by the lexis in examples such as

A general model which is useful in showing the construction of such patterns is proposed by Sinclair (1998). Words occur in extended lexical units which comprise a configuration of characteristic items from

Thus, the phrase 'ripe old age' frequently occurs with the individual collocates 'LIVE to' or 'REACH', but also with other semantically related verbs such as 'ATTAIN', 'SURVIVE to', 'GO on to'. There are preferred colligations: often a preceding verb plus preposition plus determiner. There are semantic preferences: often verbs such as 'ASPIRE', 'HOPE', 'INTEND', 'STRIVE' and 'WANT', and/or words concerning dangers and risks, such as 'death', 'maximum life-span', and 'perils of infancy'. Other vocabulary may explicitly express the positive discourse prosody and the speaker's admiration for the achievement: reaching a 'ripe old age' is a good thing to do.

It is easy to state the prototypical semantic pattern, but impossible to list all the variant phrases. First, the collocates above are only some possibilities. Second, there are further variants of the core phrase. The only obligatory word is 'age'. The phrase 'ripe old age' is frequent, but 'ripe age', 'grand (old) age' and 'good (old) age' also occur. Therefore a very large text collection may be required to show the relative frequency of variants, since even if words are individually quite frequent, collocations of these words may drop to zero in corpora as large as 100 million words. Consider the frequencies of related phrases in four sets of material:

                             TIMES   COBUILD    BNC     ALLTHEWEB

           ripe old age        15      11        33        7965
           good old age         1       1         7        1593
           grand old age        5       2         7         541

           LIVE to a/the
              ripe old age      4       1        12        1791
              good old age      1       1         1         250
              grand old age     0       0         0          40

           REACH a/the
              ripe old age      0       1         3         454
              good old age      0       0         0          19
              grand old age     1       0         2          84

Only ALLTHEWEB provides enough examples to show with confidence that all these variants do occur, and which are more or less frequent (and these variants are only a small sample of the possibilities).
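Counts of this kind could be collected along the following lines. This is a sketch under stated assumptions, not the procedure actually used for the table: the word-forms of each lemma are listed by hand, the names `VERB_FRAMES` and `count_variants` are hypothetical, and the two "corpora" are invented toy strings.

```python
import re
from collections import Counter

# Word-forms of each lemma are listed by hand (an assumption; a lemmatizer
# could be used instead). Each frame is a regular expression fragment.
VERB_FRAMES = {
    "LIVE to": r"(?:live|lives|lived|living)\s+to",
    "REACH": r"(?:reach|reaches|reached|reaching)",
}
ADJECTIVES = ["ripe", "good", "grand"]

def count_variants(text):
    """Count '<adj> old age', alone and after LIVE to / REACH plus a/the."""
    counts = Counter()
    for adj in ADJECTIVES:
        phrase = rf"{adj}\s+old\s+age"
        counts[f"{adj} old age"] = len(re.findall(rf"\b{phrase}\b", text, re.I))
        for label, frame in VERB_FRAMES.items():
            pattern = rf"\b{frame}\s+(?:a|the)\s+{phrase}\b"
            counts[f"{label} a/the {adj} old age"] = len(re.findall(pattern, text, re.I))
    return counts

# Invented toy texts standing in for corpora (not real data).
corpora = {
    "toy_A": "He lived to the ripe old age of 90. She reached a grand old age.",
    "toy_B": "They hope to live to a ripe old age. A good old age is a blessing.",
}
for name, text in corpora.items():
    print(name, dict(count_variants(text)))
```

Running the same function over each set of material keeps the search patterns identical across corpora, so that differences in the counts reflect the data rather than the query.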


Here is a second example, which shows that it may be possible to collect solid supporting evidence for a schema from a relatively small corpus, but that a much larger corpus may be required to check for potential counter examples. The verb 'UNDERGO' typically occurs in a simple semantic schema in which people or things "involuntarily UNDERGO something serious and unpleasant". If the subject of the verb is a human individual, then the most frequent object noun phrase is some medical procedure, and the most frequent individual noun collocate is 'surgery'. Here are a few examples:

                     is to undergo a historic transformation
                   did not undergo a major metamorphosis until
            asking them to undergo a medical examination
            is expected to undergo a psychiatric examination
                    had to undergo a stringent medical examination
           being forced to undergo an Achilles tendon operation
           are required to undergo an "eyescan" before being allowed
                    had to undergo brain surgery
               is about to undergo dramatic changes
              scheduled to undergo his eighth open heart surgery
               continue to undergo major cutbacks
                 forced to undergo random drug testing


Again, at least informally, it is easy to state the prototypical semantic schema:

           - involuntary -----------------------------------------------
           ----------------------- serious --- unpleasant --------------

           often                   often       usually
           PASSIVE or              ADJECTIVE   MEDICAL PROCEDURE,
           MODAL                               TESTING, CHANGE, etc

           forced to     undergo   further     surgery etc
           required to             extensive   medical or other testing
           had to                  major       training
           must                    severe      change
                                   etc         a trauma etc

This prototype is a hypothesis about the most typical uses of 'UNDERGO'. (In this case, rather exceptionally, all the forms of the lemma show very similar collocates.) This hypothesis is a prediction that we will find similar examples in other independent corpora. However, this is the easy part: any reasonably large corpus will provide dozens of further examples which support the hypothesis, but this would not tell us anything new. A hypothesis is also a prediction that we will not find any counter examples. Every good theory is a prohibition: a claim that certain things will not happen. So, the hypothesis has to be tested by looking deliberately for counter examples which would lead us to reject or modify the hypothesis. If we found them, we would learn something new (Popper 1963).

For example, do we find occurrences of the phrase 'willingly UNDERGO'? This might provide counter examples to the claim that the discourse prosody includes the unit "involuntary". Again, smaller text collections (of up to 100 million words) are of no help. There was only one example each in the TIMES and COBUILD data, and none at all in the BNC.


However, ALLTHEWEB provided 176 examples, such as

So, the phrase does occur, but it seems not to provide counter examples to the hypothesized schema for 'UNDERGO', and indeed provides evidence of a related prosody. The phrase often occurs in negatives, in questions or in hypothetical statements. The speaker is saying that they would not personally want to undergo some unpleasant experience. Alternatively they are expressing incredulity or admiration that someone could willingly undergo some unpleasant experience or sacrifice. The context is sometimes medical or military, but most frequently religious.

The phrase has a clear discourse prosody: "someone willingly undergoes a sacrifice for the sake of someone else". Examples include reports of parents who willingly undergo sacrifices for the sake of their children. The phrase 'for the sake of' occurs in several examples. So do related phrases such as 'to be able to', 'as evidence of', and 'as a sign that'. Co-occurring vocabulary which implies sacrifice (often religious) includes: 'martyrdom', 'atonement', 'a small price to pay', 'deprivation', 'the burden laid on them'.

I also found very similar patterns around the phrase 'cheerfully UNDERGO', where examples included:

The method proposed here is that of conjectures and refutations (Popper 1963): formulate a hypothesis, collect supporting evidence, then search for potential counter examples. Consider if they are genuine counter examples. If yes: reformulate the hypothesis. If no: keep searching!
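The search for potential counter examples can be partly mechanized, although the decision whether a hit is a genuine counter example still has to be made by reading it in context. The sketch below retrieves sentences containing 'willingly UNDERGO' and flags those with a negative, a question or a hypothetical marker. The cue lists are crude, hand-picked assumptions, the names `CUES` and `classify_hits` are hypothetical, and the sample sentences are invented.

```python
import re

UNDERGO = r"(?:undergo|undergoes|underwent|undergone|undergoing)"
HIT = re.compile(rf"\bwillingly\s+{UNDERGO}\b", re.I)

# Crude, hand-picked cues (an assumption for illustration, not an exhaustive list).
CUES = {
    "negative": re.compile(r"\b(?:not|never|no one|nobody)\b|n't", re.I),
    "question": re.compile(r"\?"),
    "hypothetical": re.compile(r"\b(?:would|could|might|if)\b", re.I),
}

def classify_hits(text):
    """Yield (sentence, labels) for every sentence containing the target phrase."""
    # Naive sentence split on ., ? or ! followed by whitespace.
    for sentence in re.split(r"(?<=[.?!])\s+", text):
        if HIT.search(sentence):
            labels = [name for name, cue in CUES.items() if cue.search(sentence)]
            yield sentence, labels

# Invented sample sentences (not corpus data).
sample = (
    "Who would willingly undergo such an ordeal? "
    "Parents willingly undergo sacrifices for the sake of their children. "
    "I would never willingly undergo surgery."
)
for sentence, labels in classify_hits(sample):
    print(labels or ["unflagged: read in context"], sentence)
```

A sentence with no flag is not thereby a counter example: it is simply one of the hits which most needs to be inspected in its full co-text.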


In the case of 'willingly' and 'cheerfully UNDERGO' the pattern is clear, but only a very large text collection can provide the 40 or 50 examples which would be the minimum number required to study it. The largest text collection currently available is the World Wide Web itself, and one search engine claims (December 1999) to index 200 million documents. Clearly the Web is not a corpus: it has obviously not been designed on linguistic principles (it hasn't been designed at all). So, the question is: if it is used, opportunistically, as a text collection (from which virtual sub-corpora can be formed to answer specific questions), can one draw valid conclusions from it about general language use?

The Web certainly has potential disadvantages as linguistic data. Many documents occur more than once (though this is also true of many corpora; and if a document is stored at different addresses, perhaps its language should be weighted accordingly). It is very largely written data (though there are transcribed versions of public news statements and the like). Not all documents are written by native English speakers (though this seems not to affect the examples cited above). And we have no real idea of what proportions it contains of different text-types, and no real idea how many running words it contains (though it might be possible to estimate this at least roughly by sampling word frequencies).
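One way in which such a rough estimate might be made is sketched below. It rests on strong assumptions: that the sampled words have about the same relative frequency in the collection as in a reference corpus of known size, and that counts of running words (not merely of documents) can be obtained from the collection. The function name `estimate_size` is hypothetical, and all the numbers are invented for illustration: they are not measurements from any corpus or search engine.

```python
from statistics import median

def estimate_size(reference_counts, reference_size, collection_counts):
    """Estimate the running words in a collection from a few sampled words.

    Each word gives one estimate: its count in the collection divided by its
    relative frequency in a reference corpus of known size. The median of the
    estimates is returned, so that one atypical word does not dominate.
    """
    estimates = []
    for word, ref_count in reference_counts.items():
        if ref_count == 0 or word not in collection_counts:
            continue  # no usable rate for this word
        # count / (ref_count / reference_size), rearranged to avoid tiny floats
        estimates.append(collection_counts[word] * reference_size / ref_count)
    return median(estimates)

# Invented numbers, for illustration only.
reference_counts = {"however": 600, "therefore": 250, "perhaps": 350}
reference_size = 1_000_000
collection_counts = {"however": 61_000, "therefore": 24_000, "perhaps": 35_000}

print(round(estimate_size(reference_counts, reference_size, collection_counts)))
```

If the individual estimates diverge widely, that is itself informative: it suggests that the collection differs in text-type from the reference corpus.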

On the other hand, the Web certainly has potential advantages as a text collection. It is very large, and growing. It is very mixed: it contains a wide selection of text-types, including material which is relatively rare in the designed corpora (e.g. many texts which are written, but not formally published, and therefore not professionally edited). And even if we have only a very rough idea of what is in the whole collection, any individual example (phrase, collocation, etc) can be studied in its full co-text. The Web has considerable potential as a source of temporary and virtual corpora to study particular patterns.


My examples so far have shown problems which arise when a corpus does not contain enough examples of a pattern. Conversely, a corpus may contain too many examples of a specific pattern, because it contains too many examples of a specialized text-type. Smadja (1993: 169) notes an extreme example in a study of an 8-million word corpus of Associated Press news-wire stories, mainly about the stock market. The word 'food' was frequent, but 'EAT' was not among its collocates, since food is not 'eaten' on Wall Street but rather 'traded', 'sold', 'offered', 'bought', and so on. In general, it is well known that words differ in their collocational behaviour in different text-types.

Here is a more complex example. I was interested in the different collocations of different word-forms of a single lemma. In COBUILD COLLOCATIONS (Cobuild 1995), I noticed that the different forms of the lemma 'SEEK' were used quite differently. In this data-base, the top 20 collocates of the form 'seeks', in descending frequency, are

These collocates are frequent due to the word-form occurring in lonely hearts ads, such as

The collocates of 'seeks' hardly overlap at all with the collocates of 'seek', 'seeking' and 'sought'. But these three forms have 6 shared collocates, largely from political and legal contexts, in the semantic field of "help and support":

These findings are not a statement about the whole language, but about the text-types sampled in the corpus used for the data-base. Obviously, if the corpus had contained no magazines with lonely hearts ads, then there would be no such examples of 'seeks'. Equally obviously, the corpus must have contained enough examples to make these collocations more frequent than other collocations. It is therefore not surprising that different corpora show different patterns. The BNC contains rather different examples, from other personal adverts ('guitarist seeks working band') and from newspaper headlines ('Microsoft seeks partners'): these uses share the need to use short words. But there are also other examples from formal, including legal, texts, such as

The relatively formal language of the TIMES confirmed these findings: there were over 500 occurrences of 'seeks'.

The principles here are as follows. (1) If a corpus contains too many examples of a specialized text-type, this may give a misleadingly narrow view of the uses of the target word. (2) This type of bias will be increased by a method which looks at only the top 20 collocates of a word. Collocates further down the list may signal other uses.

(3) If a corpus claims to represent general usage, it should not contain too many examples of texts with a high percentage of unusual features, such as lonely hearts ads, knitting patterns, and weather forecasts. Such texts are not in daily use by substantial numbers of native speakers (Sinclair 1995: 24), and indeed many native speakers cannot fully understand their abbreviated lexis and unusual syntax.
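The comparison of collocates across the word-forms of a lemma, and the effect of a top-n cut-off, can be sketched as follows. This is not the procedure used to build the COBUILD data-base: the span of four words on either side, the raw frequency ranking, and the function names are assumptions, and the toy text is invented.

```python
import re
from collections import Counter

def top_collocates(text, node, span=4, n=20):
    """Return the n most frequent words within `span` tokens of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Window of `span` tokens on each side; it may cross sentence boundaries.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return [word for word, _ in counts.most_common(n)]

def shared_collocates(text, forms, **kwargs):
    """Collocates which appear in the top-n list of every given word-form."""
    sets = [set(top_collocates(text, form, **kwargs)) for form in forms]
    return set.intersection(*sets)

# Invented toy text (not corpus data).
toy = (
    "Attractive male seeks female for friendship. "
    "Officials seek support from allies. "
    "The minister is seeking support and advice. "
    "They sought support in court."
)
print(shared_collocates(toy, ["seek", "seeking", "sought"]))
print(top_collocates(toy, "seeks"))
```

Raising `n` above 20 is a simple way of checking whether collocates further down the list signal other uses of the word-form.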


These brief examples confirm the two principles which I proposed at the beginning of this note:

(1) Reliance on any single corpus is risky. It is best to combine: largish general corpora designed according to a sociolinguistic theory of text-type variation, small specialist corpora put together (possibly temporarily) for particular knowledge domains or text-types, and very large opportunistic text collections.

(2) Data collection should be sensitive to the types of patterns which corpus studies have shown to be characteristic of language in use. These patterns are semantic schemas which reside in an irreducible layer of organization between lexis and grammar. The prototypes of these schemas are simple and can sometimes be discovered with relatively small corpora (where relatively small = millions of running words). But to study their lexical variability, or to study the discourse prosodies around less frequent phrases, much larger text collections may be necessary.


This paper is a short footnote to work by John Sinclair, who argues that corpora should, in principle, be as large as is feasible, and who shows, with many examples, that large corpora are often required to study the relations between lexis and syntax. I am grateful to Christine Spies for help with data collection and analysis. A version of this paper was presented to the Colloquium on Multilingual Corpora at the University of the Saarland, Saarbrücken, Germany, 21 January 2000: I am grateful to Erich Steiner and his colleagues for discussion on that occasion.


Cobuild (1995) Collins COBUILD English Collocations on CD-ROM. London: HarperCollins.

Fillmore, C. J. (1997) Lectures on construction grammar. Available on-line at (Accessed 11 May 1999.)

Moon, R. (1998) Fixed Expressions and Idioms in English: A Corpus-Based Approach. Oxford: Clarendon.

Popper, K. R. (1963) Conjectures and Refutations. London: Routledge & Kegan Paul.

Sinclair, J. (1995) Corpus typology: a framework for classification. In G. Melchers & B. Warren eds Studies in Anglistics. Stockholm: Almqvist & Wiksell. 17-33.

Sinclair, J. (1998) The lexical item. In E. Weigand ed Contrastive Lexical Semantics. Amsterdam: Benjamins. 1-24.

Smadja, F. (1993) Retrieving collocations from text: Xtract. Computational Linguistics, 19, 1. Also in Armstrong, S. ed. (1994) Using Large Corpora. Cambridge, Ma: MIT Press. 143-77.

Teubert, W. (1999) Corpus linguistics: a partisan view. Available on-line at (Accessed 24 November 1999.)

Weigand, E. (1998) Contrastive lexical semantics. In E. Weigand ed Contrastive Lexical Semantics. Amsterdam: Benjamins. 25-44.

Last up-dated March 2001.