© Copyright Michael Stubbs 2000.
Readers are welcome to print individual copies of this paper for private study. Reference should always be made to the original place of publication which is:
C Heffer & H Sauntson eds (2000) Words in Context: A Tribute to John Sinclair on his Retirement. English Language Research Discourse Analysis Monograph 18. University of Birmingham. [CD-ROM.]
FB2 Anglistik, Universität Trier, D-54286 Trier, Germany
This paper illustrates two principles to bear in mind when drawing conclusions from corpus data. (1) It is unwise to rely on a single corpus, however large or well designed it might be: all corpora have in-built biases, and findings should therefore be checked in different independent corpora. A subsidiary point is that it is sometimes necessary to check findings in very large text collections. (2) A method of study should be appropriate to its object of study. Here, this means that methods should be sensitive to the main broad finding of corpus linguistics, that there is a layer of organization between lexis and syntax, which is variously described in terms of lexico-grammar, extended lexical units (Sinclair 1998), idiom schemas (Moon 1998), and semantic coagulations (Teubert 1999). A related finding is that these schemas often have evaluative connotations. Corpus methods must be sensitive to this pragmatic turn (Weigand 1998) in work on lexis.
I will illustrate these two principles with examples from the following data-base, corpora and text collections:
Several terms for different kinds of corpus have become fairly standard (Sinclair 1995). The only distinction which I need here is between a corpus and a text collection. A corpus has been designed on given criteria for linguistic research. Usually these criteria are external and sociolinguistic, and based on a theory of text-types. Examples of corpora are COBUILD and BNC. A text collection has not been designed for any linguistic purpose. It has been put together, or has simply accumulated, for some independent reason, and might be used, opportunistically, for linguistic purposes. An example is TIMES. (Newspapers on CD-ROM may constitute a very good, homogeneous sample of specific text-types.) The largest available text collection is the World Wide Web itself.
Here then are three examples which illustrate the two principles.
For example, as noted briefly by Fillmore (1997), the phrase 'ripe old age' usually has very positive connotations. It occurs in a longer schema: "people want to live to a ripe old age" and/or "people are admired for living to a ripe old age". Attested examples of such uses are:
- hoping to live to a ripe old age
- enjoying good health to a ripe old age
- if you expect to live to a ripe old age
- stand a better chance of living to a ripe old age
- living to the ripe old age of 70 years
- reached the ripe old age of 75
- until his death at the ripe old age of 87
The admiration (and sometimes slight envy?) is signalled by the lexis in examples such as
- it is a major triumph of the 20th century that many more people survive to a ripe old age
- he survived the perils of infancy to live to the ripe old age of 74
A general model which is useful in showing the construction of such patterns is proposed by Sinclair (1998). Words occur in extended lexical units which comprise a configuration of characteristic items from
Thus, the phrase 'ripe old age' frequently occurs with the individual collocates 'LIVE to' or 'REACH', but also with other semantically related verbs such as 'ATTAIN', 'SURVIVE to', 'GO on to'. There are preferred colligations: often a preceding verb plus preposition plus determiner. There are semantic preferences: often verbs such as 'ASPIRE', 'HOPE', 'INTEND', 'STRIVE' and 'WANT', and/or words concerning dangers and risks, such as 'death', 'maximum life-span', and 'perils of infancy'. Other vocabulary may explicitly express the positive discourse prosody and the speaker's admiration for the achievement: reaching a 'ripe old age' is a good thing to do.
It is easy to state the prototypical semantic pattern, but impossible to list all the variant phrases. First, the collocates above are only some possibilities. Second, there are further variants of the core phrase. The only obligatory word is 'age'. The phrase 'ripe old age' is frequent, but 'ripe age', 'grand (old) age' and 'good (old) age' also occur. Therefore a very large text collection may be required to show the relative frequency of variants, since even if words are individually quite frequent, collocations of these words may drop to zero in corpora as large as 100-million words. Consider the frequencies of related phrases in four sets of material:
                              TIMES  COBUILD  BNC  ALLTHEWEB
ripe old age                     15       11   33       7965
good old age                      1        1    7       1593
grand old age                     5        2    7        541
LIVE to a/the ripe old age        4        1   12       1791
              good old age        1        1    1        250
              grand old age       0        0    0         40
REACH a/the ripe old age          0        1    3        454
            good old age          0        0    0         19
            grand old age         1        0    2         84

Only ALLTHEWEB provides enough examples to show with confidence that all these variants do occur, and which are more or less frequent (and these variants are only a small sample of the possibilities).
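Counts of the kind reported above could be produced along the following lines. This is only a minimal sketch, assuming that each corpus or text collection is available as a plain-text string. The names `VARIANT`, `count_variants`, `compare_sources` and `toy_sources` are hypothetical, the toy sentences are invented, and the pattern covers only the adjective variants named in the text, not the preceding verbs.

```python
import re
from collections import Counter

# Variants named in the text: 'ripe', 'good' or 'grand', optionally followed by 'old'.
VARIANT = re.compile(r"\b(ripe|good|grand)\s+(?:old\s+)?age\b", re.IGNORECASE)

def count_variants(text):
    """Count each variant of the core phrase in one body of text."""
    counts = Counter()
    for match in VARIANT.finditer(text):
        # Normalize case and internal whitespace so that variants group together.
        counts[" ".join(match.group(0).lower().split())] += 1
    return counts

def compare_sources(sources):
    """Map each source name to its own variant counts."""
    return {name: count_variants(text) for name, text in sources.items()}

# Toy stand-ins for a small corpus and a large text collection (invented data).
toy_sources = {
    "SMALL": "He hoped to live to a ripe old age.",
    "LARGE": "She reached the ripe old age of 75. He died at a good old age. "
             "They lived to a grand old age. Few reach such a ripe age.",
}

for name, counts in compare_sources(toy_sources).items():
    print(name, dict(counts))
```

Even in this toy case the smaller source attests only one variant, which is the point made about corpus size: a zero count in a small sample is not evidence that a variant does not occur.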
- is to undergo a historic transformation
- did not undergo a major metamorphosis until
- asking them to undergo a medical examination
- is expected to undergo a psychiatric examination
- had to undergo a stringent medical examination
- being forced to undergo an Achilles tendon operation
- are required to undergo an "eyescan" before being allowed
- had to undergo brain surgery
- is about to undergo dramatic changes
- scheduled to undergo his eighth open heart surgery
- continue to undergo major cutbacks
- forced to undergo random drug testing
Again, at least informally, it is easy to state the prototypical semantic schema:
involuntary                    serious      unpleasant
often                          often        usually
PASSIVE or                     ADJECTIVE    MEDICAL PROCEDURE,
MODAL                                       TESTING, CHANGE, etc

forced to        undergo       further      surgery etc
required to                    extensive    medical or other testing
had to                         major        training
must                           severe       change etc
                               a            trauma etc
This prototype is a hypothesis about the most typical uses of 'UNDERGO'. (In this case, rather exceptionally, all the forms of the lemma show very similar collocates.) This hypothesis is a prediction that we will find similar examples in other independent corpora. However, this is the easy part: any reasonably large corpus will provide dozens of further examples which support the hypothesis, but this would not tell us anything new. A hypothesis is also a prediction that we will not find any counter examples. Every good theory is a prohibition: a claim that certain things will not happen. So, the hypothesis has to be tested by looking deliberately for counter examples which would lead us to reject or modify the hypothesis. If we find them, we would learn something new. (Popper 1963.)
For example, do we find occurrences of the phrase 'willingly UNDERGO'? This might provide counter examples to the claim that the discourse prosody includes the unit "involuntary". Again, smaller text collections (of up to 100 million words) are of no help. There was only one example each in the TIMES and COBUILD data, and none at all in the BNC.
- no-one, short of a severely psychotic masochist, would willingly undergo what she went through
- there is no way I'd willingly undergo a procedure that carries that risk with it
- why did he willingly undergo forty years of hardship?
- why willingly undergo the dangers and tortures of such a struggle, risk life itself?
- it is indeed difficult to understand how a thoughtful writer can willingly undergo the throes and agonies of ...
- one can willingly undergo some painful experience for one who is dearly loved
- Christ took upon him all the sins of the world and willingly underwent that grief of heart ...
- sufferings and dangers the early Christian willingly underwent for the sake of ...
- yea, he patiently suffers and willingly undergoes afflictions
So, the phrase does occur, but it seems not to provide counter examples to the hypothesized schema for 'UNDERGO', and indeed provides evidence of a related prosody. The phrase often occurs in negatives, in questions or in hypothetical statements. The speaker is saying that they would not personally want to undergo some unpleasant experience. Alternatively they are expressing incredulity or admiration that someone could willingly undergo some unpleasant experience or sacrifice. The context is sometimes medical or military, but most frequently religious.
The phrase has a clear discourse prosody: "someone willingly undergoes a sacrifice for the sake of someone else". Examples include reports of parents who willingly undergo sacrifices for the sake of their children. The phrase 'for the sake of' occurs in several examples. So do related phrases such as 'to be able to', 'as evidence of', and 'as a sign that'. Co-occurring vocabulary which implies sacrifice (often religious) includes: 'martyrdom', 'atonement', 'a small price to pay', 'deprivation', 'the burden laid on them'.
I also found very similar patterns around the phrase 'cheerfully UNDERGO', where examples included:
- what they cheerfully underwent for the sake of His Gospel
- cheerfully undergoing it for the sake of the country
- martyrs for Christ have cheerfully undergone extreme tortures
The method proposed here is that of conjectures and refutations (Popper 1963): formulate a hypothesis, collect supporting evidence, then search for potential counter examples. Consider if they are genuine counter examples. If yes: reformulate the hypothesis. If no: keep searching!
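The search step in this method can be sketched as a simple concordance query. The following is a minimal illustration, assuming the text to be searched is a plain string. The names `UNDERGO_FORMS` and `kwic` are hypothetical, and the sample text merely strings together three of the attested examples cited above. The code only retrieves candidate lines: deciding whether a hit is a genuine counter example remains a matter of reading each line in context.

```python
import re

# Word-forms of the lemma UNDERGO.
UNDERGO_FORMS = r"(?:undergo|undergoes|undergoing|underwent|undergone)"

def kwic(text, adverb, width=40):
    """Return (left context, match, right context) for '<adverb> UNDERGO'."""
    pattern = re.compile(
        r"\b%s\s+%s\b" % (re.escape(adverb), UNDERGO_FORMS), re.IGNORECASE
    )
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()].strip()
        right = text[match.end():match.end() + width].strip()
        lines.append((left, match.group(0), right))
    return lines

# Sample text built from examples cited in this paper.
sample = (
    "why did he willingly undergo forty years of hardship? "
    "martyrs for Christ have cheerfully undergone extreme tortures. "
    "yea, he patiently suffers and willingly undergoes afflictions."
)

for adverb in ("willingly", "cheerfully", "gladly"):
    hits = kwic(sample, adverb)
    print(adverb, len(hits))
    for left, node, right in hits:
        print("   ", left, "[", node, "]", right)
```

An adverb which returns no hits in a small sample (such as 'gladly' here) shows only that the sample is too small, not that the hypothesis has survived the test.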
The Web certainly has potential disadvantages as linguistic data. Many documents occur more than once (though this is also true of many corpora; and if a document is stored at different addresses, perhaps its language should be weighted accordingly). It is very largely written data (though there are transcribed versions of public news statements and the like). Not all documents are written by native English speakers (though this seems not to affect the examples cited above). And we have no real idea of what proportions it contains of different text-types, and no real idea how many running words it contains (though it might be possible to estimate this at least roughly by sampling word frequencies).
On the other hand, the Web certainly has potential advantages as a text collection. It is extremely large, and growing. It is very mixed: it contains a wide selection of text-types, including material which is relatively rare in the designed corpora (e.g. many texts which are written, but not formally published, and therefore not professionally edited). And even if we have only a very rough idea of what is in the whole collection, any individual example (phrase, collocation, etc) can be studied in its full co-text. The Web has considerable potential as a source of temporary and virtual corpora to study particular patterns.
Here is a more complex example. I was interested in the different collocations of different word-forms of a single lemma. In COBUILD COLLOCATIONS (Cobuild 1995), I noticed that the different forms of the lemma 'SEEK' were used quite differently. In this data-base, the top 20 collocates of the form 'seeks', in descending frequency, are
- <female, black, male, attractive, similar, guy, lady, man, caring, professional, slim, intelligent, worldwide, friends, lesbian, woman, sincere, honest, good, non>
These collocates are frequent due to the word-form occurring in lonely hearts ads, such as
- female 31, single, seeks well educated gentleman
The collocates of 'seeks' hardly overlap at all with the collocates of 'seek', 'seeking' and 'sought'. But these three forms have six shared collocates, largely from political and legal contexts, in the semantic field of "help and support":
- <asylum, court, government, help, political, support>
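A comparison of this kind, between the top collocates of different word-forms, can be sketched as follows. This is a hedged illustration and not the procedure used to build the data-base: it ranks collocates by raw co-occurrence frequency within a fixed span, and the function names, the span and the toy text are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            counts.update(tokens[max(0, i - span):i])   # left context
            counts.update(tokens[i + 1:i + 1 + span])   # right context
    return counts

def top_collocates(tokens, node, n=20, span=4):
    """Return the set of the n most frequent collocates of one word-form."""
    return {word for word, _ in collocates(tokens, node, span).most_common(n)}

def shared_top_collocates(tokens, forms, n=20, span=4):
    """Return the collocates which appear in the top n for every word-form."""
    return set.intersection(*(top_collocates(tokens, f, n, span) for f in forms))

# Invented toy text; a span of 1 keeps the windows from overlapping.
toy = tokenize(
    "refugees seek asylum here . many were seeking asylum abroad . "
    "he sought asylum too . female seeks male"
)
print(shared_top_collocates(toy, ["seek", "seeking", "sought"], span=1))
print(top_collocates(toy, "seeks", span=1))
```

Raising `n` beyond 20 is one simple way to check whether collocates further down the list signal uses which the top of the list conceals.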
These findings are not a statement about the whole language, but about the text-types sampled in the corpus used for the data-base. Obviously, if the corpus had contained no magazines with lonely hearts ads, then there would be no such examples of 'seeks'. Equally obviously, the corpus must have contained enough examples to make these collocations more frequent than other collocations. It is therefore not surprising that different corpora show different patterns. The BNC contains rather different examples, from other personal adverts ('guitarist seeks working band') and from newspaper headlines ('Microsoft seeks partners'): these uses share the need to use short words. But there are also other examples from formal, including legal, texts, such as
- where a buyer seeks to reject goods supplied under a sale contract ...
- in his Symphonic Etudes, he consciously seeks an orchestral sonority
The relatively formal language of the TIMES confirmed these findings: there were over 500 occurrences of 'seeks'.
The principles here are as follows. (1) If a corpus contains too many examples of a specialized text-type, this may give a misleadingly narrow view of the uses of the target word. (2) This type of bias will be increased by a method which looks at only the top 20 collocates of a word. Collocates further down the list may signal other uses.
(3) If a corpus claims to represent general usage, it should not contain too many examples of texts with a high percentage of unusual features, such as lonely hearts ads, knitting patterns, and weather forecasts. Such texts are not in daily use by substantial numbers of native speakers (Sinclair 1995: 24), and indeed many native speakers cannot fully understand their abbreviated lexis and unusual syntax.
(1) Reliance on any single corpus is risky. It is best to combine: largish general corpora designed according to a sociolinguistic theory of text-type variation, small specialist corpora put together (possibly temporarily) for particular knowledge domains or text-types, and very large opportunistic text collections.
(2) Data collection should be sensitive to the types of patterns which corpus studies have shown to be characteristic of language in use. These patterns are semantic schemas which reside in an irreducible layer of organization between lexis and grammar. The prototypes of these schemas are simple and can sometimes be discovered with relatively small corpora (where relatively small = millions of running words). But to study their lexical variability, or to study the discourse prosodies around less frequent phrases, much larger text collections may be necessary.
Fillmore, C. J. (1997) Lectures on construction grammar. Available on-line at http://www.icsi.berkeley.edu/~kay/bcg/lec02.html. (Accessed 11 May 1999.)
Moon, R. (1998) Fixed Expressions and Idioms in English: A Corpus-Based Approach. Oxford: Clarendon.
Popper, K. R. (1963) Conjectures and Refutations. London: Routledge & Kegan Paul.
Sinclair, J. (1995) Corpus typology: a framework for classification. In G. Melchers & B. Warren eds Studies in Anglistics. Stockholm: Almqvist & Wiksell. 17-33.
Sinclair, J. (1998) The lexical item. In E. Weigand ed Contrastive Lexical Semantics. Amsterdam: Benjamins. 1-24.
Smadja, F. (1993) Retrieving collocations from text: Xtract. Computational Linguistics, 19, 1. Also in S. Armstrong ed (1994) Using Large Corpora. Cambridge, MA: MIT Press. 143-77.
Teubert, W. (1999) Corpus linguistics: a partisan view. Available on-line at http://solaris3.ids-mannheim.de/ijcl/teubert_cl.html. (Accessed 24 November 1999.)
Weigand, E. (1998) Contrastive lexical semantics. In E. Weigand ed Contrastive Lexical Semantics. Amsterdam: Benjamins. 25-44.