Characteristics of Text

Most large collections of text documents have similar statistical characteristics. It is helpful to know about these statistics, because:

Patterns of Term Occurrences

If the terms in a collection are ranked (r) by decreasing frequency (f), they roughly fit the relation r_t * f_t = C, which is known as "Zipf's law". Different collections have different constants C, but in English text, C tends to be about N / 10, where N is the number of term occurrences in the collection.

p_r = f_r / N is the probability that a randomly chosen term occurrence is the term of rank r (which has frequency f_r).

Dividing both sides of Zipf's law by N gives r * p_r = A, where A = C / N tends to be about 0.1 in English text.
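
As a rough illustration of the two relations above, the sketch below ranks terms by decreasing frequency and computes r * p_r for each one. The function name rank_probability_products and the dictionary input format are assumptions made for this example; the three frequencies and the collection size are taken from the TIME statistics in these notes.

```python
def rank_probability_products(term_freqs, total_occurrences):
    """Return (term, rank, r * p_r) tuples, ranked by decreasing frequency.

    total_occurrences is N, the number of term occurrences in the whole
    collection (not just the terms passed in).
    """
    ranked = sorted(term_freqs.items(), key=lambda item: item[1], reverse=True)
    results = []
    for rank, (term, freq) in enumerate(ranked, start=1):
        p_r = freq / total_occurrences  # probability of the rank-r term
        results.append((term, rank, rank * p_r))
    return results


# Three most frequent terms of the TIME collection (N = 245,412).
time_top = {"the": 15861, "of": 7239, "to": 6331}
for term, rank, product in rank_probability_products(time_top, 245412):
    print(f"{term:5s} rank={rank} r*p_r={product:.3f}")
```

For a collection that follows Zipf's law closely, the products should stay near the constant A; in practice the highest-ranked terms tend to fall somewhat below it.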

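The following statistics are from the TIME collection, a 1.6 MB collection of 423 short TIME magazine articles (245,412 term occurrences). For each of the top 50 terms, f_t is the term's frequency and r_t * p_r is its rank multiplied by its probability: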

Word f_t r_t * p_r Word f_t r_t * p_r Word f_t r_t * p_r
the 15861 0.065 it 1290 0.095 week 793 0.113
of 7239 0.059 from 1228 0.095 they 697 0.102
to 6331 0.077 but 1138 0.093 govern 687 0.104
a 5878 0.096 u 955 0.082 all 672 0.104
and 5614 0.114 had 940 0.084 year 672 0.107
in 5294 0.129 last 930 0.087 its 620 0.101
that 2507 0.072 be 915 0.089 britain 589 0.098
for 2228 0.073 have 914 0.093 when 579 0.099
was 2149 0.079 who 894 0.095 out 577 0.101
with 1839 0.075 not 882 0.097 would 577 0.103
his 1815 0.081 has 880 0.100 new 572 0.105
is 1810 0.089 an 873 0.103 up 559 0.105
he 1700 0.090 s 865 0.106 been 554 0.106
as 1581 0.090 were 848 0.107 more 540 0.106
on 1551 0.095 their 815 0.106 which 539 0.108
by 1467 0.096 are 812 0.109 into 518 0.106
at 1333 0.092 one 811 0.112

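The following statistics are from the WSJ87 collection, a 131.6 MB collection of 46,449 newspaper articles (19 million term occurrences). For each of the top 50 terms, f_t is the term's frequency and r_t * p_r is its rank multiplied by its probability: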

Word f_t r_t * p_r Word f_t r_t * p_r Word f_t r_t * p_r
the 1130021 0.059 from 96900 0.092 or 54958 0.101
of 547311 0.058 he 94585 0.095 about 53713 0.102
to 516635 0.082 million 93515 0.098 market 52110 0.101
a 464736 0.098 year 90104 0.100 they 51359 0.103
in 390819 0.103 its 86774 0.100 this 50933 0.105
and 387703 0.122 be 85588 0.104 would 50828 0.107
that 204351 0.075 was 83398 0.105 u 49281 0.106
for 199340 0.084 company 83070 0.109 which 48273 0.107
is 152483 0.072 an 76974 0.105 bank 47940 0.109
said 148302 0.078 has 74405 0.106 stock 47401 0.110
it 134323 0.078 are 74097 0.109 trade 47310 0.112
on 121173 0.077 have 73132 0.112 his 47116 0.114
by 118863 0.081 but 71887 0.114 more 46244 0.114
as 109135 0.080 will 71494 0.117 who 42142 0.106
at 101779 0.080 say 66807 0.113 one 41635 0.107
mr 101679 0.086 new 64456 0.112 their 40910 0.108
with 101210 0.091 share 63925 0.114

Zipf's law implies that a term with f_t occurrences has rank approximately A * N / f_t.
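
A minimal sketch of this rank estimate follows. The function name predicted_rank and the default A = 0.1 are assumptions for illustration; the example values come from the TIME table in these notes.

```python
def predicted_rank(freq, total_occurrences, a=0.1):
    """Approximate rank of a term with `freq` occurrences: A * N / f_t."""
    if freq <= 0:
        raise ValueError("freq must be positive")
    return a * total_occurrences / freq


# "week" occurs 793 times in the TIME collection (N = 245,412).
print(round(predicted_rank(793, 245412), 1))
```

The estimate comes out near 31, while the table lists "week" at rank 35, so the relation is only a rough guide for any individual term.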

Often, several terms will have the same frequency. If the rank r_f is assigned to the last term of a group sharing the frequency f, then there are r_f terms that occur at least f times, and r_{f+1} terms that occur at least f+1 times. The number of terms that occur exactly f times is therefore: I_f = r_f - r_{f+1} = AN/f - AN/(f+1) = AN/(f(f+1))
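
The formula for I_f can be checked numerically against the difference of the two ranks it is derived from. This is a sketch under the assumption A = 0.1; the function names are hypothetical, and N is the TIME collection size quoted earlier.

```python
def rank_of_last_term(f, total_occurrences, a=0.1):
    """Rank of the last term with frequency f (terms occurring >= f times)."""
    return a * total_occurrences / f


def terms_with_frequency(f, total_occurrences, a=0.1):
    """Estimated number of terms occurring exactly f times: A*N / (f*(f+1))."""
    if f < 1:
        raise ValueError("f must be at least 1")
    return a * total_occurrences / (f * (f + 1))


n = 245412  # term occurrences in the TIME collection
for f in (1, 2, 3):
    direct = terms_with_frequency(f, n)
    by_difference = rank_of_last_term(f, n) - rank_of_last_term(f + 1, n)
    print(f, round(direct), round(by_difference))
```

Both columns should agree for every f, since the closed form is just the difference r_f - r_{f+1} simplified.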

How many unique terms are in the collection? If there is at least one term that occurs only once, then by Zipf's law the last such term has rank r_1 = AN / 1, so the collection contains about AN unique terms.

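How many terms occur just once in the collection? I_1 = AN / 2, or 50% of the unique terms.
How many terms occur just twice in the collection? I_2 = AN / 6, or about 17% of the unique terms.
How many terms occur just three times? I_3 = AN / 12, or about 8.3% of the unique terms.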
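
These percentages are shares of the roughly AN unique terms, so A and N cancel and only 1 / (f(f+1)) remains. The sketch below tabulates that share; the function name vocabulary_fraction is an assumption for this example.

```python
from fractions import Fraction


def vocabulary_fraction(f):
    """Share of unique terms occurring exactly f times: 1 / (f*(f+1))."""
    if f < 1:
        raise ValueError("f must be at least 1")
    return Fraction(1, f * (f + 1))


for f in (1, 2, 3):
    share = vocabulary_fraction(f)
    print(f"{f}: {share} = {float(share):.1%}")
```

Because the shares telescope, terms occurring at most k times account for k / (k+1) of the unique terms under this model; for example, terms occurring three times or fewer make up about three quarters of the vocabulary.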

© 1997, Jamie Callan. All rights reserved.
Last updated February 6, 1997