p_r = f_r / N is the probability that a randomly chosen term occurrence is the term of rank r, where f_r is the frequency of that term and N is the total number of term occurrences in the collection.
r * p_r = A, where A tends to be about 0.1 in English text.
Statistics from the TIME collection, a 1.6 MB collection of 423 short TIME magazine articles (245,412 term occurrences). Top 50 terms are:
Word | f_t | r_t * p_r | Word | f_t | r_t * p_r | Word | f_t | r_t * p_r |
---|---|---|---|---|---|---|---|---|
the | 15861 | 0.065 | it | 1290 | 0.095 | week | 793 | 0.113 |
of | 7239 | 0.059 | from | 1228 | 0.095 | they | 697 | 0.102 |
to | 6331 | 0.077 | but | 1138 | 0.093 | govern | 687 | 0.104 |
a | 5878 | 0.096 | u | 955 | 0.082 | all | 672 | 0.104 |
and | 5614 | 0.114 | had | 940 | 0.084 | year | 672 | 0.107 |
in | 5294 | 0.129 | last | 930 | 0.087 | its | 620 | 0.101 |
that | 2507 | 0.072 | be | 915 | 0.089 | britain | 589 | 0.098 |
for | 2228 | 0.073 | have | 914 | 0.093 | when | 579 | 0.099 |
was | 2149 | 0.079 | who | 894 | 0.095 | out | 577 | 0.101 |
with | 1839 | 0.075 | not | 882 | 0.097 | would | 577 | 0.103 |
his | 1815 | 0.081 | has | 880 | 0.100 | new | 572 | 0.105 |
is | 1810 | 0.089 | an | 873 | 0.103 | up | 559 | 0.105 |
he | 1700 | 0.090 | s | 865 | 0.106 | been | 554 | 0.106 |
as | 1581 | 0.090 | were | 848 | 0.107 | more | 540 | 0.106 |
on | 1551 | 0.095 | their | 815 | 0.106 | which | 539 | 0.108 |
by | 1467 | 0.096 | are | 812 | 0.109 | into | 518 | 0.106 |
at | 1333 | 0.092 | one | 811 | 0.112 |
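
The r * p_r column above can be recomputed directly from the raw frequencies. The following is a minimal sketch, assuming ranks are assigned by reading down the table columns; the function name `rank_times_probability` is introduced here purely for illustration.

```python
# Frequencies of the five most frequent TIME terms, copied from the table above.
N = 245_412  # total term occurrences in the TIME collection
top_terms = [("the", 15861), ("of", 7239), ("to", 6331), ("a", 5878), ("and", 5614)]


def rank_times_probability(frequencies, total):
    """Return r * p_r = r * f_r / N for each frequency; ranks start at 1."""
    return [rank * freq / total for rank, freq in enumerate(frequencies, start=1)]


products = rank_times_probability([freq for _, freq in top_terms], N)
for (word, _), value in zip(top_terms, products):
    print(f"{word:>4}  {value:.3f}")
```

The printed values agree with the table entries for these five terms (0.065, 0.059, 0.077, 0.096, 0.114), all of them in the neighbourhood of A = 0.1.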
Statistics from the WSJ87 collection, a 131.6 MB collection of 46,449 newspaper articles (19 million term occurrences). Top 50 terms are:
Word | f_t | r_t * p_r | Word | f_t | r_t * p_r | Word | f_t | r_t * p_r |
---|---|---|---|---|---|---|---|---|
the | 1130021 | 0.059 | from | 96900 | 0.092 | or | 54958 | 0.101 |
of | 547311 | 0.058 | he | 94585 | 0.095 | about | 53713 | 0.102 |
to | 516635 | 0.082 | million | 93515 | 0.098 | market | 52110 | 0.101 |
a | 464736 | 0.098 | year | 90104 | 0.100 | they | 51359 | 0.103 |
in | 390819 | 0.103 | its | 86774 | 0.100 | this | 50933 | 0.105 |
and | 387703 | 0.122 | be | 85588 | 0.104 | would | 50828 | 0.107 |
that | 204351 | 0.075 | was | 83398 | 0.105 | u | 49281 | 0.106 |
for | 199340 | 0.084 | company | 83070 | 0.109 | which | 48273 | 0.107 |
is | 152483 | 0.072 | an | 76974 | 0.105 | bank | 47940 | 0.109 |
said | 148302 | 0.078 | has | 74405 | 0.106 | stock | 47401 | 0.110 |
it | 134323 | 0.078 | are | 74097 | 0.109 | trade | 47310 | 0.112 |
on | 121173 | 0.077 | have | 73132 | 0.112 | his | 47116 | 0.114 |
by | 118863 | 0.081 | but | 71887 | 0.114 | more | 46244 | 0.114 |
as | 109135 | 0.080 | will | 71494 | 0.117 | who | 42142 | 0.106 |
at | 101779 | 0.080 | say | 66807 | 0.113 | one | 41635 | 0.107 |
mr | 101679 | 0.086 | new | 64456 | 0.112 | their | 40910 | 0.108 |
with | 101210 | 0.091 | share | 63925 | 0.114 |
Zipf's law implies that a term with f_t occurrences has rank approximately A * N / f_t.
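
As a quick sanity check of this rank estimate, the sketch below applies it to one term from the TIME table. It assumes A = 0.1, the approximate value quoted earlier for English text; `estimated_rank` is a name chosen for this example only.

```python
def estimated_rank(frequency, total_occurrences, a=0.1):
    """Zipf estimate of a term's rank: A * N / f_t (A = 0.1 is an assumed constant)."""
    return a * total_occurrences / frequency


N = 245_412  # total term occurrences in the TIME collection

# "when" occurs 579 times in TIME and is the 42nd entry reading down the table columns.
print(round(estimated_rank(579, N), 1))
```

The estimate comes out at roughly 42.4, close to the observed position of 42. Agreement is looser for the very top ranks, where r * p_r in the tables strays further from 0.1.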
Often, several terms will have the same frequency. If the rank r_f is assigned to the last term of the group of terms that share frequency f, then there are r_f terms that occur at least f times, and r_{f+1} terms that occur at least f+1 times. The number of terms that occur exactly f times is therefore:
I_f = r_f - r_{f+1} = AN/f - AN/(f+1) = AN/(f(f+1))
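
The algebra behind I_f can be checked with exact rational arithmetic. This is only an illustrative sketch: the value used for A * N is hypothetical, and the helper names `rank_of_frequency` and `terms_with_frequency` are not from the source.

```python
from fractions import Fraction


def rank_of_frequency(f, a_times_n):
    """Zipf estimate of r_f, the number of terms occurring at least f times."""
    return Fraction(a_times_n, f)


def terms_with_frequency(f, a_times_n):
    """I_f = r_f - r_{f+1}, the number of terms occurring exactly f times."""
    return rank_of_frequency(f, a_times_n) - rank_of_frequency(f + 1, a_times_n)


AN = 24_000  # hypothetical value of A * N, chosen only for illustration

for f in range(1, 6):
    # The difference of the two rank estimates equals the closed form AN / (f(f+1)).
    assert terms_with_frequency(f, AN) == Fraction(AN, f * (f + 1))
    print(f, terms_with_frequency(f, AN))
```

Using `Fraction` avoids floating-point rounding, so the equality with the closed form holds exactly for every f tested.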
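How many unique terms are in the collection? If at least one term occurs only once, the highest rank is r_1, so by Zipf's law the number of distinct terms is r_1 = AN / 1 = AN.
How many terms occur just once in the collection? I_1 = AN / 2, i.e. 50% of the distinct terms.
How many terms occur just twice in the collection? I_2 = AN / 6, i.e. about 17% of the distinct terms.
How many terms occur just three times? I_3 = AN / 12, i.e. about 8.3% of the distinct terms.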
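
These percentages follow from dividing I_f = AN/(f(f+1)) by the vocabulary size AN, which leaves 1/(f(f+1)) regardless of A and N. A small sketch, with `vocabulary_fraction` as an illustrative name:

```python
def vocabulary_fraction(f):
    """Predicted share of distinct terms occurring exactly f times: 1 / (f(f+1))."""
    return 1 / (f * (f + 1))


for f in (1, 2, 3):
    print(f"{f}: {vocabulary_fraction(f):.1%}")
```

The three printed shares are 50.0%, 16.7% and 8.3%, matching the answers above. These are predictions of the Zipf model, not measurements from either collection.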