Characteristics of Text

Most large collections of text documents have similar statistical characteristics. It is helpful to know about these statistics because they influence the effectiveness and efficiency of the data structures used to index documents, and because many retrieval models rely on them.

Patterns of Term Occurrences

If the terms in a collection are ranked by decreasing frequency, so that term t has rank r_t and frequency f_t, they roughly fit the relation r_t * f_t = C, which is known as "Zipf's law". Different collections have different constants C, but in English text, C tends to be about N / 10, where N is the number of term occurrences in the collection.
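
As a rough illustration of the ranking step, the sketch below counts term occurrences in a short token list and assigns ranks by decreasing frequency. The function name `rank_frequency` and the toy input are assumptions made for this example. A real collection would also need tokenization and possibly stemming, which are not shown, and a sample this small is far too short to exhibit Zipf's law.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, frequency) tuples, most frequent term first."""
    counts = Counter(tokens)
    # most_common() sorts by decreasing count; ties keep first-seen order.
    return [(rank, term, freq)
            for rank, (term, freq) in enumerate(counts.most_common(), start=1)]

# Hypothetical toy input, used only to show the shape of the output.
tokens = "the cat sat on the mat and the dog sat".split()
table = rank_frequency(tokens)
for rank, term, freq in table:
    # The last column is r_t * f_t, the quantity Zipf's law says is roughly constant.
    print(rank, term, freq, rank * freq)
```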

p_r = f_r / N is the probability that a randomly chosen term occurrence is the term of rank r, where f_r is the frequency of that term.

Dividing Zipf's law by N gives r * p_r = A, where A = C / N tends to be about 0.1 in English text.

The following statistics are from the TIME collection, a 1.6 MB collection of 423 short TIME magazine articles (245,412 term occurrences). The 50 most frequent terms are:

Word	f_t	r_t * p_r	Word	f_t	r_t * p_r	Word	f_t	r_t * p_r
the	15861	0.065	it	1290	0.095	week	793	0.113
of	7239	0.059	from	1228	0.095	they	697	0.102
to	6331	0.077	but	1138	0.093	govern	687	0.104
a	5878	0.096	u	955	0.082	all	672	0.104
and	5614	0.114	had	940	0.084	year	672	0.107
in	5294	0.129	last	930	0.087	its	620	0.101
that	2507	0.072	be	915	0.089	britain	589	0.098
for	2228	0.073	have	914	0.093	when	579	0.099
was	2149	0.079	who	894	0.095	out	577	0.101
with	1839	0.075	not	882	0.097	would	577	0.103
his	1815	0.081	has	880	0.100	new	572	0.105
is	1810	0.089	an	873	0.103	up	559	0.105
he	1700	0.090	s	865	0.106	been	554	0.106
as	1581	0.090	were	848	0.107	more	540	0.106
on	1551	0.095	their	815	0.106	which	539	0.108
by	1467	0.096	are	812	0.109	into	518	0.106
at	1333	0.092	one	811	0.112
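
The product column of the TIME table can be checked directly from the frequencies. The sketch below recomputes r * p_r for four rows (ranks 1, 2, 10, and 50), using N = 245,412 as stated for the collection. The names `sample` and `zipf_product` are assumptions made for this example.

```python
# Frequencies copied from the TIME table; keys are ranks.
N = 245_412  # total term occurrences in the TIME collection
sample = {1: ("the", 15861), 2: ("of", 7239), 10: ("with", 1839), 50: ("into", 518)}

def zipf_product(rank, freq, total=N):
    """Return r * p_r, where p_r = f_r / N."""
    return rank * freq / total

products = {rank: round(zipf_product(rank, freq), 3)
            for rank, (term, freq) in sample.items()}
for rank, (term, freq) in sample.items():
    print(rank, term, freq, products[rank])
```

The values stay within roughly a factor of two of 0.1 across the top 50 ranks, which is the sense in which the fit is only approximate.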

The following statistics are from the WSJ87 collection, a 131.6 MB collection of 46,449 newspaper articles (19 million term occurrences). The 50 most frequent terms are:

Word	f_t	r_t * p_r	Word	f_t	r_t * p_r	Word	f_t	r_t * p_r
the	1130021	0.059	from	96900	0.092	or	54958	0.101
of	547311	0.058	he	94585	0.095	about	53713	0.102
to	516635	0.082	million	93515	0.098	market	52110	0.101
a	464736	0.098	year	90104	0.100	they	51359	0.103
in	390819	0.103	its	86774	0.100	this	50933	0.105
and	387703	0.122	be	85588	0.104	would	50828	0.107
that	204351	0.075	was	83398	0.105	u	49281	0.106
for	199340	0.084	company	83070	0.109	which	48273	0.107
is	152483	0.072	an	76974	0.105	bank	47940	0.109
said	148302	0.078	has	74405	0.106	stock	47401	0.110
it	134323	0.078	are	74097	0.109	trade	47310	0.112
on	121173	0.077	have	73132	0.112	his	47116	0.114
by	118863	0.081	but	71887	0.114	more	46244	0.114
as	109135	0.080	will	71494	0.117	who	42142	0.106
at	101779	0.080	say	66807	0.113	one	41635	0.107
mr	101679	0.086	new	64456	0.112	their	40910	0.108
with	101210	0.091	share	63925	0.114

Zipf's law implies that a term with f_t occurrences has rank approximately A * N / f_t.
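
A small sketch of this rank estimate, assuming A = 0.1 and the TIME collection size given above. The function name `predicted_rank` is an assumption made for this example, and the two terms are taken from the TIME table.

```python
def predicted_rank(freq, total, a=0.1):
    """Approximate Zipf rank of a term with `freq` occurrences: A * N / f."""
    return a * total / freq

N_TIME = 245_412  # term occurrences in the TIME collection

# "has" occurs 880 times and has actual rank 28 in the TIME table.
has_rank = predicted_rank(880, N_TIME)
# "the" occurs 15861 times and has actual rank 1.
the_rank = predicted_rank(15861, N_TIME)

print(round(has_rank, 1), round(the_rank, 1))
```

The estimate is close for "has" and noticeably off for "the", which matches the table: the product r * p_r is near 0.1 in the middle of the list and lower at the very top.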

Often, several terms will have the same frequency. If the rank r_f is assigned to the last term of a group sharing the same frequency f, then there are r_f terms that occur at least f times, and r_{f+1} terms that occur at least f+1 times. The number of terms that occur exactly f times is therefore: I_f = r_f - r_{f+1} = AN/f - AN/(f+1) = AN/(f(f+1))
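
The subtraction can be checked numerically. The sketch below assumes A = 0.1 and uses the TIME collection size purely as an example value of N. The function names are assumptions made for this illustration, and the results are model estimates, not counts measured on the collection.

```python
import math

def terms_with_at_least(f, total, a=0.1):
    """Estimated number of terms occurring at least f times: A * N / f."""
    return a * total / f

def terms_with_exactly(f, total, a=0.1):
    """Estimated number of terms occurring exactly f times: A * N / (f * (f + 1))."""
    return a * total / (f * (f + 1))

N = 245_412  # example value: term occurrences in the TIME collection

for f in (1, 2, 3, 10):
    by_difference = terms_with_at_least(f, N) - terms_with_at_least(f + 1, N)
    closed_form = terms_with_exactly(f, N)
    # The two expressions agree up to floating-point error.
    assert math.isclose(by_difference, closed_form)
    print(f, round(closed_form, 1))
```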

How many unique terms are in the collection? The highest rank equals the number of unique terms. If the lowest-ranked terms occur only once (f = 1), then by Zipf's law the collection has about AN / 1 = AN unique terms.

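How many terms occur just once in the collection? I_1 = AN / 2, or 50% of the unique terms.
How many terms occur just twice in the collection? I_2 = AN / 6, or about 17% of the unique terms.
How many terms occur just three times? I_3 = AN / 12, or about 8.3% of the unique terms.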
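
These percentages follow from dividing I_f = AN / (f(f+1)) by the AN unique terms, which leaves 1 / (f(f+1)) regardless of A and N. A minimal sketch using exact fractions, where the function name `vocabulary_share` is an assumption made for this example:

```python
from fractions import Fraction

def vocabulary_share(f):
    """Share of unique terms that occur exactly f times: 1 / (f * (f + 1))."""
    return Fraction(1, f * (f + 1))

shares = {f: vocabulary_share(f) for f in (1, 2, 3)}
for f, share in shares.items():
    print(f, share, f"{float(share):.1%}")

# The shares telescope: terms occurring at most F times make up 1 - 1/(F + 1)
# of the vocabulary, so terms occurring at most 9 times account for 9/10.
up_to_nine = sum(vocabulary_share(f) for f in range(1, 10))
```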

Bibliography

Heaps, H. S. Information Retrieval: Computational and Theoretical Aspects. 1978.