By Gerard Salton

Presents a theory of indexing capable of ranking index terms, or subject identifiers, in decreasing order of importance. This leads to the choice of good document representations, and also accounts for the role of phrases and of thesaurus classes in the indexing process.

This study is typical of theoretical work in automatic information organization and retrieval, in that concepts are used from mathematics, computer science, and linguistics. A complete theory of information retrieval may emerge from an appropriate combination of these three disciplines.

**Extra resources for A Theory of Indexing**

**Example text**

Let t be the total number of distinct terms assigned to the documents, n be the total number of documents, K be the average length of the document vectors (that is, the average number of nonzero terms), and K' be the average document frequency of a term (that is, the average number of documents to which a term is assigned). In increasing order of difficulty, the following computational requirements become necessary: for the weighting system based on collection or document frequencies (formulas (4) and (5)), K' additions are needed per term; for t terms, this produces K't additions.
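The quantities defined above can be illustrated with a minimal sketch. It assumes binary term assignment (each document is a set of terms) and a non-empty collection; the function name `collection_stats` and the toy documents are hypothetical and not taken from the text. Because every nonzero entry of the term-document matrix is counted once per term, the K't additions equal the total number of term assignments, which is also Kn.

```python
from typing import Dict, List, Set


def collection_stats(docs: List[Set[str]]) -> Dict[str, float]:
    """Compute t, n, K, K' and the addition count K't for a toy collection.

    Assumes binary term assignment and at least one document and one term.
    """
    n = len(docs)

    # Document frequency of each term: number of documents it is assigned to.
    doc_freq: Dict[str, int] = {}
    for doc in docs:
        for term in doc:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    t = len(doc_freq)
    nonzero = sum(len(doc) for doc in docs)  # total term assignments

    return {
        "t": t,
        "n": n,
        "K": nonzero / n,        # average document vector length
        "K_prime": nonzero / t,  # average document frequency of a term
        # One addition per (term, document) assignment, i.e. K' * t in total.
        "additions": sum(doc_freq.values()),
    }


# Hypothetical toy collection, for illustration only.
toy_docs = [
    {"index", "term", "weight"},
    {"term", "retrieval"},
    {"index", "retrieval", "thesaurus", "term"},
]
stats = collection_stats(toy_docs)
print(stats)
```

For this toy collection the addition count is the same whether it is accumulated term by term (K't) or document by document (Kn).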

A summarization of the complexity of the significance computations is given in Table 6. Since the discrimination value measure is dependent on the collection size, the calculations become automatically much more demanding than those required for the other measures.

TABLE 6. Computational complexity of significance computations

| Significance measure | Computational requirements | Overall order (multiplications) |
|---|---|---|
| F or B | K't additions | - |
| EK | (2K' + 1)t additions; (K' + 2)t multiplications | o(K't) |
| S/N | (2K' + 1)t additions; 3K't multiplications; 2K't logarithms | o(3K't) |
| DV | (2Kn + 4n + 2)t + 2Kn + 2n multiplications; (2Kn + n + 3)t + 2Kn + n additions; (n + 1)t square roots | o(2Knt) |
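To give a feel for how much more demanding the discrimination value computation is, the following sketch evaluates the multiplication counts for one hypothetical collection. The formulas are a reading of Table 6 (in particular, which counts are multiplications and which are additions is an interpretation of the table layout), and the collection sizes are invented for illustration.

```python
def multiplication_counts(K, K_prime, n, t):
    """Multiplication counts per significance measure, as read from Table 6.

    K: average document vector length, K_prime: average document frequency,
    n: number of documents, t: number of distinct terms.
    """
    return {
        "F or B": 0,  # additions only
        "EK": (K_prime + 2) * t,
        "S/N": 3 * K_prime * t,
        "DV": (2 * K * n + 4 * n + 2) * t + 2 * K * n + 2 * n,
    }


# Hypothetical sizes, chosen so that K' * t == K * n (both count the nonzero entries).
n, t, K = 1000, 5000, 50
K_prime = K * n // t  # 10

counts = multiplication_counts(K, K_prime, n, t)
for measure, count in counts.items():
    print(f"{measure:>6}: {count:,} multiplications")
```

Under these assumed sizes the DV count is dominated by the 2Knt term, since it grows with the number of documents as well as the number of terms.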

The resulting thesaurus classes are not directly comparable to classes obtained by using only the low frequency terms for clustering purposes. However, the experimental recall-precision results may be close to those produced by the alternative, possibly preferred, methodology. The document frequency cutoff actually used for deciding on inclusion of a given term in the experimental thesauruses was 19, 15, and 19 for the CRAN, MED, and Time collections respectively; that is, terms with document frequencies smaller than or equal to the stated frequencies were included.
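The inclusion rule stated above (keep a term when its document frequency is at most the cutoff) can be sketched as follows. This is an illustration only: the function name `thesaurus_candidates` and the toy documents are hypothetical, documents are assumed to be sets of terms, and the subsequent grouping of terms into thesaurus classes is not shown.

```python
from collections import Counter
from typing import Iterable, List, Set

# Cutoffs quoted in the text for the three collections.
DF_CUTOFFS = {"CRAN": 19, "MED": 15, "Time": 19}


def thesaurus_candidates(docs: Iterable[Set[str]], cutoff: int) -> List[str]:
    """Return the terms whose document frequency is <= cutoff, sorted by name."""
    # Each document is a set, so every term counts at most once per document.
    doc_freq = Counter(term for doc in docs for term in doc)
    return sorted(term for term, df in doc_freq.items() if df <= cutoff)


# Hypothetical toy collection; a cutoff of 2 is used so the filter is visible.
toy_docs = [
    {"index", "term", "weight"},
    {"term", "retrieval"},
    {"index", "retrieval", "thesaurus", "term"},
]
candidates = thesaurus_candidates(toy_docs, cutoff=2)
print(candidates)
```

Here "term" occurs in all three documents and is excluded, while the remaining terms fall at or below the cutoff and are kept.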