Using Salient Words to Perform Categorization of Web Sites

Authors: M. Trabalka and M. Bieliková.

Reference: In Proc. of TSD 2002 - Text, Speech and Dialogue. Petr Sojka, Ivan Kopeček and Karel Pala (Eds.). Springer-Verlag, LNCS/LNAI 2448, Brno, Czech Republic, September 2002, pp. 65-72.

Abstract: In this paper we focus on the categorization of web sites. We compare some quantitative characteristics of existing web directories, analyze the vocabulary used in descriptions of web sites in the Yahoo web directory, and propose an approach to categorizing web sites automatically. Our approach is based on the novel concept of salient words. Two realizations of the proposed concept are experimentally evaluated. The former uses words typical of just one category, while the latter uses words typical of several categories. The results show that a method based on a single vocabulary is limited in its ability to properly categorize highly heterogeneous spaces such as the World Wide Web.

A PDF version (217 KB file, 8 pages) is available.

