DATAKON 2001

Sponsors of the DATAKON 2001 conference
ADASTRA, s.r.o.
AEC, s.r.o.
BERIT, a.s.
DCIT, s.r.o.
IBM Česká republika, s.r.o.
Oracle Czech, s.r.o.
Progress Software, s.r.o.
Qbizm Technologies, Inc.
SMS, s.r.o.
SYBASE Česká republika, s.r.o.
TATRA Banka, a.s.
TurboConsult, s.r.o.
Vema, a.s.

Media partner
Softwarové noviny

Analysis of text on WWW pages using important information indicators

Vojtěch Svátek, Petr Strossa and Martin Kavalec

Department of Information and Knowledge Engineering, University of Economics, Prague
nám. W. Churchilla 4, 13067 Praha 3, Czech Republic
svatek@vse.cz, kizips@vse.cz, kavalec@vse.cz

Abstract. We examine the possibility of indexing web pages using collections of words indicating the most important places in the text. Two ways of constructing such collections are investigated. The first way relies on an intellectual analysis of the domain in question; a collection of 'indicators' originally developed for automated summarisation of English texts has been adapted for the analysis of Czech web pages of commercial companies. The second way is based on learning the 'indicators' from text corpora; to alleviate the burden of indexing the training data manually, an original method has been devised that reuses the previous work of the indexers of public web directories.

Keywords: information extraction, data mining, WWW directories.
