Automatic Web Content Annotation

Jakub Ševcech

Bachelor thesis project supervised by prof. Mária Bieliková


Motivation

Everyone occasionally runs into this problem: while studying or simply reading a text, the reader reaches a point where more information or a better explanation is needed. It may be a term the reader does not understand and that therefore needs an explanation. It may also be a concept about which the reader would like to learn more than the text itself says. Such concepts usually correspond to a particular word in the text, so the additional information should be attached directly to that word.

We propose a method that automatically enriches keywords in a text with definitions or links to related pages. The method is designed for web pages in Slovak, but it can be applied to other languages with a similar structure.

Method

The method for creating annotations consists of four steps:

  1. Elimination of redundant parts of the page and selection of the text to be annotated
  2. Search for candidate words to which annotations are assigned
  3. Search for information to fill the annotations
  4. Adaptation and visualization of the annotations

Candidate words for annotation are extracted from the text of the web page after it has been translated to English. To map the extracted words back to words in the original text, we proposed and verified a method for mapping words between a text and its translation.
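The idea of mapping words between a text and its translation can be illustrated with a minimal sketch. It is not the method from the thesis, whose details are not given here; it assumes a hypothetical bilingual dictionary (source word to a set of English words) and a greedy rule that prefers the candidate with the closest relative position. The names map_words and dictionary are illustrative only.

```python
def map_words(source_tokens, translated_tokens, dictionary):
    """Map indices of translated tokens to indices of source tokens.

    dictionary is a hypothetical bilingual lookup:
    lowercase source word -> set of lowercase English words.
    Tokens without a dictionary match stay unmapped.
    """
    mapping = {}
    used = set()  # source indices that already have a partner
    for t_idx, t_word in enumerate(translated_tokens):
        # Relative position of the translated word in its text.
        t_pos = t_idx / len(translated_tokens)
        best = None  # (distance, source index) of the best candidate so far
        for s_idx, s_word in enumerate(source_tokens):
            if s_idx in used:
                continue
            if t_word.lower() not in dictionary.get(s_word.lower(), set()):
                continue
            distance = abs(s_idx / len(source_tokens) - t_pos)
            if best is None or distance < best[0]:
                best = (distance, s_idx)
        if best is not None:
            mapping[t_idx] = best[1]
            used.add(best[1])
    return mapping


# Toy data, assumed for illustration only.
slovak = ["pes", "naháňa", "mačku"]
english = ["the", "dog", "chases", "the", "cat"]
toy_dictionary = {"pes": {"dog"}, "naháňa": {"chases"}, "mačku": {"cat"}}
print(map_words(slovak, english, toy_dictionary))
```

In this toy example the articles have no dictionary entry and remain unmapped, which mirrors the trade-off between correct mappings and the share of words that get mapped at all.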

To find information to fill the annotations, we use various publicly available information retrieval services, such as Google Search or SlideShare.

We proposed a method for adapting annotations by reordering the list of links to related web pages. The order in which links are presented is based on implicit feedback from the user's interaction with the annotations.
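Reordering links from implicit feedback might look like the following sketch. It assumes that the only feedback signals are how often a link was shown and how often it was clicked, and it scores links by a smoothed click-through rate. The actual signals and scoring used in the thesis are not described here, so the LinkRanker class and its methods are hypothetical.

```python
from collections import Counter


class LinkRanker:
    """Order links by smoothed click-through rate (an assumed scoring rule)."""

    def __init__(self):
        self.shown = Counter()    # how many times each link was displayed
        self.clicked = Counter()  # how many times each link was followed

    def record_shown(self, url):
        self.shown[url] += 1

    def record_click(self, url):
        self.clicked[url] += 1

    def score(self, url):
        # Add-one smoothing: a link with no data gets a neutral score of 0.5.
        return (self.clicked[url] + 1) / (self.shown[url] + 2)

    def reorder(self, links):
        # sorted() is stable, so links with equal scores keep their order.
        return sorted(links, key=self.score, reverse=True)


ranker = LinkRanker()
for _ in range(10):
    ranker.record_shown("a")
    ranker.record_shown("b")
ranker.record_click("a")
for _ in range(6):
    ranker.record_click("b")
print(ranker.reorder(["a", "b", "c"]))
```

The smoothing keeps a link that has never been shown from being pushed to the bottom of the list before any feedback about it exists.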

Evaluation

We evaluated the proposed methods in multiple closed experiments and in an open experiment in the learning system ALEF. With the proposed method for mapping equivalent words between a text and its translation, more than 90% of the mappings were correct, but only 45% of all words in the text were mapped. To increase the number of mapped words, we implemented two enhancements of the proposed method. The first takes into account the positions of words in phrases. The second uses a dictionary transformed so that all English words are stemmed using the Porter algorithm.
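The effect of a stemmed dictionary can be sketched as follows. To keep the example self-contained, it uses a crude suffix stripper named toy_stem as a stand-in; it is not the Porter algorithm, and in practice a real implementation (for example PorterStemmer from the NLTK library) would be passed in its place. The dictionary contents and all function names are assumptions made for illustration.

```python
def toy_stem(word):
    """Crude suffix stripper; a stand-in for the Porter algorithm, NOT Porter."""
    word = word.lower()
    for suffix in ("ing", "es", "ed", "s"):
        # Strip the first matching suffix if at least 3 characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_dictionary(dictionary, stem=toy_stem):
    """Return a copy of the dictionary with every English word stemmed."""
    return {src: {stem(w) for w in targets} for src, targets in dictionary.items()}


def matches(source_word, translated_word, stemmed_dictionary, stem=toy_stem):
    """Check whether the stemmed translated word is listed for the source word."""
    return stem(translated_word) in stemmed_dictionary.get(source_word, set())


# Hypothetical dictionary entries, for illustration only.
dictionary = {"bežať": {"run"}, "kráčať": {"walk"}}
stemmed = stem_dictionary(dictionary)

# The inflected form "runs" is not in the raw dictionary,
# but it matches once both sides are stemmed.
print("runs" in dictionary["bežať"])
print(matches("bežať", "runs", stemmed))
```

Stemming both the dictionary and the translated word lets inflected English forms match their base entries, which is one way the number of mapped words can grow.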

We evaluated the quality of information gathered through publicly available services and found large differences in the quality of information provided by different services. The quality of the retrieved information, and thus the quality of the created annotations, depends greatly on the services used.

In the learning system ALEF, we evaluated the method for annotation adaptation. We found that the created order of links was always better than a random order, but it was not perfect. We believe that with more training data the order could be improved further.

Publications

Ševcech, J.
Web-page annotation. Bachelor thesis, Slovak University of Technology in Bratislava, 2011. 49 p. (in Slovak)

Ševcech, J.
Automatic Web Content Annotation. In Proc. of the Informatics and Information Technology Student Research Conference IIT.SRC 2011.
