Metadata Management for Large Information Spaces

Karol Rástočný

Doctoral thesis project supervised by prof. Mária Bieliková

Motivation and Goals

Because large information spaces are vast and heterogeneous, data processing methods rely on metadata as their main source of information. Reusing existing metadata reduces the need to preprocess raw data and makes these methods more efficient. However, large information spaces such as the Web, and especially source code repositories, are not stable. The information stored in them is modified continuously, and these modifications degrade the quality of the metadata.

In our work we address the problem of metadata management for large information spaces, focusing on three main goals:

  • a metadata model suitable for information exchange and efficient metadata maintenance;
  • a scalable metadata repository that respects the characteristics of the metadata model;
  • an approach to metadata maintenance that keeps metadata valid and consistent.


In the dissertation thesis we discussed problems of metadata management for large information spaces. We identified the main problem as the invalidation of metadata caused by the instability of information spaces. We split the solution to this problem into four main contributions of the thesis:

  • Information tags - we proposed a novel representation of descriptive metadata that is natural for systems. This representation is independent of the described resources, which fulfils the initial requirement for effective metadata maintenance.
  • Information tags repository - to make use of information tags, we proposed an information tags repository based on MongoDB that stores information tags in a model based on the standardized Open Annotation Model. In combination with the proposed SPARQL query processing algorithm, this allows integration with existing systems.
  • Robust descriptor for source code - information tags reference tagged information artifacts via robust descriptors. For this purpose we proposed a robust location descriptor and an algorithm for interpreting it, which is able to identify the tagged source code artifact in real time.
  • Stream-based metadata maintenance - to keep the information tag space valid and consistent, we proposed a method for creating, updating and removing information tags based on querying a stream of events about user and system actions over information spaces. The method executes the necessary maintenance actions after receiving results from the stream queries.
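To make the idea of an information tag anchored by a robust descriptor more concrete, the following is a minimal sketch, not the descriptor or interpreting algorithm from the thesis. It assumes a hypothetical descriptor that stores the tagged line together with one line of context on each side, and it relocates the tag in a changed file by fuzzy text similarity. The names `LocationDescriptor`, `build_descriptor`, `interpret` and `to_annotation`, the similarity weights and the threshold are all illustrative assumptions. The dictionary layout only loosely follows the Open Annotation vocabulary (annotation, body, target, selector) and is not the repository's actual schema.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class LocationDescriptor:
    """Hypothetical descriptor: the tagged line plus one line of context."""
    text: str    # the tagged line itself
    before: str  # the line directly above it ("" at the start of the file)
    after: str   # the line directly below it ("" at the end of the file)


def _context(lines: List[str], index: int) -> tuple:
    """Return the neighbouring lines of lines[index], or "" at file borders."""
    before = lines[index - 1] if index > 0 else ""
    after = lines[index + 1] if index + 1 < len(lines) else ""
    return before, after


def build_descriptor(lines: List[str], index: int) -> LocationDescriptor:
    before, after = _context(lines, index)
    return LocationDescriptor(lines[index], before, after)


def _similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; leading and trailing whitespace is ignored.
    return SequenceMatcher(None, a.strip(), b.strip()).ratio()


def interpret(descriptor: LocationDescriptor, lines: List[str],
              threshold: float = 0.6) -> Optional[int]:
    """Find the line that best matches the descriptor, or None if no line
    scores at least `threshold`. Weights and threshold are assumptions."""
    best_index, best_score = None, 0.0
    for i, line in enumerate(lines):
        before, after = _context(lines, i)
        score = (0.6 * _similarity(descriptor.text, line)
                 + 0.2 * _similarity(descriptor.before, before)
                 + 0.2 * _similarity(descriptor.after, after))
        if score > best_score:
            best_index, best_score = i, score
    return best_index if best_score >= threshold else None


def to_annotation(tag_type: str, value: str, source: str,
                  descriptor: LocationDescriptor) -> dict:
    """Wrap a tag in an annotation-like document (illustrative layout only)."""
    return {
        "@type": "oa:Annotation",
        "hasBody": {"type": tag_type, "value": value},
        "hasTarget": {
            "@type": "oa:SpecificResource",
            "hasSource": source,
            "hasSelector": {"exact": descriptor.text,
                            "prefix": descriptor.before,
                            "suffix": descriptor.after},
        },
    }


old = ["import os", "", "def load(path):",
       "    return open(path).read()", "", "def main():",
       "    print(load('a.txt'))"]
new = ["import os", "import sys", "", "def load(path, encoding='utf-8'):",
       "    return open(path, encoding=encoding).read()", "", "def main():",
       "    print(load('a.txt'))"]

descriptor = build_descriptor(old, 3)
tag = to_annotation("review-note", "check encoding", "repo/io.py", descriptor)
print(interpret(descriptor, new))
```

Because the tag stores a description of its target rather than a fixed line number, it can survive edits that shift or slightly modify the tagged line, and it can be reported as invalid when no sufficiently similar line remains.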

We evaluated the contributions of our work in the domain of the project PerConIK (Personalized Conveying of Information and Knowledge). In the project we used the proposed methods to manage metadata about source code. We deployed the implemented repository as the main information store in the PerConIK architecture. The tagger continuously processed a stream of developers' activities in IDEs (Microsoft Visual Studio and Eclipse) and web browsers, together with source code changes from git repositories.
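As a rough illustration of how a tagger can turn such an event stream into maintenance actions, the sketch below consumes events one by one and creates, updates or removes tags in response. It is a simplified assumption, not the PerConIK tagger: the event shape (`type`, `file`, `user`), the `edit-count` tag type and the function `maintain_tags` are hypothetical, and a plain loop stands in for real stream queries.

```python
from typing import Dict, Iterable, Tuple

Event = Dict[str, str]    # hypothetical shape: {"type", "file", "user"}
TagKey = Tuple[str, str]  # (file, tag type)


def maintain_tags(events: Iterable[Event]) -> Dict[TagKey, dict]:
    """Apply create, update and remove actions to a tag space while
    consuming a stream of events in arrival order."""
    tags: Dict[TagKey, dict] = {}
    for event in events:
        file = event["file"]
        if event["type"] == "edit":
            # Create the tag on the first edit, update it on later ones.
            tag = tags.setdefault((file, "edit-count"),
                                  {"value": 0, "authors": set()})
            tag["value"] += 1
            tag["authors"].add(event["user"])
        elif event["type"] == "delete":
            # The tagged artifact is gone, so its tags are no longer valid.
            for key in [k for k in tags if k[0] == file]:
                del tags[key]
        # Other event types are ignored in this sketch.
    return tags


stream = [
    {"type": "edit", "file": "io.py", "user": "anna"},
    {"type": "edit", "file": "io.py", "user": "boris"},
    {"type": "edit", "file": "old.py", "user": "anna"},
    {"type": "delete", "file": "old.py", "user": "anna"},
]
tag_space = maintain_tags(stream)
print(tag_space)
```

Since `maintain_tags` accepts any iterable, the same loop works with a generator that yields events as they arrive, which is closer to continuous processing than the fixed list used here.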

In addition, we proposed a set of information tag types that developers can use for manual tagging of source code. These tags are intended mainly for source code review. To support this process we developed the CodeReview system, which has been used in the Team project course for two years.

The extended abstract of the thesis is available in the Bulletin of the ACM Slovakia.

Selected publications

Rástočný, K., Tvarožek, M., Bieliková, M.
Web Search Results Exploration via Cluster-Based Views and Zoom-Based Navigation. Journal of Universal Computer Science, Vol. 19, Issue 16, pp. 2320-2346 (2013)
Rástočný, K., Tvarožek, M., Bieliková, M.
Supporting Search Result Browsing and Exploration via Cluster-based Views and Zoom-based navigation. In Proc. of 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Workshops. IEEE Press, pp. 297-300 (2011)
Rástočný, K., Bieliková, M.
Maintenance of Knowledge Tags within Heterogeneous Web Content. In LNCS 7703, Proc. of Current Trends in Web Engineering: ICWE 2012 International Workshops MDWE, Composable Web, WeRE, QWE, and Doctoral Consortium. Springer, pp. 216-220 (2012)
Bieliková, M., Rástočný, K.
Lightweight Semantics over Web Information Systems Content Employing Knowledge Tags. In LNCS 7518, Proc. of ER Workshops 2012. Springer, pp. 327-336 (2012)
Rástočný, K., Bieliková, M.
Metadata Anchoring for Source Code: Robust Location Descriptor Definition, Building and Interpreting. In LNCS 8056, Proc. of DEXA 2013. Springer, pp. 372-379 (2013)
Bieliková, M., Polášek, I., Barla, M., Kuric, E., Rástočný, K., Tvarožek, J., Lacko, P.
Platform Independent Software Development Monitoring: Design of an Architecture. In LNCS 8327, Proc. of SOFSEM 2014. Springer, pp. 327-336 (2014)
Rástočný, K., Bieliková, M.
Enriching Source Code by Empirical Metadata. In Proc. of ESEM 2014: 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. ACM, p. 1 (2014)
Rástočný, K., Bieliková, M.
Empirical Metadata Maintenance in Source Code Development Process. In Proc. of ECBS-EERC 2015: 2015 IEEE Fourth Eastern European Regional Conference on the Engineering of Computer Based Systems. CS IEEE Press, pp. 25-31 (2015)

