Most of existing time series representations focus on processing of static collections of time series. Stream data processing however introduces several limitations not present (or not in such extent) in static data processing such as: limited memory, single pass through the data, requirement to guarantee result precision in the presence of inherent uncertainty of future data, concept drifting ... Very few of existing time series representations provide straightforward possibility to cope with these limitations as most of them require either multiple passes through the data or statistical information about the content of the whole data collection in the data transformation process.
In our work, we focused on stream data processing, namely on representation and processing of time series data produced in potentially infinite streams of data. We analysed commonly performed tasks and methods used to process streams of data and focused on time series representations and similarity measures as means to support other processing methods.
We identified an empty space between existing symbolic representations of time series. There are symbolic representations aggregating values in small time series windows into symbols representing intervals of mean values in these windows and there were attempts to transform whole shapes in course of time series into symbols. Representations using aggregated values as symbols does not directly reflect the time series shape and representations transforming shapes in course of time series suffer from several limitations such as inability to create the representation incrementally or meaningless symbols formed in the process.
We split the main focus of our work into three goals:
To address these goals, we proposed a solution composed from three parts:
Symbolic Time Series Representation
To allow the application of various methods from the domain of text processing in time series data analysis, we proposed a time series representation: Incremental Subsequence Clustering (ISC) representing the time series as sequence of symbols. Every symbols depicts a group of similar subsequences. The transformation is based on clustering of subsequences of defined length using an incremental clustering algorithm (algorithm Leader). Identifiers of created clusters are used as symbols to replace time series subsequences. We address the problem of incremental transformation by using an incremental, greedy clustering algorithm, not limiting the number of clusters, but rather their size.
Along with the new representation, we propose a similarity measure (SymD) working on the transformed data and we formally proved that it lower bounds the most commonly used time series distance measure - Euclidean distance.
Applications of the Time Series Representation
As the base idea of the symbolic representation is to represent repeating shapes as symbols, it is not applicable on all types of time series. The method provides the best results when seasonal data with multiple inherent patterns is processed. By extensive evaluation of the ISC representation and associated SymD distance measure on various types of short and long time series we showed its applicability in time series analysis. We showed the relevance of regarding the time series as sequences of symbols (repeating shapes) by applying the proposed symbolic representation in data analysis tasks such as classification of short and long time series and in forecasting. By comparison with other, non-symbolic representations, we showed it is able to achieve similar and even superior results and it is thus an approach worth to be taken into account when searching for a method to process the data. The symbolic approach is especially able, when seasonal data with multiple inherent patterns are processed.
By applying the representation on very long time series, we showed its applicability with regard for potentially infinite time series. We also demonstrated its applicability when processing multivariate time series and when symbol length is not defined by the seasonality in the data, but by some external factor. This broadens applicability of the representations on other types of data such as medical data, where symbol length is elastic and is controlled by another factors.
Processing of Evolving Data Streams
We payed special interest to processing very long and potentially infinite, evolving streams of data as most of existing time series representations and similarity measures are incapable of or struggle when processing such data. To cope with new symbols emergence in the course of the whole data stream, we proposed several approaches limiting the size of created alphabet of symbols. All these approaches exchange the limited memory footprint for increased reconstruction error of older sections of the data streams as we use the assumption that recent parts of the data or most frequently repeated sections of the data stream are more important and thus we need to represent them with greater precision.
All proposed alphabet size control approaches are based on a counter based frequent item mining algorithm, but are not limited for a single algorithm. Depending on the application at hand, one can choose from plethora of frequent item mining algorithms proposed over past decades.
We proposed a symbolic time series representation transforming the data incrementally into symbols. We proposed a similarity measure working on sequences of symbols representing repeating shapes in the course of time series and we demonstrated its applicability in tasks of classification and forecasting on various types of data. We proposed several approaches to limit the size of created alphabet of symbols by using frequent item mining algorithm to maintain low reconstruction error on recent and frequent symbols. The combination of incremental transformation process and limited alphabet size makes the representation applicable on processing of potentially infinite streams of data and on employing of various algorithms from text processing domain in time series processing domain.The thesis extended abstract is available in the Bulletin of the ACM Slovakia.