We study the problem of succinctly summarizing a database of event sequences in terms of generalized sequential patterns. That is, we are interested in patterns that are not exclusively defined over observed surface-level events, as is usual, but rather may additionally include generalized events that can match a set of events. To avoid spurious and redundant results we define the problem in terms of the Minimum Description Length principle, by which we are after that set of patterns and generalizations that together best compress the data without loss. The resulting optimization problem does not lend itself for exact search, which is why we propose the heuristic Flock algorithm to efficiently find high-quality models in practice. Extensive experiments on synthetic and real-world data show that Flock results in compact and easily interpretable models that accurately recover the ground truth, including rare instances of generalized patterns. Additionally Flock recovers how generalized events within patterns depend on each other, and overall provides clearer insight into the data-generating process than using state of the art algorithms that only consider surface-level patterns.
ACM International Conference on Knowledge Discovery and Data Mining (KDD)
2023-08-06
2024-09-11