Summary: This paper studies the complexity of the Cartesian Tree Subsequence Matching problem (CTMSeq), an extension of the Cartesian Tree Matching problem (CTM) by Park et al. [TCS 2020] from contiguous substrings to non-contiguous subsequences. As a main result, we present an efficient algorithm for CTMSeq that runs in O(mn loglog n) time and O(n log m)-space.
From the veiw of prediction multiplicities of models, this paper studies efficient enumeration of all "good" rule lists whose training errors are optimum within some tolerance. The collection of such good rules is also known as the "Rashomon set" in machine learning literatures. We presented an efficient exact algorithm for the task, and conducted a number of experiments on the COMPAS data set to analyze prediction multiplicities and fairness of the obtained models.
In memory of Dr. Dany Breslauer: The above paper is the revised version of the earlier paper "Fully-online construction of suffix trees for multiple texts" by Takuya Takagi, Shunsuke Inenaga, and Hiroki Arimura presented in CPM2016, Tel Aviv in June 2016. During the conference, Dany kindly gave us a number of comments on how to improve the fully-only maintenance mechanism for suffix tree labels, and then we started the joint work to improve the paper with him. Later Dr. Diptarama Hendrian joined us.
After some e-mail communication, Dany gently visited us in Hokkaido University, Sapporo, on his way of private wandering journey in Japan for months in 2017. During his stay in 7-9 June at our graduate school, he gave a talk in our department seminar, and had collaboration with us. During his stay, we enjoyed the welcome 'Genghs Khan' (Hokkaido style BBQ) party at the famous 'Sapporo Beer Garden' restaurant booked by his request. We will always miss him, Dany-san. (Hiroki Arimura, created on 20 Oct 2019; Corrected some typos on 06 April 2022)
See: "Danny Breslauer, June 20, 1968 December 19, 2017" (a memorial booklet/collection decicated to Danny Breslauer by his friends in PWA2018 homepage at math.unipa.it editted by CPM/SPIRE-ML members)
The program LCM won the FIMI'04 Best Implementation Award on November 1st, 2004 in Proc. 2nd Workshop on Frequent Itemset Mining Implementations. The current version of LCM algorithm is Ver.5.3 at the time of October 2010, and is available at [Takeaki Uno's Program Code site].
The survey paper on the top ten data mining algorithms (Wu et al., KIS, 14, 2007, IEEE ICDM'06) selected the frequent itemset mining algorithm (Apriori algorithm) as one of the top ten algorithms (The rest of the algorithms are: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART). In the related chapter, the authors mentioned that the LCM algorithm is "the most efficient algorithm to find the closed itemsets."
- X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, D. Steinberg, "Top 10 algorithms in data mining," Chapter 4, The Apriori algorithm, Knowledge and Information Systems, 14(1), 1-37, 2007..
Some comments: The above paper (SDM'02) presented an efficient method for systematically generating all (labeled) ordered trees from smaller to larger in constant time per tree, independently of Zaki (2002) and Nakano (2002). The method is implemented in tree mining algorithms,
Freqtby the authors (SDM'02, PKDD'02) and TreeMinerby Zaki (KDD'02, TKDE'05). Our SDM'02 paper proposed the enumeration scheme, while the search scheme adopted was the BFS (breadth-first search) at first. Then, the successive PKDD'02 paper proposed the DFS (depth-first search) and gave an experimental comparison to the BFS scheme. This algorithm is extended for frequent ordered tree mining from XML data streams in IEEE ICDM'02 paper by Asai, Arimura, et al. At the moment of October 2010, this paper is cited by approx. 340 papers according to [Google Scholar].
- An implementation in C++ and Ruby, by Dr. Taku Kudo (Computational Linguistics Laboratory, NAIST at that time), is available at [download site]
- An implementation in Java by the authors is available at [Takeaki Uno's Program Code site].
- An application of.
Freqtin NLP mining is reported in (Morinaga, Arimura, Ikeda, Sakao, Akamine, KDD'05) [ACM DL].
- Mohammed J. Zaki, Efficiently mining frequent trees in a forest, KDD'02, ACM, 71-80, 2002. (Also appeared in: IEEE Trans. Knowl. Data Eng. 17(8), 1021-1035, 2005)
- Shin-Ichi Nakano, Efficient generation of plane trees, Information Processing Letters, 84(3), 167-172, 2002.
- Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Hiroki Arimura, Hiroshi Sakamoto, Setsuo Arikawa, Efficient Substructure Discovery from Large Semi-structured Data, SDM'02, 158-174, SIAM, 2002. (Also in IEICE Trans. on Inf. and Syst., Vol.E87-D, No.12, 2754-2763, 2004.)
- Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Setsuo Arikawa, Optimized Substructure Discovery for Semi-structured Data, PKDD'02, LNAI 2431, Springer, 1-14, 2002..
This paper presented a simple linear-time algorithm for computing the LCP array from a suffix array and its underlying text. As a consequence, we can construct the suffix tree in linear time from the suffix array for a given text as mentioned in the survey paper by Apostolico et al. (CACM 2016). This fact raised some interest in interaction between LCP and Suffix arrays, and its extensions to low space and external memory computation and to more complex data. At the moment of April 2022, this paper is cited by approx. 100 papers in [ACM Digital Library] and by approx. 640 papers according to [Google Scholar].
Abouelhoda, Kurtz, and Ohlebusch (JDA, 2004) invented the lcp-interval trees by nicely extending the bottom-up simulation technique in this paper, and then proposed a new data structure, called ESA (Enhanced Suffix Arrays). The ESA data structure enables to perform top-down search and traversal of suffix links as well as bottom-up search on suffix array data structure (suffix table) combined with additional tables , i.e., lcp table, child table, and suffix-link table (Abouelhoda, Kurtz, Ohlebusch, JDA, 2004). ESA is used in the first version of REPuter program, a software tool for computing exact repeats and palindromes in entire genomes efficiently (Kurtz, Schleiermacher, Bioinformatics, 15, 1999).
- Alberto Apostolico, Maxime Crochemore, Martin Farach-Colton, Zvi Galil, S. Muthukrishnan, "40 Years of Suffix Trees," Communications of the ACM, April 2016, Vol. 59 No. 4, Pages 66-73 
- Mohamed Ibrahim Abouelhoda, Stefan Kurtz , Enno Ohlebusch, "Replacing suffix trees with enhanced suffix arrays," Journal of Discrete Algorithms, 2(1), 53-86, March 2004.
- Stefan Kurtz and Chris Schleiermacher, "REPuter: fast computation of maximal repeats in complete genomes," Bioinformatics, 15(5), 426-427, 1999..