Invited Speakers
Alistair Moffat
Alistair Moffat has been involved in research in text compression and information retrieval for more than three decades, and has published numerous papers in the areas of index compression, text compression, and dynamic pruning mechanisms, all of which help support efficient ranked querying. Alistair is a co-author of the 1994 (revised 1999) book Managing Gigabytes, and also co-author of the 2002 book Compression and Coding Algorithms. Much of Alistair’s recent work has examined the issue of IR system evaluation, and, with other co-authors in Australia, he has focused on the relationship between models of user interactions with search results pages, and the effectiveness metrics that those interactions correspond to. Alistair was co-Chair for SIGIR 1998 in Melbourne, and for CIKM 2015, also held in Melbourne; and co-Program Committee Chair for SIGIR 2005 (Salvador, Brazil) and SIGIR 2015 (Santiago, Chile). WSDM, held in Melbourne in February 2019, was Alistair’s most recent conference project.
Alistair has been a teaching/research faculty member at the University of Melbourne since 1986, and was Department Chair from 2007-2011. During his career at Melbourne he has taught programming skills to well in excess of 15,000 undergraduate students, has authored a C programming textbook (Programming, Problem Solving, and Abstraction with C, 2002, revised 2012), and has received awards for his teaching and lecturing skills.
Alistair’s PhD was completed in 1985, at the University of Canterbury, in New Zealand, in the area of shortest path algorithms.
Gonzalo Navarro
Gonzalo Navarro completed his PhD in Computer Science in 1998 at the University of Chile, where he is currently full professor. His areas of interest include algorithms and data structures, text searching, compression, and metric space searching.
He has directed the Millennium Nucleus Center for Web Research, RIBIDI (an Ibero American project funded by CYTED), and a project funded by Yahoo! Research, apart from smaller projects. He has participated in various research projects, such as the Millennium Institute for Cell Dynamics and Biotechnology, an ECOS/CONICYT project (Chile-France cooperation), AMYRI (a CYTED project), and a Fondef project. Currently, he participates in the Millennium Nucleus Information and Coordination in Networks and in the Center for Biotechnology and Bioengineering.
He has been PC (co-)chair of several conferences: SPIRE 2001, SCCC 2004, SPIRE 2005, SIGIR 2005 Posters, IFIP TCS 2006, a track of ENC 2007, SISAP 2008, SISAP 2012, LATIN 2016, SPIRE 2018, and CPM 2018. He co-created SISAP on 2008, and was Steering Committee member of SPIRE, LATIN, and SISAP. He is a member of the Editorial Board of the journals Information Retrieval, ACM Journal of Experimental Algorithmics, and Information Systems. He has been guest editor of special issues in ACM SIGSPATIAL, Journal of Discrete Algorithmics, Information Systems, and Algorithmica. He has been PC member of more than 50 international conferences and reviewer for about 40 international journals. He has given around 50 invited talks in several universities and international conferences, including 12 plenary talks and 5 tutorials in international conferences. He created in 2005 the Workshop on Compression, Text, and Algorithms, which has become a permanent satellite of SPIRE.
He has coauthored two books published by Cambridge University Press, about 25 book chapters, 8 proceedings of international conferences (editor), more than 150 papers in international journals, and over 230 in international conferences. He is one of the most prolific and highly cited authors in Latin America.
He has advised 6 postdocs, 18 PhDs, 15 MScs, and 16 undergraduate theses.
Veli Mäkinen
Veli Mäkinen is Professor of Computer Science at the University of Helsinki. He heads the Genome-scale algorithmics research group and is the Director of the Master’s Programme in Computer Science. He has adviced 8 postdocs and supervised 4 PhD theses and 36 Master’s theses. He is currently supervising three PhD theses.
Veli Mäkinen started his career in string algorithms and compressed data structures. As of 2019, he has over 100 publications on these and related topics, with the focus being shifted towards algorithmic bioinformatics, where different high-throughput sequencing data analysis scenarios make near-linear time algorithms that work in small space an appealing target of study. Inspired by this new angle to bioinformatics algorithms, Veli Mäkinen co-authored a textbook on Genome-scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing, published by Cambridge University Press, 2015.
The new textbook works as the main material for two algorithmic bioinformatics courses taught frequently by Veli Mäkinen. He also teaches the general algorithms courses Design and Analysis of Algorithms, String Processing Algorithms, Data Compression Techniques, and projects and seminars related to these topics.
Some current research interests include studying algorithms and computational complexity when moving from sequences to variation graphs in (pan-)genomics, and studying applications of different index structures related to the variants of the Burrows-Wheeler transform.
Besides research and teaching, Veli Mäkinen has an active role in the research community. Some recent activities include co-chairing IWOCA 2016 and guest editing a special issue for Theory of Computing Systems, program committee memberships (WABI 2013-2018, MFCS 2016, RECOMB 2017-2019, SPIRE 2018, CPM 2018-2019), and a keynote talk at the ECCB 2016 workshop on Computational Pan-Genomics.
Programme
All papers accepted in SPIRE 2019 have been allocated a time slot for oral presentation of 20min and 10min respectively for regular and short papers. (Printable Programme in PDF here)
The venue is Campus Maria Zambrano that includes the school of Informatics. All talks will be presented at room G.121 (first floor) with the exception of the opening session that will be held on the main hall («ágora») within the building.
Monday, October 7th
|
09:00 |
Registration |
09:30 |
Opening |
10:00 – 11:00 |
S1 : Data Compression (chair: Rossano Venturini) |
[Regul] Approximation ratios of RePair, LongestMatch and Greedy on unary strings. |
Danny Hucke. |
[Regul] Lossless Image Compression Using List Update Algorithms. |
Arezoo Abdollahi, Neil Bruce, Shahin Kamali and Rezaul Karim. |
[Short] Rpair: Scaling up RePair with Rsync. |
Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto and Yoshimasa Takabatake. |
11:00 |
Coffee |
11:30 – 12:30 |
S2 : Information Retrieval (chair: Ricardo Baeza-Yates) |
[Regul] Position Bias Estimation for Unbiased Learning-to-Rank in eCommerce Search. |
Grigor Aslanyan and Utkarsh Porwal. |
[Short] An Optimal Algorithm to Find Champions of Tournament Graphs. |
Lorenzo Beretta, Franco Maria Nardini, Roberto Trani and Rossano Venturini. |
[Short] Network-Based Pooling for Topic Modeling on Microblog Content. |
Anais Ollagnier and Hywel Williams. |
12:30 – 13:30 |
Invited Talk 1 :♦: Alistair Moffat (AU) :♦: C/W/L Spells «Cool»: User-Based Evaluation in Information Retrieval |
13:30 – 15:30 |
Lunch (@ Hotel San Antonio El Real.) |
15:30 – 17:00 |
S3 : String Algorithms (chair: Gabriele Fici) |
[Regul] Bounds and Estimates on the Average Edit Distance. |
Michele Schimd and Gianfranco Bilardi. |
[Regul] Compact Data Structures for Shortest Unique Substring Queries. |
Takuya Mieno, Dominik Köppl, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda. |
[Regul] Fast Cartesian Tree Matching. |
Siwoo Song, Cheol Ryu, Simone Faro, Thierry Lecroq and Kunsoo Park. |
[Regul] Inducing the Lyndon Array. |
Felipe A. Louza, Sabrina Mantaci, Giovanni Manzini, Marinella Sciortino and Guilherme P. Telles. |
17:00 – 21:00 |
GUIDED VISIT: Segovia and the Alcázar |
21:00 |
Monday Dinner (@ Restaurante Casares.) |
Tuesday, October 8th
|
08:50 |
Registration |
09:00 – 11:00 |
S4 : String Algorithms (chair: Travis Gagie) |
[Regul] On Longest Common Property Preserved Substring Queries. |
Kazuki Kai, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda and Tomasz Kociumaka.
|
[Regul] Online Algorithms on Antipowers and Antiperiods. |
Mai Alzamel, Alessio Conte, Daniele Greco, Veronica Guerrini, Costas Iliopoulos, Nadia Pisanti, Nicola Prezza, Giulia Punzi and Giovanna Rosone. |
[Regul] Polynomial-Delay Enumeration of Maximal Common Subsequences. |
Alessio Conte, Roberto Grossi, Giulia Punzi and Takeaki Uno. |
[Regul] Searching Runs in Streams. |
Oleg Merkurev and Arseny Shur. |
[Regul] Weighted Shortest Common Supersequence Problem Revisited. |
Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń and Wiktor Zuba. |
11:00 |
Coffee |
11:30 – 12:30 |
S5 : Algorithms (chair: Giovanni Manzini) |
[Regul] Fast Identification of Heavy Hitters by Cached and Packed Group Testing. |
Yusaku Kaneta, Takeaki Uno and Hiroki Arimura. |
[Short] Range Shortest Unique Substring Queries. |
Paniz Abedin, Arnab Ganguly, Solon Pissis and Sharma Thankachan. |
[Short] Minimal Absent Words in Rooted and Unrooted Trees. |
Gabriele Fici and Pawel Gawrychowski. |
[Short] A new Linear-time Algorithm for Centroid Decomposition. |
Davide Della Giustina, Nicola Prezza, and Rossano Venturini. |
12:30 – 13:30 |
Invited Talk 2 :♦: Veli Mäkinen (FIN) :♦: When Stringology Meets Graphs |
13:30 – 15:30 |
Lunch (@ Hotel San Antonio El Real.) |
15:30 – 17:00 |
S6 : Computational Biology (chair: Solon P. Pissis) |
[Regul] COBS: a Compact Bit-Sliced Signature Index. |
Timo Bingmann, Phelim Bradley, Florian Gauger and Zamin Iqbal. |
[Regul] An Index for Sequencing Reads Based on The Colored de Bruijn Graph. |
Diego Diaz. |
[Regul] Linear Time Maximum Segmentation problems in Column Stream Model. |
Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen, and Tuukka Norri. |
[Regul] Space-Efficient Merging of Succinct de Bruijn Graphs. |
Lavinia Egidi, Felipe A. Louza and Giovanni Manzini. |
17:00 |
Coffee |
17:30 – 18:30 |
Business Meeting |
|
|
21:00 – 23:30 |
Conference banquet (@ Restaurante – Asador El Hidalgo.) |
Wednesday, October 9th
|
09:20 |
Registration |
09:30 – 11:00 |
S7 : Indexing and Compression (chair: Kunsoo Park) |
[Regul] Run-Length Encoding in a Finite Universe. |
N. Jesper Larsson. |
[Short] On the computation of longest previous non-overlapping factors. |
Enno Ohlebusch and Pascal Weber. |
[Regul] Direct Linear Time Construction of Parameterized Suffix and LCP Arrays for Constant Alphabets. |
Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda. |
[Regul] Parallel External Memory Wavelet Tree and Wavelet Matrix Construction. |
Jonas Ellert and Florian Kurpicz. |
[Short] SACABench: Benchmarking Suffix Array Construction. |
Johannes Bahne, Nico Bertram, Marvin Böcker, Jonas Bode, Johannes Fischer, Hermann Foot, Florian Grieskamp, Florian Kurpicz, Marvin Löbel, Oliver Magiera, Rosa Pink, David Piper and Christopher Poeplau. |
11:00 |
Coffee |
11:30 – 12:30 |
S8 : Compressed Data Structures (chair: José R. Paramá) |
[Regul] Faster Dynamic Compressed $d$-ary Relations. |
Diego Arroyuelo, Guillermo De Bernardo, Travis Gagie and Gonzalo Navarro. |
[Regul] Faster Repetition-Aware Compressed Suffix Trees based on Block Trees. |
Manuel Cáceres and Gonzalo Navarro. |
[Regul] A Practical Alphabet-Partitioning Rank/Select Data Structure. |
Diego Arroyuelo and Erick Sepúlveda. |
12:30 – 13:30 |
Invited Talk 3 :♦: Gonzalo Navarro (CHL) :♦: Repetitiveness and Indexability |
13:30 – 15:30 |
Lunch (@ Hotel San Antonio El Real.) |
15:30 – 17:00 |
S9 : Compressed Data Structures (chair: Nicola Prezza) |
[Regul] Adaptive Succinctness. |
Diego Arroyuelo and Rajeev Raman. |
[Regul] Fast, Small, and Simple Document Listing on Repetitive Text Collections. |
Dustin Cobas and Gonzalo Navarro. |
[Regul] Implementing the Topological Model Succinctly. |
Jose Fuentes, Gonzalo Navarro and Diego Seco. |
[Regul] Space- and Time-Efficient Storage of LiDAR Point Clouds. |
Susana Ladra, Miguel R. Luaces, Jose R. Parama and Fernando Silva-Coira. |
17:00 |
Closing |
Social Programme: Welcome cocktail; Guided visit+ Monday dinner; and Conference Banquet
On Sunday 6th, we will enjoy a Welcome cocktail at 21:00h at Hotel San Antonio El Real.
On Monday 7th, there will be a Guided tour around the city that will start at «Plaza del Azoguejo» (main view of the Aqueduct) at 17.30h, and will end at «Plaza Mayor» around 20:30h. We will visit the most representative areas/monuments/places of the city including among others the Aqueduct, the Jewish quarter, a guided visit to the Alcázar de Segovia (the beautiful fortress in the North-East side of the city), the Cathedral, and Plaza Mayor.
After the tour we will enjoy a «rather informal» Monday Dinner at Restaurante Casares at 21:00h.
On Tuesday 8th, we will have the conference banquet at Restaurante – Asador El Hidalgo. at 21:00h
Regular Papers Accepted
Arezoo Abdollahi, Neil Bruce, Shahin Kamali and Rezaul Karim.
Lossless Image Compression Using List Update Algorithms
Abstract: We consider lossless image compression using a technique similar to bZip2 for sequential data. Given an image represented with a matrix of pixel values, we consider different approaches for linearising the image into a sequence and then encoding the sequence using the Move-To-Front list update algorithm. In both linearisation and encoding stages, we exploit the locality present in the images to achieve encodings that are as compressed as possible. We consider a few approaches, and in particular Hilbert space-filling curve, for linearising the image. Using a natural model of locality for images introduced by Albers et al. [J. Comput. Syst. Sci. 2015], we establish the advantage of Hilbert-filling curve over other linearisation techniques such as row-major or column-major curves for preserving the locality during the linearisation. We also use a result by Angelopoulos and Schweitzer [J. ACM 2013] to select Move-To-Front as the best list update algorithm for encoding the linearised sequence. In summary, our theoretical results show that a combination of Hilbert space-filling curve and Move-To-Front encoding has advantage over other approaches. We verify this with experiments on a dataset consisting of different categories of images.
Billel Aklouche, Ibrahim Bounhas and Yahya Slimani.
BM25 Beyond Query-Document Similarity
Abstract: The massive growth of information produced and shared online has made retrieving relevant documents a difficult task. Query Expansion (QE) based on term co-occurrence statistics has been widely applied in an attempt to improve retrieval effectiveness. However, selecting good expansion terms using co-occurrence graphs is challenging. In this paper, we present an adapted version of the BM25 model, which allows measuring the similarity between terms. First, a context window-based approach is applied over the entire corpus in order to construct the term co-occurrence graph. Afterward, using the proposed adapted version of BM25, candidate expansion terms are selected according to their similarity with the whole query. This measure stands out by its ability to evaluate the discriminative power of terms and select semantically related terms to the query. Experiments on two ad-hoc TREC collections (the standard Robust04 collection and the new TREC Washington Post collection) show that our proposal outperforms the baselines over three state-of-the-art IR models and leads to significant improvements in retrieval effectiveness.
Mai Alzamel, Alessio Conte, Daniele Greco, Veronica Guerrini, Costas Iliopoulos, Nadia Pisanti, Nicola Prezza, Giulia Punzi and Giovanna Rosone.
Online Algorithms on Antipowers and Antiperiods
Abstract: The definition of antipower introduced by Fici et al. (ICALP 2016) captures the notion of being the opposite of a power: a sequence of k pairwise distinct blocks of the same length.
Recently, Alamro et al. (CPM 2019) defined a string to have an antiperiod if it is a prefix of an antipower, and gave complexity bounds for the offline computation of the minimum antiperiod and all the antiperiods of a word.
In this paper, we address the same problems in the online setting. Our solutions rely on new arrays that compactly and incrementally store antiperiods and antipowers as the word grows, obtaining in the process this information for all the word’s prefixes. We show how to compute those arrays online in O(n \log n) space, O(n\log n) time, and o(n^\epsilon) delay per character, for any constant \epsilon>0. Running times are worst-case and hold with high probability. We also discuss more space-efficient solutions returning the correct result with high probability, and small data structures to support random access to those arrays.
Diego Arroyuelo, Guillermo De Bernardo, Travis Gagie and Gonzalo Navarro.
Faster Dynamic Compressed $d$-ary Relations
Abstract: The $k^2$-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to $d$ dimensions, where it is called a $k^d$-tree. The representation boils down to a long bitvector. We show that interpreting the $k^d$-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector as done in previous work, yields operation times that are below the lower bound of dynamic bitvectors and offers improved time performance in practice.
Diego Arroyuelo and Erick Sepúlveda.
A Practical Alphabet-Partitioning Rank/Select Data Structure
Abstract: This paper proposes a practical implementation of an alphabet-partitioning compressed data structure, which represents a string within compressed space and supports the fundamental operations rank and select efficiently. We show experimental results that indicate that our implementation outperforms the current realizations of the alphabet-partitioning approach (which is one of the most efficient approaches in practice). In particular, the time for operation select can be reduced by about 80%, using only 11% more space than current alphabet-partitioning schemes. We also show the impact of our data structure on several applications, like the intersection of inverted lists (where improvements of up to 60% are achieved, using only 2% of extra space), and the distributed-computation processing of rank and select operations. As far as we know, this is the first study about the support of rank/select operations on a distributed-computing environment.
Diego Arroyuelo and Rajeev Raman.
Adaptive Succinctness
Abstract: Bit-vectors are among the most fundamental building blocks of succinct data structures. Although a random bit-string of u bits is not compressible, bit-strings encountered in practice are compressible. We propose new measures of compressibility of bit-strings and give bit-vector data structures that support rank/select in space close to these measures.
In particular, we show that for certain static sets, one can obtain smaller encodings, breaking the already known (worst-case) information theory lower bound for representing sets. Let C_n denote the class of (u choose n) distinct sets of n elements over the universe [1..u]. Let also C^{n}_{g} \subset C_n contain the sets whose n elements are arranged in g \le n runs of l_i \ge 1 successive elements each (i=1,…, g), and C^{n}_{g, r} \subset C^{n}_{g} contain all sets that consist of g runs of successive elements, such that r \le g of them have at least 2 elements. This paper yields the following insights and contributions for rank/select succinct data structures:
– We introduce new information-theory lower bounds for representing sets. We prove that |C^{n}_{g}| = (u-n+1 choose g)*(n-1 choose g-1) \le (u choose n).Thus, identifying a set S \in C^{n}_{g} requires (in the worst case) at least L_1 = lg{|C^{n}_{g}|} = lg{(u-n+1 choose g)} + lg{(n-1 choose g-1)} bits. Moreover, we prove that |C^{n}_{g,r}| = (u-n+1 choose g)*(n-g-1 choose r-1)*(g choose r) \le |C^{n}_{g}|, hence L_2 = lg{(u-n+1 choose g)} + lg{(n-g-1 choose r-1)} + \lg{(g choose r)}$ bits are necessary. This indicates that, among all sets of n elements over [1..u], there are some sets that can be represented more succinctly than others.
– We introduce adaptive succinct data structures, using space close to bounds L_1 and L_2. We show that lg{(u choose g)} + lg{(n choose g)} + O(\frac{u^\varepsilon}{n^c}) bits of space suffice to support rank, select, and member in O(1) time. We also show that lg{(u choose g)} + lg{(n choose r)} + lg{(g choose r)} + O(\frac{u^\varepsilon}{n^c}) bits of space suffice to support constant-time operations.
Grigor Aslanyan and Utkarsh Porwal.
Position Bias Estimation for Unbiased Learning-to-Rank in eCommerce Search
Abstract: The Unbiased Learning-to-Rank framework has been recently proposed as a general approach to systematically remove biases, such as position bias, from learning-to-rank models. The method takes two steps – estimating click propensities and using them to train unbiased models. Most common methods proposed in the literature for estimating propensities involve some degree of intervention in the live search engine. An alternative approach proposed recently uses an Expectation Maximization (EM) algorithm to estimate propensities by using ranking features for estimating relevances. In this work we propose a novel method to directly estimate propensities which does not use any intervention in live search or rely on modeling relevance. Rather, we take advantage of the fact that the same query-document pair may naturally change ranks over time. This typically occurs for eCommerce search because of change of popularity of items over time, existence of time dependent ranking features, or addition or removal of items to the index (an item getting sold or a new item being listed). However, our method is general and can be applied to any search engine for which the rank of the same document may naturally change over time for the same query. We derive a simple likelihood function that depends on propensities only, and by maximizing the likelihood we are able to get estimates of the propensities. We apply this method to eBay search data to estimate click propensities for web and mobile search and compare these with estimates using the EM method. We also use simulated data to show that the method gives reliable estimates of the «»true»» simulated propensities. Finally, we train an unbiased learning-to-rank model for eBay search using the estimated propensities and show that it outperforms both baselines – one without position bias correction and one with position bias correction using the EM method.
Gianfranco Bilardi and Michele Schimd.
Bounds and Estimates on the Average Edit Distance
Abstract: The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. An open problem is the exact value of the average edit distance $\alpha_k(n) n$, between random strings of $n$ characters from an alphabet of a given size $k$. Moreover, while it is known that for increasing $n$, $\alpha_{k}(n)$ approaches a limit $\alpha_{k}$, the exact value of this limit is unknown, for any $k>1$.
This paper presents an upper bound to $\alpha_{k}$ based on the exact computation of some $\alpha_k(n)$ and a lower bound to $\alpha_{k}$ based on combinatorial arguments on edit scripts. Statistical estimates of $\alpha_{k}$ are also obtained, with analysis of error and of confidence intervals. The techniques are applied to several alphabet sizes $k$. In particular, for a binary alphabet, the rigorous bounds are $0.1742 \leq \alpha_2 \leq 0.3693$ while the obtained estimate is $\alpha_2 \approx 0.2888$; for a quaternary alphabet, $0.3598 \leq \alpha_4 \leq 0.6318$ and $\alpha_4 \approx 0.5180$. These values improve over the best published results.
Timo Bingmann, Phelim Bradley, Florian Gauger and Zamin Iqbal.
COBS: a Compact Bit-Sliced Signature Index
Abstract: We present COBS, a compact bit-sliced signature index, which is a cross-over between an inverted index and Bloom filters. Our target application is to index k-mers of DNA samples or q-grams from text documents and process approximate pattern matching queries on the corpus with a user-chosen coverage threshold. Query results may contain a number of false positives which decreases exponentially with the query length and the false positive rate of the index determined at construction time. We compare COBS to seven other index software packages on 100 000 microbial DNA samples. COBS’ compact but simple data structure outperforms the other indexes in construction time and query performance with Mantis by Pandey et al. on second place. However, different from Mantis and other previous work, COBS does not need the complete index in RAM and is thus designed to scale to larger document sets.
Manuel Cáceres and Gonzalo Navarro.
Faster Repetition-Aware Compressed Suffix Trees based on Block Trees
Abstract: Suffix trees are a fundamental data structure in stringology, but their space usage, though linear, is an important problem in applications. We design and implement a new compressed suffix tree targeted to highly repetitive texts, such as large genomic collections of the same species. Our suffix tree builds on Block Trees, a recent Lempel-Ziv-bounded data structure that captures the repetitiveness of its input. We use Block Trees to compress the topology of the suffix tree, and augment the Block Tree nodes with data that speeds up suffix tree navigation.
Our compressed suffix tree is slightly larger than previous repetition-aware suffix trees based on grammars, but outperforms them in time, often by orders of magnitude. The component that represents the tree topology achieves a speed comparable to that of general-purpose compressed trees, while using 2.3–10 times less space, and might be of independent interest.
Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen and Tuukka Norri.
Linear Time Maximum Segmentation problems in Column Stream Model
Abstract: We study a lossy compression scheme linked to a biological problem about founder reconstruction: The goal in founder reconstruction is to replace a set of strings with a smaller set of founders such that the original connections are maintained as well as possible. A general formulation of this problem is NP-hard, but when limiting to reconstructions that form a segmentation of the input strings, polynomial time solutions exist. We proposed in our earlier work (WABI 2018) a linear time solution to a formulation where minimum segment length was bounded, but it was left open if the same running time can be obtained when the targeted compression level (number of founders) is bounded and lossyness is minimized. This optimization is captured by the Maximum Segmentation problem: Given a threshold M and a set R = {R1,…,Rm} of strings of the same length n, find a minimum cost partition P where for each segment [i,j] in P, the compression level |{Rk[i,j]: 1 <= k <= m}| is bounded from above by M. We give linear time algorithms to solve the problem for two different (compression quality) measures on P: the average length of the intervals of the partition and the length of the minimal interval of the partition. These algorithms make use of positional Burrows--Wheeler transform and the range maximum queue, an extension of range maximum queries to the case where the input string can be operated as a queue. For the latter, we present a new solution that may be of independent interest. The solutions work in a streaming model where one column of the input strings is introduced at a time.
Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń and Wiktor Zuba.
Weighted Shortest Common Supersequence Problem Revisited
Abstract: We revisit the Weighted Shortest Common Supersequence (WSCS) problem, that is, the SCS problem on weighted strings, that was introduced by Amir et al. [SPIRE 2011]. A weighted string, also known as a Position Weight Matrix (PWM), is a sequence of probability distributions over the alphabet. In the WSCS problem we are given two weighted strings $W_1$ and $W_2$ and a threshold $1/z$ on probability and are to compute the shortest (standard) string $S$ such that each of $W_1$, $W_2$ matches a subsequence of $S$ (not necessarily the same) with probability at least $1/z$. Amir et al. showed that a log-probability version of this problem is NP-complete.
We present an algorithm that solves the WSCS problem for two weighted strings of length $n$ and over a constant-sized alphabet in $O(n^2\sqrt{z} \log{z})$ time. This matches conditional lower bounds since it is known that WSCS cannot be solved in $O(n^{2-\varepsilon})$ or $O^*(z^{0.5-\varepsilon})$ time unless there is a breakthrough improving upon long-standing upper bounds for fundamental NP-hard problems (CNF-SAT and Subset Sum, respectively).
We also discover a fundamental difference between the WSCS problem and the Weighted Longest Common Subsequence (WLCS) problem that was introduced by Amir et al. [JDA 2010] by showing that WLCS cannot be solved in $n^{O(1)} f(z)$ time, for any function $f(z)$, unless $P=NP$.
Dustin Cobas and Gonzalo Navarro.
Fast, Small, and Simple Document Listing on Repetitive Text Collections
Abstract: Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. Instead, there are not many analogous indexes for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length $n$ that lists the $\ndoc$ distinct documents where a pattern of length $m$ appears in time $\Oh(m+\ndoc \cdot \log n)$. We exploit the repetitiveness of the {\em document array} (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, and store them in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.
Alessio Conte, Roberto Grossi, Giulia Punzi and Takeaki Uno.
Polynomial-Delay Enumeration of Maximal Common Subsequences
Abstract: A Maximal Common Subsequence (MCS) between two strings X and Y is an inclusion-maximal subsequence of both X and Y. MCSs are a natural generalization of the classical concept of Longest Common Subsequence (LCS), which can be seen as a longest MCS. We study the problem of efficiently listing all the distinct MCSs between two strings. As discussed in the paper, this problem is algorithmically challenging as the same MCS cannot be listed multiple times: for example, dynamic programming [Fraser et al., CPM 1998] incurs in an exponential waste of time, and a recent algorithm for finding an MCS [Sakai, CPM 2018] does not seem to immediately extend to listing. We follow an alternative and novel graph-based approach, proposing the first output-sensitive algorithm for this problem: it takes polynomial time in n per MCS found, where n = max {|X|,|Y|}, with polynomial preprocessing time and space.
Diego Diaz.
An Index for Sequencing Reads Based on The Colored de Bruijn Graph
Abstract: We present a succinct index for representing a collection of DNA sequencing reads. Our index is a de Bruijn graph of order k of the set, in which the path of every read is assigned a particular color. We reduce the space usage of the data structure by performing an incomplete coloring, i.e., not all the nodes in the path of a read are colored, and by reusing the same colors in those reads that do not have (k − 1)- substrings in common, whenever it is possible. Additionally, we propose two algorithms that work on top of the index, one for reconstructing the reads from the graph and other for contig assembly. These algorithms exploit the color information to produce better results than uncolored de Bruijn graphs. Experimental results show that our index uses about half the space of the plain representation of the input collection and that more than 99% of the original reads can be reconstructed just from our data structure.
Lavinia Egidi, Felipe A. Louza and Giovanni Manzini.
Space-Efficient Merging of Succinct de Bruijn Graphs
Abstract: We propose a new algorithm for merging succinct representations of de Bruijn graphs introduced in [Bowe et al. WABI 2012]. Our algorithm is based on the lightweight BWT merging approach by Holt and McMillan [Bionformatics 2014, ACM-BCB 2014]. Our algorithm has the same asymptotic cost of the state of the art tool for the same problem presented by Muggli et al. [bioRxiv 2017, 2019], but it uses less than half of its working space. A novel important feature of our algorithm, not found in any of the existing tools, is that it can compute the Variable Order succinct representation of the union graph within the same asymptotic time/space bounds.
Jonas Ellert and Florian Kurpicz.
Parallel External Memory Wavelet Tree and Wavelet Matrix Construction
Abstract: We present the first parallel external memory wavelet tree and matrix construction algorithm. The algorithm’s throughput is nearly the same as the hard disk drives’ throughput, using six cores. We also present the fastest (parallel) semi-external construction algorithms for both wavelet trees and matrices.
Jose Fuentes, Gonzalo Navarro and Diego Seco.
Implementing the Topological Model Succinctly
Abstract: We show that the topological model, a semantically rich standard to represent GIS data, can be encoded succinctly while efficiently answering a number of topology-related queries. We build on recent succinct planar graph representations so as to encode a model with $m$ edges within $4m+o(m)$ bits and answer various queries relating nodes, edges, and faces in $o(\log\log m)$ time, or any time in $\omega(\log m)$ for a few complex ones.
Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda.
Direct Linear Time Construction of Parameterized Suffix and LCP Arrays for Constant Alphabets
Abstract: We present the first worst-case linear time algorithm that directly computes the parameterized suffix and LCP arrays for constant sized alphabets. Previous algorithms either required quadratic time or the parameterized suffix tree to be built first. More formally, for a string over static alphabet $\Sigma$ and parameterized alphabet $\Pi$, our algorithm runs in $O(n\pi)$ time and $O(n)$ words of space, where $\pi$ is the number of distinct symbols of $\Pi$ in the string.
Danny Hucke.
Approximation ratios of RePair, LongestMatch and Greedy on unary strings
Abstract: A grammar-based compressor computes for a given input w a context-free grammar that produces only w. So-called global grammar-based compressors (RePair, LongestMatch and Greedy) achieve impressive practical compression results, but the recursive character of those algorithms makes it hard to achieve strong theoretical results. To this end, this paper studies the approximation ratio of those algorithms for unary input strings, which is strongly related to the field of addition chains. We show that in this setting, RePair and LongestMatch produce equal size grammars that are by a factor of at most log2(3) larger than a smallest grammar. We also provide a matching lower bound. The main result of this paper is a new lower bound for Greedy of 1.348…, which improves the best known lower bound for arbitrary (not necessarily unary) input strings.
N. Jesper Larsson.
Run-Length Encoding in a Finite Universe
Abstract: Text compression schemes and succinct data structures usually combine sophisticated probability models with basic coding methods whose average codeword length closely match the entropy of known distributions. In the frequent case where basic coding represents run-lengths of outcomes that have probability p, i.e. geometric distribution Pr(i)=pⁱ(1-p), a Golomb code is an optimal instantaneous code, which has the additional advantage that codewords can be computed using only an integer parameter calculated from p, without need for a large or sophisticated data structure. Golomb coding does not, however, gracefully handle the case where run-lengths are bounded by a known integer n. In this case, codewords allocated for the case i>n are wasted. While negligible for large n, this makes Golomb coding unattractive in situations where n is recurrently small, e.g., when representing many short lists of integers drawn from limited ranges, or when the range of n is narrowed down by a recursive algorithm.
We address the problem of choosing a code for this case, considering efficiency from both information-theoretic and computational perspectives, and arrive at a simple code that allows computing a codeword using only O(1) simple computer operations and O(1) machine words. We demonstrate experimentally that the resulting representation length is very close (equal in a majority of tested cases) to the optimal Huffman code, to the extent that the expected difference is practically negligible. We describe efficient branch-free implementation of encoding and decoding.
Kazuki Kai, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda and Tomasz Kociumaka.
On Longest Common Property Preserved Substring Queries
Abstract: We revisit the problem of longest common property preserving substring queries first proposed by~Ayad et al. (SPIRE 2018, arXiv 2018). We consider a generalized and unified on-line setting, where we are given a set $X$ of $k$ strings with total length $n$ that can be pre-processed, so that, when given a query string $y$ and a positive integer $k’\leq k$, we would like to answer the longest substring of $y$ that satisfies some specific property and is common to at least $k’$ strings in $X$. Ayad et al. considered the longest square-free substring in the on-line setting, and the longest periodic and palindromic substring in the off-line setting. In this paper, we give efficient solutions in the on-line setting for finding the longest common square, periodic, palindromic, and Lyndon substrings. More precisely, we show that $X$ can be pre-processed in $O(n)$ time to build a data structure of $O(n)$ size so that queries can be answered in $O(|y|\log\sigma)$ time and $O(1)$ working space, where $\sigma$ is the size of the alphabet, when the common substring must be a square, a periodic substring, a palindrome, or a Lyndon word.
Yusaku Kaneta, Takeaki Uno and Hiroki Arimura.
Fast Identification of Heavy Hitters by Cached and Packed Group Testing
Abstract: The $\epsilon$-approximate $\phi$-heavy hitters problem is, for any element from some universe $\mathbb{U}=[0..n)$, to maintain its frequency under an arbitrary data stream of form $(x_i, \Delta_i)\in\mathbb{U}\times\mathbb{Z}$ that changes the frequency of $x_i$ by $\Delta_i$, such that one can report every element with frequency more than $\phi N$ and no element with frequency no more than $(\phi-\epsilon)N$ for $N=\sum_i \Delta_i$ and prespecified $\epsilon, \phi\in\mathbb{R}$. To solve this problem in small space, Cormode and Muthukrishnan~(ACM TODS, 2005) have proposed an $O(\tau\epsilon^{-1}\lg{n})$-space probabilistic data structure with good practical performance, where $\tau=\lg{(1/(\delta\phi))}$ for any failure probability $\delta\in\mathbb{R}$. In this paper, we improve its output time from $O(\tau\epsilon^{-1}(\lg{n}+\tau))$ to $O(\tau^2\epsilon^{-1})$ for arbitrary updates ($\Delta_i\in\mathbb{Z}$ for any~$i$) and its update time from $O(\tau\lg{n})$ to $O(\tau)$ for unitary updates ($\Delta_i\in\{\pm 1\}$ for any~$i$) with the same space and output guarantee, by removing $\lg{n}$ terms that are basically application-specific (e.g.~$64$ in tracking each IPv6 address) and not tunable unlike $\delta$, $\epsilon$, and $\phi$.
Susana Ladra, Miguel R. Luaces, Jose R. Parama and Fernando Silva-Coira.
Space- and Time-Efficient Storage of LiDAR Point Clouds
Abstract: LiDAR devices obtain a 3D representation of a space. Due to the large size of the resulting datasets, there already exist storage methods that use compression and present some properties that resemble those of compact data structures. Specifically, LAZ format allows accesses to a given datum or portion of the data without having to decompress the whole dataset and provides indexation of the stored data. However, LAZ format still have some drawbacks that should be faced.
In this work, we propose a new compact data structure for the representation of a cloud of LiDAR points that supports efficient queries, providing indexing capabilities that are superior to those of LAZ format.
Felipe A. Louza, Sabrina Mantaci, Giovanni Manzini, Marinella Sciortino and Guilherme P. Telles.
Inducing the Lyndon Array
Abstract: In this paper we propose a variant of the induced suffix sorting algorithm by Nong (TOIS, 2013) that computes simultaneously the Lyndon array and the suffix array of a text in O(n) time using \sigma + O(1) words of working space, where n is the length of the text and \sigma is the alphabet size. Our result improves the previous best space requirement for linear time computation of the Lyndon array. In fact, all the known linear algorithms for Lyndon array computation use suffix sorting as a preprocessing step and use O(n) words of working space in addition to the Lyndon array and suffix array. Experimental results with real and synthetic datasets show that our algorithm is not only space-efficient but also fast in practice.
Oleg Merkurev and Arseny Shur.
Searching Runs in Streams
Abstract: Runs, or maximal periodic substrings, capture the whole picture of periodic fragments in a string. Computing all runs of a string in the usual RAM model is a well-studied problem. We approach this problem in the streaming model, where input symbols arrive one by one and the available space is sublinear. We show that no streaming algorithm can identify all squares, and hence all runs, in a string. Our main contribution is a Monte Carlo algorithm which approximates the set of runs in the length-$n$ input stream $S$. Given the error parameter $\varepsilon=\varepsilon(n)\in(0,\frac12]$, the algorithm reports a set of periodic substrings such that for each run of exponent $\beta\ge 2+\varepsilon$ in $S$ a single its substring is reported, with the same period and the exponent at least $\beta-\varepsilon$; for runs of smaller exponent, zero or one substring of exponent at least 2 is reported. The algorithm uses $O(\frac{\log^2 n}{\varepsilon})$ words of memory and spends $O(\log n)$ operations, including dictionary operations, per symbol.
Takuya Mieno, Dominik Köppl, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda.
Compact Data Structures for Shortest Unique Substring Queries
Abstract: Given a string T of length n, a substring u = T[i.. j] of T is called a shortest unique substring (SUS) for an interval [s, t] if (a) u occurs exactly once in T, (b) u contains the interval [s, t] (i.e. i \leq s \leq t \leq j), and (c) every substring v of T with |v| < |u| containing [s, t] occurs at least twice in T. Given a query interval [s, t] \subset [1, n], the interval SUS problem is to output all the SUSs for the interval [s, t]. In this article, we propose a 4n + o(n) bits data structure answering an interval SUS query in output-sensitive O(occ) time, where occ is the number of returned SUSs. Additionally, we focus on the point SUS problem, which is the interval SUS problem for s = t. Here, we propose a \lceil (log2 3 + 1)n \rceil + o(n) bits data structure answering a point SUS query in the same output-sensitive time.
Siwoo Song, Cheol Ryu, Simone Faro, Thierry Lecroq and Kunsoo Park.
Fast Cartesian Tree Matching
Abstract: Cartesian tree matching is the problem of finding all substrings of a given text which have the same Cartesian trees as that of a given pattern. So far there is one linear-time solution for Cartesian tree matching, which is based on the KMP algorithm. We improve the running time of the previous solution by introducing new representations. We present the framework of a binary filtration method and an efficient verification technique for Cartesian tree matching. Any string matching algorithm can be used as a filtration for Cartesian tree matching on our framework. We also present a SIMD solution for Cartesian tree matching suitable for short patterns. By experiments we show that known string matching algorithms combined on our framework of binary filtration and efficient verification produce algorithms of good performances for Cartesian tree matching.
Short Papers Accepted
Paniz Abedin, Arnab Ganguly, Solon Pissis and Sharma Thankachan.
Range Shortest Unique Substring Queries
Abstract: Let $\T[1,n]$ be a string of length $n$ and $\T[i,j]$ be the substring of $\T$ starting at position $i$ and ending at position $j$. A substring $\T[i,j]$ of $\T$ is a repeat if it occurs more than once in $\T$, otherwise it is a unique substring of $\T$. Repeats and unique substrings are of great interest in computational biology and in information retrieval. Given string $\T$ as input, the {\it Shortest Unique Substring} problem is to find a shortest substring of $\T$ that does not occur elsewhere in $\T$. In this paper we introduce the range variant of this problem, which we call the {\it Range Shortest Unique Substring} problem. The task is to construct a data structure over $\T$ answering the following type of online queries efficiently. Given a range $[\alpha, \beta]$, return a shortest substring $\T[i,j]$ of $\T$ with exactly one occurrence in $[\alpha, \beta]$. We present an $\cO(n\log n)$-word data structure with $\cO(\log_w n)$ query time, where $w=\Omega(\log n)$ is the word size. Our construction is based on a non-trivial reduction allowing us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018].
Johannes Bahne, Nico Bertram, Marvin Böcker, Jonas Bode, Johannes Fischer, Hermann Foot, Florian Grieskamp, Florian Kurpicz, Marvin Löbel, Oliver Magiera, Rosa Pink, David Piper and Christopher Poeplau.
SACABench: Benchmarking Suffix Array Construction
Abstract: We present a practical comparison of suffix array construction algorithms on modern hardware. The benchmark is conducted using our new benchmark framework SACABench, which allows for an easy deployment of publicly available implementations, simple plotting of the results, and straight forward support to include new construction algorithms. We use the framework to develop a construction algorithm running on the GPU that is competitive with the fastest parallel algorithm in our test environment.
Lorenzo Beretta, Franco Maria Nardini, Roberto Trani and Rossano Venturini.
An Optimal Algorithm to Find Champions of Tournament Graphs
Abstract: A tournament graph $T = (V, E )$ is an oriented complete graph, which can be used to model a round-robin tournament between $n$ players. In this short paper, we address the problem of finding a champion of the tournament, also known as Copeland winner, which is a player that wins the highest number of matches. Our goal is to solve the problem by minimizing the number of arc lookups, i.e., the number of matches played. We prove that finding a champion requires $\Omega(\ell n)$ comparisons, where $\ell$ is the number of matches lost by the champion, and we present a deterministic algorithm that matches this lower bound. Solving this problem has important implications on several Information Retrieval applications including Web search, conversational AI, machine translation, question answering, recommender systems, etc.
Davide Della Giustina, Nicola Prezza and Rossano Venturini.
A new Linear-time Algorithm for Centroid Decomposition
Abstract: The centroid of a tree is a node that, when removed, breaks the tree in subtrees of size at most half of that of the original tree. By recursing this procedure on the subtrees, one obtains the centroid decomposition of the tree, also known as centroid tree. The centroid tree has logarithmic height and its construction is a powerful pre-processing step in several tree-processing algorithms. The folklore recursive algorithm for computing the centroid tree runs in O(n log n) time. To the best of our knowledge, the only result claiming O(n) time is unpublished and relies on (dynamic) heavy path decomposition of the original tree. In this short paper, we describe a new simple and practical linear-time algorithm for the problem based on the idea of applying the folklore algorithm to a suitable decomposition of the original tree.
Gabriele Fici and Pawel Gawrychowski.
Minimal Absent Words in Rooted and Unrooted Trees
Abstract: We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet $\Sigma$ of cardinality $\sigma$. We show that the set $\MAW(T)$ of minimal absent words of a rooted (resp.~unrooted) tree $T$ with $n$ nodes has cardinality $O(n\sigma)$ (resp.~$O(n^{2}\sigma)$), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp.~unrooted) tree in output-sensitive time $O(n+|\MAW(T)|)$ (resp.~$O(n^{2}+|\MAW(T)|)$.
Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto and Yoshimasa Takabatake.
Rpair: Scaling up RePair with Rsync
Abstract: Data compression is a powerful tool for managing massive but repetitive datasets, especially compression schemes supporting computation on the data without decompression. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be applied to the original dataset quickly and in external memory. In this paper we show how we can preprocess massive repetitive datasets with a variant of the popular tool Rsync, such that afterwards we can apply RePair and other grammar-based compressors using much less time and space. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds for grammar-based compression, and finally give experimental evidence that our approach is competitive in practice.
Enno Ohlebusch and Pascal Weber.
On the computation of longest previous non-overlapping factors
Abstract: The f-factorization of a string is similar to the well-known Lempel-Ziv (LZ) factorization, but differs from it in that the factors must be non-overlapping. There are two linear time algorithms that compute the f-factorization. Both of them compute the array of longest previous non-overlapping factors (LPnF-array), from which the f-factorization can easily be derived. In this paper, we present a simple algorithm that computes the LPnF-array from the LPF-array and an array prevOcc that stores positions of previous occurrences of LZ-factors. The algorithm has a linear worst-case time complexity if prevOcc contains leftmost positions. Moreover, we provide an algorithm that computes the f-factorization directly. Experiments show that our first algorithm (combined with efficient LPF-algorithms) is the fastest and our second algorithm is the most space efficient method to compute the f-factorization.
Anais Ollagnier and Hywel Williams.
Network-Based Pooling for Topic Modeling on Microblog Content
Abstract: This paper investigates a new tweet-pooling method based on network structures associated with Twitter content. Using a standard formulation of the well-known Latent Dirichlet Allocation (LDA) topic model, we trained various models using different tweet-pooling schemes on three diverse Twitter datasets. Tweet-pooling schemes were created based on mention/reply relationships between tweets and Twitter users, with several (non-networked) established methods also tested as a comparison. Results show that pooling tweets using network information gives better topic coherence and clustering performance than other pooling schemes, on the majority of datasets tested. Our findings contribute to an improved methodology for topic modeling with Twitter content.