# Invited Speakers

**Alistair Moffat**

Alistair Moffat has been involved in research in text compression and information retrieval for more than three decades, and has published numerous papers in the areas of index compression, text compression, and dynamic pruning mechanisms, all of which help support efficient ranked querying. Alistair is a co-author of the 1994 (revised 1999) book Managing Gigabytes, and also co-author of the 2002 book Compression and Coding Algorithms. Much of Alistair’s recent work has examined the issue of IR system evaluation, and, with other co-authors in Australia, he has focused on the relationship between models of user interactions with search results pages, and the effectiveness metrics that those interactions correspond to. Alistair was co-Chair for SIGIR 1998 in Melbourne, and for CIKM 2015, also held in Melbourne; and co-Program Committee Chair for SIGIR 2005 (Salvador, Brazil) and SIGIR 2015 (Santiago, Chile). WSDM, held in Melbourne in February 2019, was Alistair’s most recent conference project.

Alistair has been a teaching/research faculty member at the University of Melbourne since 1986, and was Department Chair from 2007 to 2011. During his career at Melbourne he has taught programming skills to well in excess of 15,000 undergraduate students, has authored a C programming textbook (Programming, Problem Solving, and Abstraction with C, 2002, revised 2012), and has received awards for his teaching and lecturing skills.

Alistair’s PhD was completed in 1985, at the University of Canterbury, in New Zealand, in the area of shortest path algorithms.

**Gonzalo Navarro**

Gonzalo Navarro completed his PhD in Computer Science in 1998 at the University of Chile, where he is currently full professor. His areas of interest include algorithms and data structures, text searching, compression, and metric space searching.

He has directed the Millennium Nucleus Center for Web Research, RIBIDI (an Ibero American project funded by CYTED), and a project funded by Yahoo! Research, apart from smaller projects. He has participated in various research projects, such as the Millennium Institute for Cell Dynamics and Biotechnology, an ECOS/CONICYT project (Chile-France cooperation), AMYRI (a CYTED project), and a Fondef project. Currently, he participates in the Millennium Nucleus Information and Coordination in Networks and in the Center for Biotechnology and Bioengineering.

He has been PC (co-)chair of several conferences: SPIRE 2001, SCCC 2004, SPIRE 2005, SIGIR 2005 Posters, IFIP TCS 2006, a track of ENC 2007, SISAP 2008, SISAP 2012, LATIN 2016, SPIRE 2018, and CPM 2018. He co-created SISAP in 2008, and has been a Steering Committee member of SPIRE, LATIN, and SISAP. He is a member of the Editorial Board of the journals Information Retrieval, ACM Journal of Experimental Algorithmics, and Information Systems. He has been guest editor of special issues in ACM SIGSPATIAL, Journal of Discrete Algorithms, Information Systems, and Algorithmica. He has been a PC member of more than 50 international conferences and a reviewer for about 40 international journals. He has given around 50 invited talks at universities and international conferences, including 12 plenary talks and 5 tutorials at international conferences. In 2005 he created the Workshop on Compression, Text, and Algorithms, which has become a permanent satellite of SPIRE.

He has coauthored two books published by Cambridge University Press, about 25 book chapters, 8 proceedings of international conferences (editor), more than 150 papers in international journals, and over 230 in international conferences. He is one of the most prolific and highly cited authors in Latin America.

He has advised 6 postdocs, 18 PhDs, 15 MScs, and 16 undergraduate theses.

**Veli Mäkinen**

Veli Mäkinen is Professor of Computer Science at the University of Helsinki. He heads the Genome-scale algorithmics research group and is the Director of the Master’s Programme in Computer Science. He has advised 8 postdocs and supervised 4 PhD theses and 36 Master’s theses. He is currently supervising three PhD theses.

Veli Mäkinen started his career in string algorithms and compressed data structures. As of 2019, he has over 100 publications on these and related topics, with the focus shifting towards algorithmic bioinformatics, where different high-throughput sequencing data analysis scenarios make near-linear time algorithms that work in small space an appealing target of study. Inspired by this new angle on bioinformatics algorithms, Veli Mäkinen co-authored the textbook Genome-scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing (Cambridge University Press, 2015).

The textbook serves as the main material for two algorithmic bioinformatics courses taught frequently by Veli Mäkinen. He also teaches the general algorithms courses Design and Analysis of Algorithms, String Processing Algorithms, and Data Compression Techniques, as well as projects and seminars related to these topics.

Some current research interests include studying algorithms and computational complexity when moving from sequences to variation graphs in (pan-)genomics, and studying applications of different index structures related to the variants of the Burrows-Wheeler transform.

Besides research and teaching, Veli Mäkinen has an active role in the research community. Some recent activities include co-chairing IWOCA 2016 and guest editing a special issue for Theory of Computing Systems, program committee memberships (WABI 2013-2018, MFCS 2016, RECOMB 2017-2019, SPIRE 2018, CPM 2018-2019), and a keynote talk at the ECCB 2016 workshop on Computational Pan-Genomics.

# Programme

The programme will be published here when available. All papers accepted to SPIRE will be allocated a time slot for oral presentation.

# Regular Papers Accepted

Lossless Image Compression Using List Update Algorithms

**Abstract:** We consider lossless image compression using a technique similar to bzip2 for sequential data. Given an image represented with a matrix of pixel values, we consider different approaches for linearising the image into a sequence and then encoding the sequence using the Move-To-Front list update algorithm. In both linearisation and encoding stages, we exploit the locality present in the images to achieve encodings that are as compressed as possible. We consider a few approaches, and in particular the Hilbert space-filling curve, for linearising the image. Using a natural model of locality for images introduced by Albers et al. [J. Comput. Syst. Sci. 2015], we establish the advantage of the Hilbert space-filling curve over other linearisation techniques such as row-major or column-major curves for preserving the locality during the linearisation. We also use a result by Angelopoulos and Schweitzer [J. ACM 2013] to select Move-To-Front as the best list update algorithm for encoding the linearised sequence. In summary, our theoretical results show that a combination of the Hilbert space-filling curve and Move-To-Front encoding has an advantage over other approaches. We verify this with experiments on a dataset consisting of different categories of images.

BM25 Beyond Query-Document Similarity

**Abstract:** The massive growth of information produced and shared online has made retrieving relevant documents a difficult task. Query Expansion (QE) based on term co-occurrence statistics has been widely applied in an attempt to improve retrieval effectiveness. However, selecting good expansion terms using co-occurrence graphs is challenging. In this paper, we present an adapted version of the BM25 model, which allows measuring the similarity between terms. First, a context window-based approach is applied over the entire corpus in order to construct the term co-occurrence graph. Afterward, using the proposed adapted version of BM25, candidate expansion terms are selected according to their similarity with the whole query. This measure stands out by its ability to evaluate the discriminative power of terms and select semantically related terms to the query. Experiments on two ad-hoc TREC collections (the standard Robust04 collection and the new TREC Washington Post collection) show that our proposal outperforms the baselines over three state-of-the-art IR models and leads to significant improvements in retrieval effectiveness.
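
As background, the classical query-document BM25 formula that the paper adapts can be sketched as below; this is the standard textbook scoring (the paper's term-term variant over the co-occurrence graph is not reproduced here), with the usual free parameters k1 and b:

```python
import math

# Classical BM25: sums, over query terms, an IDF weight times a
# saturated, length-normalised term frequency.
def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)         # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc.count(t)                             # term frequency in doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Documents with more occurrences of a query term score higher, but with diminishing returns governed by k1, and longer documents are penalised through b.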

Online Algorithms on Antipowers and Antiperiods

**Abstract:** The definition of antipower introduced by Fici et al. (ICALP 2016) captures the notion of being the opposite of a power: a sequence of $k$ pairwise distinct blocks of the same length.

Recently, Alamro et al. (CPM 2019) defined a string to have an antiperiod if it is a prefix of an antipower, and gave complexity bounds for the offline computation of the minimum antiperiod and all the antiperiods of a word.

In this paper, we address the same problems in the online setting. Our solutions rely on new arrays that compactly and incrementally store antiperiods and antipowers as the word grows, obtaining in the process this information for all the word’s prefixes. We show how to compute those arrays online in $O(n \log n)$ space, $O(n \log n)$ time, and $o(n^\epsilon)$ delay per character, for any constant $\epsilon>0$. Running times are worst-case and hold with high probability. We also discuss more space-efficient solutions returning the correct result with high probability, and small data structures to support random access to those arrays.

Faster Dynamic Compressed $d$-ary Relations

**Abstract:** The $k^2$-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to $d$ dimensions, where it is called a $k^d$-tree. The representation boils down to a long bitvector. We show that interpreting the $k^d$-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector as done in previous work, yields operation times that are below the lower bound of dynamic bitvectors and offers improved time performance in practice.
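
Morton codes, which key the trie view described in the abstract, simply interleave the bits of the coordinates; a minimal two-dimensional sketch (the $k=2$, $d=2$ case, not the paper's data structure):

```python
# Morton (Z-order) code for a 2D point: bit i of x goes to position 2i,
# bit i of y to position 2i+1, so nearby points share long code prefixes.
def morton2(x, y, bits=16):
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x on even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y on odd bit positions
    return code
```

A trie over these codes branches on two bits per level, which is exactly one level of the $k^2$-tree decomposition.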

A Practical Alphabet-Partitioning Rank/Select Data Structure

**Abstract:** This paper proposes a practical implementation of an alphabet-partitioning compressed data structure, which represents a string within compressed space and supports the fundamental operations rank and select efficiently. We show experimental results that indicate that our implementation outperforms the current realizations of the alphabet-partitioning approach (which is one of the most efficient approaches in practice). In particular, the time for operation select can be reduced by about 80%, using only 11% more space than current alphabet-partitioning schemes. We also show the impact of our data structure on several applications, like the intersection of inverted lists (where improvements of up to 60% are achieved, using only 2% of extra space), and the distributed-computation processing of rank and select operations. As far as we know, this is the first study about the support of rank/select operations in a distributed-computing environment.
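
The rank/select interface at the heart of such structures can be illustrated with a deliberately naive bitvector sketch (block-sampled ranks plus in-block scans; real succinct structures replace the scans with constant-time machinery, but the operations are the same):

```python
# rank1(i): number of 1s in bits[0..i).  select1(j): position of the
# j-th 1 (1-based).  Ranks are sampled once per BLOCK bits.
class BitVector:
    BLOCK = 8

    def __init__(self, bits):
        self.bits = bits
        self.samples = [0]                      # rank1 at block boundaries
        for i in range(0, len(bits), self.BLOCK):
            self.samples.append(self.samples[-1] + sum(bits[i:i + self.BLOCK]))

    def rank1(self, i):
        block = i // self.BLOCK
        return self.samples[block] + sum(self.bits[block * self.BLOCK:i])

    def select1(self, j):
        count = 0
        for pos, b in enumerate(self.bits):
            count += b
            if b and count == j:
                return pos
        return -1                               # fewer than j ones
```

Alphabet-partitioning structures reduce rank/select over a large alphabet to a few such bitvector (or small-alphabet) queries.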

Adaptive Succinctness

**Abstract:** Bit-vectors are among the most fundamental building blocks of succinct data structures. Although a random bit-string of $u$ bits is not compressible, bit-strings encountered in practice often are. We propose new measures of compressibility of bit-strings and give bit-vector data structures that support rank/select in space close to these measures.

In particular, we show that for certain static sets, one can obtain smaller encodings, breaking the already known (worst-case) information theory lower bound for representing sets. Let $C_n$ denote the class of $\binom{u}{n}$ distinct sets of $n$ elements over the universe $[1..u]$. Let also $C^{n}_{g} \subset C_n$ contain the sets whose $n$ elements are arranged in $g \le n$ runs of $\ell_i \ge 1$ successive elements each ($i=1,\dots,g$), and $C^{n}_{g,r} \subset C^{n}_{g}$ contain all sets that consist of $g$ runs of successive elements, such that $r \le g$ of them have at least 2 elements. This paper yields the following insights and contributions for rank/select succinct data structures:

– We introduce new information-theoretic lower bounds for representing sets. We prove that $|C^{n}_{g}| = \binom{u-n+1}{g}\binom{n-1}{g-1} \le \binom{u}{n}$. Thus, identifying a set $S \in C^{n}_{g}$ requires (in the worst case) at least $L_1 = \lg{|C^{n}_{g}|} = \lg\binom{u-n+1}{g} + \lg\binom{n-1}{g-1}$ bits. Moreover, we prove that $|C^{n}_{g,r}| = \binom{u-n+1}{g}\binom{n-g-1}{r-1}\binom{g}{r} \le |C^{n}_{g}|$, hence $L_2 = \lg\binom{u-n+1}{g} + \lg\binom{n-g-1}{r-1} + \lg\binom{g}{r}$ bits are necessary. This indicates that, among all sets of $n$ elements over $[1..u]$, there are some sets that can be represented more succinctly than others.

– We introduce adaptive succinct data structures, using space close to the bounds $L_1$ and $L_2$. We show that $\lg\binom{u}{g} + \lg\binom{n}{g} + O(\frac{u^\varepsilon}{n^c})$ bits of space suffice to support rank, select, and member in $O(1)$ time. We also show that $\lg\binom{u}{g} + \lg\binom{n}{r} + \lg\binom{g}{r} + O(\frac{u^\varepsilon}{n^c})$ bits of space suffice to support constant-time operations.
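
The first counting identity above is easy to sanity-check by brute force on toy parameters; this sketch (our illustration, not from the paper) enumerates all $n$-subsets of $[1..u]$ and counts those with exactly $g$ runs:

```python
from itertools import combinations
from math import comb

# Number of maximal runs of consecutive integers in a sorted set.
def runs(s):
    s = sorted(s)
    return 1 + sum(1 for a, b in zip(s, s[1:]) if b > a + 1)

# Brute-force |C^n_g|: n-subsets of [1..u] with exactly g runs.
def count_sets(u, n, g):
    return sum(1 for s in combinations(range(1, u + 1), n) if runs(s) == g)
```

For instance, `count_sets(8, 4, 2)` agrees with $\binom{u-n+1}{g}\binom{n-1}{g-1} = \binom{5}{2}\binom{3}{1}$.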

Position Bias Estimation for Unbiased Learning-to-Rank in eCommerce Search

**Abstract:** The Unbiased Learning-to-Rank framework has been recently proposed as a general approach to systematically remove biases, such as position bias, from learning-to-rank models. The method takes two steps – estimating click propensities and using them to train unbiased models. Most common methods proposed in the literature for estimating propensities involve some degree of intervention in the live search engine. An alternative approach proposed recently uses an Expectation Maximization (EM) algorithm to estimate propensities by using ranking features for estimating relevances. In this work we propose a novel method to directly estimate propensities which does not use any intervention in live search or rely on modeling relevance. Rather, we take advantage of the fact that the same query-document pair may naturally change ranks over time. This typically occurs for eCommerce search because of change of popularity of items over time, existence of time dependent ranking features, or addition or removal of items to the index (an item getting sold or a new item being listed). However, our method is general and can be applied to any search engine for which the rank of the same document may naturally change over time for the same query. We derive a simple likelihood function that depends on propensities only, and by maximizing the likelihood we are able to get estimates of the propensities. We apply this method to eBay search data to estimate click propensities for web and mobile search and compare these with estimates using the EM method. We also use simulated data to show that the method gives reliable estimates of the “true” simulated propensities. Finally, we train an unbiased learning-to-rank model for eBay search using the estimated propensities and show that it outperforms both baselines – one without position bias correction and one with position bias correction using the EM method.

Bounds and Estimates on the Average Edit Distance

**Abstract:** The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. An open problem is the exact value of the average edit distance $\alpha_k(n)\,n$ between random strings of $n$ characters from an alphabet of a given size $k$. Moreover, while it is known that for increasing $n$, $\alpha_{k}(n)$ approaches a limit $\alpha_{k}$, the exact value of this limit is unknown for any $k>1$.

This paper presents an upper bound to $\alpha_{k}$ based on the exact computation of some $\alpha_k(n)$ and a lower bound to $\alpha_{k}$ based on combinatorial arguments on edit scripts. Statistical estimates of $\alpha_{k}$ are also obtained, with analysis of error and of confidence intervals. The techniques are applied to several alphabet sizes $k$. In particular, for a binary alphabet, the rigorous bounds are $0.1742 \leq \alpha_2 \leq 0.3693$ while the obtained estimate is $\alpha_2 \approx 0.2888$; for a quaternary alphabet, $0.3598 \leq \alpha_4 \leq 0.6318$ and $\alpha_4 \approx 0.5180$. These values improve over the best published results.
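
The quantity being bounded can be approximated naively: compute the standard dynamic-programming edit distance and average it over random string pairs. This sketch is only an illustration of the definition; the paper's estimates come with a far more careful error and confidence analysis:

```python
import random

# Row-by-row DP for Levenshtein distance in O(|a| * |b|) time, O(|b|) space.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

# Monte Carlo estimate of alpha_k(n) = E[d(X, Y)] / n for random strings.
def estimate_alpha(k, n, samples=200, seed=7):
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(x, y)
    return total / (samples * n)
```

For a binary alphabet the estimate lands in the neighbourhood of the $\alpha_2$ values quoted above, though finite $n$ and sampling noise keep it rough.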

COBS: a Compact Bit-Sliced Signature Index

**Abstract:** We present COBS, a compact bit-sliced signature index, which is a cross-over between an inverted index and Bloom filters. Our target application is to index k-mers of DNA samples or q-grams from text documents and process approximate pattern matching queries on the corpus with a user-chosen coverage threshold. Query results may contain a number of false positives, which decreases exponentially with the query length and the false positive rate of the index determined at construction time. We compare COBS to seven other index software packages on 100,000 microbial DNA samples. COBS’ compact but simple data structure outperforms the other indexes in construction time and query performance, with Mantis by Pandey et al. in second place. However, different from Mantis and other previous work, COBS does not need the complete index in RAM and is thus designed to scale to larger document sets.
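
The per-sample "signature" underlying such indexes is a plain Bloom filter over the sample's k-mers; a minimal sketch of that single-filter idea follows (COBS' bit-slicing across many filters is not reproduced here, and the hash scheme is our assumption for illustration):

```python
import hashlib

# A basic Bloom filter: n_hashes positions per item; membership queries
# have no false negatives, and false positives at a rate set by m_bits.
class BloomFilter:
    def __init__(self, m_bits=1024, n_hashes=3):
        self.m, self.h = m_bits, n_hashes
        self.bits = bytearray(m_bits)

    def _positions(self, item):
        for i in range(self.h):
            d = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

# All k-mers (length-k substrings) of a sequence.
def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}
```

Querying a pattern then amounts to looking up each of its k-mers and checking the fraction of hits against the coverage threshold.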

Faster Repetition-Aware Compressed Suffix Trees based on Block Trees

**Abstract:** Suffix trees are a fundamental data structure in stringology, but their space usage, though linear, is an important problem in applications. We design and implement a new compressed suffix tree targeted to highly repetitive texts, such as large genomic collections of the same species. Our suffix tree builds on Block Trees, a recent Lempel-Ziv-bounded data structure that captures the repetitiveness of its input. We use Block Trees to compress the topology of the suffix tree, and augment the Block Tree nodes with data that speeds up suffix tree navigation.

Our compressed suffix tree is slightly larger than previous repetition-aware suffix trees based on grammars, but outperforms them in time, often by orders of magnitude. The component that represents the tree topology achieves a speed comparable to that of general-purpose compressed trees, while using 2.3–10 times less space, and might be of independent interest.

Linear Time Maximum Segmentation problems in Column Stream Model

**Abstract:** We study a lossy compression scheme linked to a biological problem about founder reconstruction: The goal in founder reconstruction is to replace a set of strings with a smaller set of founders such that the original connections are maintained as well as possible. A general formulation of this problem is NP-hard, but when limiting to reconstructions that form a segmentation of the input strings, polynomial time solutions exist. We proposed in our earlier work (WABI 2018) a linear time solution to a formulation where the minimum segment length was bounded, but it was left open whether the same running time can be obtained when the targeted compression level (number of founders) is bounded and lossiness is minimized. This optimization is captured by the Maximum Segmentation problem: Given a threshold M and a set R = {R1,…,Rm} of strings of the same length n, find a minimum cost partition P where for each segment [i,j] in P, the compression level |{Rk[i,j]: 1 <= k <= m}| is bounded from above by M. We give linear time algorithms to solve the problem for two different (compression quality) measures on P: the average length of the intervals of the partition and the length of the minimal interval of the partition. These algorithms make use of the positional Burrows-Wheeler transform and the range maximum queue, an extension of range maximum queries to the case where the input string can be operated as a queue. For the latter, we present a new solution that may be of independent interest. The solutions work in a streaming model where one column of the input strings is introduced at a time.

Weighted Shortest Common Supersequence Problem Revisited

**Abstract:** We revisit the Weighted Shortest Common Supersequence (WSCS) problem, that is, the SCS problem on weighted strings, that was introduced by Amir et al. [SPIRE 2011]. A weighted string, also known as a Position Weight Matrix (PWM), is a sequence of probability distributions over the alphabet. In the WSCS problem we are given two weighted strings $W_1$ and $W_2$ and a threshold $1/z$ on probability and are to compute the shortest (standard) string $S$ such that each of $W_1$, $W_2$ matches a subsequence of $S$ (not necessarily the same) with probability at least $1/z$. Amir et al. showed that a log-probability version of this problem is NP-complete.

We present an algorithm that solves the WSCS problem for two weighted strings of length $n$ and over a constant-sized alphabet in $O(n^2\sqrt{z} \log{z})$ time. This matches conditional lower bounds since it is known that WSCS cannot be solved in $O(n^{2-\varepsilon})$ or $O^*(z^{0.5-\varepsilon})$ time unless there is a breakthrough improving upon long-standing upper bounds for fundamental NP-hard problems (CNF-SAT and Subset Sum, respectively).

We also discover a fundamental difference between the WSCS problem and the Weighted Longest Common Subsequence (WLCS) problem that was introduced by Amir et al. [JDA 2010] by showing that WLCS cannot be solved in $n^{O(1)} f(z)$ time, for any function $f(z)$, unless $P=NP$.

Fast, Small, and Simple Document Listing on Repetitive Text Collections

**Abstract:** Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space; in contrast, there are not many analogous indexes for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length $n$ that lists the $\mathit{ndoc}$ distinct documents where a pattern of length $m$ appears in time $O(m + \mathit{ndoc} \cdot \log n)$. We exploit the repetitiveness of the *document array* (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, and store them in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.

Polynomial-Delay Enumeration of Maximal Common Subsequences

**Abstract:** A Maximal Common Subsequence (MCS) between two strings X and Y is an inclusion-maximal subsequence of both X and Y. MCSs are a natural generalization of the classical concept of Longest Common Subsequence (LCS), which can be seen as a longest MCS. We study the problem of efficiently listing all the distinct MCSs between two strings. As discussed in the paper, this problem is algorithmically challenging as the same MCS cannot be listed multiple times: for example, dynamic programming [Fraser et al., CPM 1998] incurs an exponential waste of time, and a recent algorithm for finding an MCS [Sakai, CPM 2018] does not seem to immediately extend to listing. We follow an alternative and novel graph-based approach, proposing the first output-sensitive algorithm for this problem: it takes polynomial time in n per MCS found, where n = max{|X|,|Y|}, with polynomial preprocessing time and space.

An Index for Sequencing Reads Based on The Colored de Bruijn Graph

**Abstract:** We present a succinct index for representing a collection of DNA sequencing reads. Our index is a de Bruijn graph of order k of the set, in which the path of every read is assigned a particular color. We reduce the space usage of the data structure by performing an incomplete coloring, i.e., not all the nodes in the path of a read are colored, and by reusing the same colors in those reads that do not have (k−1)-substrings in common, whenever it is possible. Additionally, we propose two algorithms that work on top of the index, one for reconstructing the reads from the graph and another for contig assembly. These algorithms exploit the color information to produce better results than uncolored de Bruijn graphs. Experimental results show that our index uses about half the space of the plain representation of the input collection and that more than 99% of the original reads can be reconstructed just from our data structure.
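
The underlying graph structure is standard: nodes are (k−1)-mers and every k-mer of a read contributes one edge, so each read spells a path. A minimal uncolored sketch (the paper's coloring and succinct encoding are omitted):

```python
from collections import defaultdict

# Node-centric de Bruijn graph of order k over a read collection.
def de_bruijn(reads, k):
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # edge: prefix -> suffix
    return graph
```

For example, with k = 3 the read `AGCT` contributes the edges `AG → GC` and `GC → CT`, and walking edges from a read's first (k−1)-mer reconstructs it when the path is unambiguous.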

Space-Efficient Merging of Succinct de Bruijn Graphs

**Abstract:** We propose a new algorithm for merging succinct representations of de Bruijn graphs introduced in [Bowe et al., WABI 2012]. Our algorithm is based on the lightweight BWT merging approach by Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014]. Our algorithm has the same asymptotic cost as the state-of-the-art tool for the same problem presented by Muggli et al. [bioRxiv 2017, 2019], but it uses less than half of its working space. A novel important feature of our algorithm, not found in any of the existing tools, is that it can compute the variable-order succinct representation of the union graph within the same asymptotic time/space bounds.

Parallel External Memory Wavelet Tree and Wavelet Matrix Construction

**Abstract:** We present the first parallel external memory wavelet tree and wavelet matrix construction algorithm. The algorithm’s throughput is nearly the same as the hard disk drives’ throughput, using six cores. We also present the fastest (parallel) semi-external construction algorithms for both wavelet trees and wavelet matrices.
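
The structure being built is a hierarchy of bitvectors: each node splits its alphabet in half and records, per character, which half it went to. A small pointer-based sketch of that hierarchy with a rank query (construction here is the naive in-memory recursion, not the external-memory algorithm of the paper):

```python
# Pointer-based wavelet tree supporting rank(c, i) = occurrences of
# symbol c in text[0..i). Each level stores one bitmap.
class WaveletTree:
    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            self.bitmap = None                  # leaf: a single symbol
            return
        mid = len(self.alphabet) // 2
        left_syms = set(self.alphabet[:mid])
        self.bitmap = [0 if c in left_syms else 1 for c in text]
        self.left = WaveletTree([c for c in text if c in left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in text if c not in left_syms],
                                 self.alphabet[mid:])

    def rank(self, c, i):
        if self.bitmap is None:
            return i
        mid = len(self.alphabet) // 2
        ones = sum(self.bitmap[:i])             # naive; succinct rank in practice
        if c in self.alphabet[:mid]:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)
```

The wavelet matrix stores the same per-level bitmaps but concatenated level-wise, which is precisely what makes level-by-level external construction natural.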

Implementing the Topological Model Succinctly

**Abstract:** We show that the topological model, a semantically rich standard to represent GIS data, can be encoded succinctly while efficiently answering a number of topology-related queries. We build on recent succinct planar graph representations so as to encode a model with $m$ edges within $4m+o(m)$ bits and answer various queries relating nodes, edges, and faces in $o(\log\log m)$ time, or any time in $\omega(\log m)$ for a few complex ones.

Direct Linear Time Construction of Parameterized Suffix and LCP Arrays for Constant Alphabets

**Abstract:** We present the first worst-case linear time algorithm that directly computes the parameterized suffix and LCP arrays for constant-sized alphabets. Previous algorithms either required quadratic time or the parameterized suffix tree to be built first. More formally, for a string over a static alphabet $\Sigma$ and a parameterized alphabet $\Pi$, our algorithm runs in $O(n\pi)$ time and $O(n)$ words of space, where $\pi$ is the number of distinct symbols of $\Pi$ in the string.

Approximation ratios of RePair, LongestMatch and Greedy on unary strings

**Abstract:** A grammar-based compressor computes for a given input w a context-free grammar that produces only w. So-called global grammar-based compressors (RePair, LongestMatch, and Greedy) achieve impressive practical compression results, but the recursive character of those algorithms makes it hard to achieve strong theoretical results. To this end, this paper studies the approximation ratio of those algorithms for unary input strings, which is strongly related to the field of addition chains. We show that in this setting, RePair and LongestMatch produce equal-size grammars that are larger than a smallest grammar by a factor of at most $\log_2(3)$. We also provide a matching lower bound. The main result of this paper is a new lower bound for Greedy of 1.348…, which improves the best known lower bound for arbitrary (not necessarily unary) input strings.

Run-Length Encoding in a Finite Universe

**Abstract:** Text compression schemes and succinct data structures usually combine sophisticated probability models with basic coding methods whose average codeword length closely matches the entropy of known distributions. In the frequent case where basic coding represents run-lengths of outcomes that have probability p, i.e., the geometric distribution Pr(i)=pⁱ(1-p), a Golomb code is an optimal instantaneous code, which has the additional advantage that codewords can be computed using only an integer parameter calculated from p, without need for a large or sophisticated data structure. Golomb coding does not, however, gracefully handle the case where run-lengths are bounded by a known integer n. In this case, codewords allocated for the case i>n are wasted. While negligible for large n, this makes Golomb coding unattractive in situations where n is recurrently small, e.g., when representing many short lists of integers drawn from limited ranges, or when the range of n is narrowed down by a recursive algorithm.

We address the problem of choosing a code for this case, considering efficiency from both information-theoretic and computational perspectives, and arrive at a simple code that allows computing a codeword using only O(1) simple computer operations and O(1) machine words. We demonstrate experimentally that the resulting representation length is very close (equal in a majority of tested cases) to that of the optimal Huffman code, to the extent that the expected difference is practically negligible. We describe an efficient branch-free implementation of encoding and decoding.
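
The unbounded baseline discussed above, classic Golomb coding, can be sketched as follows (a textbook version using strings of '0'/'1' for clarity, not the paper's bounded code): the quotient goes in unary, the remainder in truncated binary.

```python
# Golomb code with parameter m: x = q*m + r is written as q in unary
# ("1"*q + "0") followed by r in truncated binary.
def golomb_encode(x, m):
    q, r = divmod(x, m)
    b = max(1, (m - 1).bit_length())   # bits needed for a remainder
    cutoff = (1 << b) - m              # remainders below cutoff use b-1 bits
    code = "1" * q + "0"
    if r < cutoff:
        code += format(r, f"0{b - 1}b") if b > 1 else ""
    else:
        code += format(r + cutoff, f"0{b}b")
    return code

def golomb_decode(code, m):
    q = 0
    while code[q] == "1":
        q += 1
    pos = q + 1
    b = max(1, (m - 1).bit_length())
    cutoff = (1 << b) - m
    r = int(code[pos:pos + b - 1] or "0", 2) if b > 1 else 0
    if r >= cutoff:
        r = int(code[pos:pos + b], 2) - cutoff
    return q * m + r
```

The wasted codewords the abstract mentions are visible here: the unary part grows without bound even when run-lengths are known to be at most n.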

On Longest Common Property Preserved Substring Queries

**Abstract:** We revisit the problem of longest common property preserving substring queries, first proposed by Ayad et al. (SPIRE 2018, arXiv 2018). We consider a generalized and unified on-line setting, where we are given a set $X$ of $k$ strings with total length $n$ that can be pre-processed, so that, when given a query string $y$ and a positive integer $k'\leq k$, we can report the longest substring of $y$ that satisfies some specific property and is common to at least $k'$ strings in $X$. Ayad et al. considered the longest square-free substring in the on-line setting, and the longest periodic and palindromic substring in the off-line setting. In this paper, we give efficient solutions in the on-line setting for finding the longest common square, periodic, palindromic, and Lyndon substrings. More precisely, we show that $X$ can be pre-processed in $O(n)$ time to build a data structure of $O(n)$ size so that queries can be answered in $O(|y|\log\sigma)$ time and $O(1)$ working space, where $\sigma$ is the size of the alphabet, when the common substring must be a square, a periodic substring, a palindrome, or a Lyndon word.

Fast Identification of Heavy Hitters by Cached and Packed Group Testing

**Abstract:** The $\epsilon$-approximate $\phi$-heavy hitters problem is, for any element from some universe $\mathbb{U}=[0..n)$, to maintain its frequency under an arbitrary data stream of form $(x_i, \Delta_i)\in\mathbb{U}\times\mathbb{Z}$ that changes the frequency of $x_i$ by $\Delta_i$, such that one can report every element with frequency more than $\phi N$ and no element with frequency no more than $(\phi-\epsilon)N$ for $N=\sum_i \Delta_i$ and prespecified $\epsilon, \phi\in\mathbb{R}$. To solve this problem in small space, Cormode and Muthukrishnan (ACM TODS, 2005) have proposed an $O(\tau\epsilon^{-1}\lg{n})$-space probabilistic data structure with good practical performance, where $\tau=\lg{(1/(\delta\phi))}$ for any failure probability $\delta\in\mathbb{R}$. In this paper, we improve its output time from $O(\tau\epsilon^{-1}(\lg{n}+\tau))$ to $O(\tau^2\epsilon^{-1})$ for arbitrary updates ($\Delta_i\in\mathbb{Z}$ for any $i$) and its update time from $O(\tau\lg{n})$ to $O(\tau)$ for unitary updates ($\Delta_i\in\{\pm 1\}$ for any $i$) with the same space and output guarantee, by removing $\lg{n}$ terms that are basically application-specific (e.g., $64$ in tracking each IPv6 address) and not tunable unlike $\delta$, $\epsilon$, and $\phi$.
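
A related classical sketch from the same line of work, Count-Min, illustrates the frequency-estimation core of such structures (the group-testing layers that speed up heavy hitter reporting are not reproduced here; the hashing scheme below is our assumption for illustration):

```python
import random

# Count-Min sketch: depth rows of width counters; each update touches
# one counter per row, and the estimate takes the row-wise minimum, so
# it never undercounts (for non-negative updates).
class CountMin:
    def __init__(self, width=256, depth=4, seed=42):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _idx(self, row, x):
        return hash((self.salts[row], x)) % self.width

    def update(self, x, delta=1):
        for row in range(self.depth):
            self.table[row][self._idx(row, x)] += delta

    def estimate(self, x):
        return min(self.table[row][self._idx(row, x)]
                   for row in range(self.depth))
```

Reporting heavy hitters then reduces to finding the elements whose estimate exceeds the threshold, which is exactly the step the group-testing layers accelerate.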

Space- and Time-Efficient Storage of LiDAR Point Clouds

**Abstract:** LiDAR devices obtain a 3D representation of a space. Due to the large size of the resulting datasets, there already exist storage methods that use compression and offer some properties resembling those of compact data structures. Specifically, the LAZ format allows access to a given datum or portion of the data without decompressing the whole dataset, and provides indexation of the stored data. However, the LAZ format still has some drawbacks that should be addressed.

In this work, we propose a new compact data structure for the representation of a cloud of LiDAR points that supports efficient queries, providing indexing capabilities superior to those of the LAZ format.

Inducing the Lyndon Array

**Abstract:** In this paper we propose a variant of the induced suffix sorting algorithm by Nong (TOIS, 2013) that simultaneously computes the Lyndon array and the suffix array of a text in $O(n)$ time using $\sigma + O(1)$ words of working space, where $n$ is the length of the text and $\sigma$ is the alphabet size. Our result improves the previous best space requirement for linear-time computation of the Lyndon array. In fact, all known linear-time algorithms for Lyndon array computation use suffix sorting as a preprocessing step and require $O(n)$ words of working space in addition to the Lyndon array and suffix array. Experimental results with real and synthetic datasets show that our algorithm is not only space-efficient but also fast in practice.
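For readers unfamiliar with the object being computed: the Lyndon array stores, for each position $i$, the length of the longest Lyndon word starting at $i$. A quadratic brute-force definition-checker (ours, purely for illustration; the paper computes this in $O(n)$ time) looks like:

```python
def is_lyndon(w):
    # A Lyndon word is strictly smaller than every one of its proper suffixes.
    return all(w < w[i:] for i in range(1, len(w)))

def lyndon_array(s):
    # lam[i] = length of the longest Lyndon word starting at position i.
    n = len(s)
    lam = [0] * n
    for i in range(n):
        for l in range(1, n - i + 1):
            if is_lyndon(s[i:i + l]):
                lam[i] = l
    return lam
```

For example, `lyndon_array("banana")` is `[1, 2, 1, 2, 1, 1]`: the longest Lyndon word starting at position 1 is `"an"`.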

Searching Runs in Streams

**Abstract:** Runs, or maximal periodic substrings, capture the whole picture of periodic fragments in a string. Computing all runs of a string in the usual RAM model is a well-studied problem. We approach this problem in the streaming model, where input symbols arrive one by one and the available space is sublinear. We show that no streaming algorithm can identify all squares, and hence all runs, in a string. Our main contribution is a Monte Carlo algorithm which approximates the set of runs in the length-$n$ input stream $S$. Given the error parameter $\varepsilon=\varepsilon(n)\in(0,\frac12]$, the algorithm reports a set of periodic substrings such that for each run of exponent $\beta\ge 2+\varepsilon$ in $S$, a single one of its substrings is reported, with the same period and exponent at least $\beta-\varepsilon$; for runs of smaller exponent, zero or one substring of exponent at least 2 is reported. The algorithm uses $O(\frac{\log^2 n}{\varepsilon})$ words of memory and spends $O(\log n)$ operations, including dictionary operations, per symbol.
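As a reference point for what the streaming algorithm approximates, runs can be computed exactly in the RAM model by brute force: for each candidate period $p$, find the maximal stretches where the string matches a copy of itself shifted by $p$. This sketch is ours and makes no attempt at the paper's space bounds.

```python
def runs(s):
    # All runs (maximal periodic substrings of exponent >= 2) of s:
    # for each period p, scan for maximal stretches with s[x] == s[x+p];
    # since p increases, the first period recorded per interval is minimal.
    n = len(s)
    best = {}
    for p in range(1, n // 2 + 1):
        x = 0
        while x < n - p:
            if s[x] != s[x + p]:
                x += 1
                continue
            start = x
            while x < n - p and s[x] == s[x + p]:
                x += 1
            i, j = start, x + p - 1            # candidate run s[i..j]
            if j - i + 1 >= 2 * p and (i, j) not in best:
                best[(i, j)] = p
    return sorted((i, j, p) for (i, j), p in best.items())
```

For example, `runs("banana")` returns `[(1, 5, 2)]`: the single run `"anana"` with period 2.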

Compact Data Structures for Shortest Unique Substring Queries

**Abstract:** Given a string $T$ of length $n$, a substring $u = T[i..j]$ of $T$ is called a shortest unique substring (SUS) for an interval $[s, t]$ if (a) $u$ occurs exactly once in $T$, (b) $u$ contains the interval $[s, t]$ (i.e. $i \leq s \leq t \leq j$), and (c) every substring $v$ of $T$ with $|v| < |u|$ containing $[s, t]$ occurs at least twice in $T$. Given a query interval $[s, t] \subset [1, n]$, the interval SUS problem is to output all the SUSs for the interval $[s, t]$. In this article, we propose a $4n + o(n)$ bits data structure answering an interval SUS query in output-sensitive $O(occ)$ time, where $occ$ is the number of returned SUSs. Additionally, we focus on the point SUS problem, which is the interval SUS problem for $s = t$. Here, we propose a $\lceil (\log_2 3 + 1)n \rceil + o(n)$ bits data structure answering a point SUS query in the same output-sensitive time.
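A brute-force solver for the point SUS case makes the definition tangible (our sketch, 0-based indexing, returning one shortest answer; the paper's contribution is answering this from $4n + o(n)$ bits):

```python
def point_sus(T, p):
    # Shortest substring of T that contains position p (0-based) and occurs
    # exactly once in T, counting overlapping occurrences.
    n = len(T)
    best = None
    for i in range(p + 1):
        for j in range(p, n):
            u = T[i:j + 1]
            occ = sum(1 for k in range(n - len(u) + 1) if T[k:k + len(u)] == u)
            if occ == 1 and (best is None or len(u) < len(best)):
                best = u
    return best
```

For example, `point_sus("banana", 2)` returns `"ban"`: both `"ban"` and `"nan"` are unique substrings of length 3 covering position 2, and nothing shorter is unique.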

Fast Cartesian Tree Matching

**Abstract:** Cartesian tree matching is the problem of finding all substrings of a given text whose Cartesian trees are the same as that of a given pattern. So far there is one linear-time solution for Cartesian tree matching, based on the KMP algorithm. We improve the running time of the previous solution by introducing new representations. We present a framework of a binary filtration method with an efficient verification technique for Cartesian tree matching; any string matching algorithm can be used as the filtration step in our framework. We also present a SIMD solution for Cartesian tree matching suitable for short patterns. Experiments show that known string matching algorithms combined with our binary filtration and efficient verification yield algorithms with good performance for Cartesian tree matching.
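The problem statement is easy to check by brute force: build the Cartesian tree shape of every window and compare it to the pattern's. This naive sketch (ours, with the leftmost-minimum convention on ties) conveys the semantics, not the paper's filtration framework.

```python
def ct_shape(a):
    # Shape of the Cartesian tree of sequence a: the root is the position of
    # the minimum (leftmost on ties); recurse on the left and right parts.
    if not a:
        return None
    m = a.index(min(a))
    return (ct_shape(a[:m]), ct_shape(a[m + 1:]))

def cartesian_matches(text, pattern):
    # All positions i where text[i:i+m] has the pattern's Cartesian tree shape.
    m = len(pattern)
    ps = ct_shape(pattern)
    return [i for i in range(len(text) - m + 1) if ct_shape(text[i:i + m]) == ps]
```

For example, in the text `[1, 2, 3, 2, 1]` the pattern `[1, 2, 1]` matches only the window `[2, 3, 2]` at position 1, since both have their minimum first... low, high, low in relative order.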

# Short Papers Accepted

Range Shortest Unique Substring Queries

**Abstract:** Let $T[1,n]$ be a string of length $n$ and $T[i,j]$ be the substring of $T$ starting at position $i$ and ending at position $j$. A substring $T[i,j]$ of $T$ is a repeat if it occurs more than once in $T$; otherwise it is a unique substring of $T$. Repeats and unique substrings are of great interest in computational biology and in information retrieval. Given string $T$ as input, the *Shortest Unique Substring* problem is to find a shortest substring of $T$ that does not occur elsewhere in $T$. In this paper we introduce the range variant of this problem, which we call the *Range Shortest Unique Substring* problem. The task is to construct a data structure over $T$ answering the following type of online queries efficiently. Given a range $[\alpha, \beta]$, return a shortest substring $T[i,j]$ of $T$ with exactly one occurrence in $[\alpha, \beta]$. We present an $O(n\log n)$-word data structure with $O(\log_w n)$ query time, where $w=\Omega(\log n)$ is the word size. Our construction is based on a non-trivial reduction allowing us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018].
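A brute-force query evaluator clarifies the range variant. This sketch is ours (0-based, inclusive range) and assumes an occurrence counts when it lies entirely inside $[\alpha, \beta]$; the paper's structure answers the same query in $O(\log_w n)$ time.

```python
def range_sus(T, alpha, beta):
    # Shortest substring of T[alpha..beta] with exactly one occurrence that
    # lies entirely inside positions [alpha, beta] (overlaps counted).
    best = None
    for i in range(alpha, beta + 1):
        for j in range(i, beta + 1):
            u = T[i:j + 1]
            occ = sum(1 for k in range(alpha, beta - len(u) + 2)
                      if T[k:k + len(u)] == u)
            if occ == 1 and (best is None or len(u) < len(best)):
                best = u
    return best
```

For example, `range_sus("abab", 0, 3)` returns `"ba"`: every single character and `"ab"` occur twice inside the range, while `"ba"` occurs exactly once.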

SACABench: Benchmarking Suffix Array Construction

**Abstract:** We present a practical comparison of suffix array construction algorithms on modern hardware. The benchmark is conducted using our new benchmark framework SACABench, which allows for an easy deployment of publicly available implementations, simple plotting of the results, and straightforward support for including new construction algorithms. We use the framework to develop a construction algorithm running on the GPU that is competitive with the fastest parallel algorithm in our test environment.
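For contrast with the optimized algorithms such a framework benchmarks, the textbook baseline fits in one line of Python: sort the suffix start positions by the suffixes themselves (this is quadratic-plus work, purely for illustration).

```python
def suffix_array(s):
    # Naive construction: sort suffix start positions lexicographically by
    # the suffix starting there. Real SACAs achieve O(n) time.
    return sorted(range(len(s)), key=lambda i: s[i:])
```

For example, `suffix_array("banana")` is `[5, 3, 1, 0, 4, 2]`, the starting positions of `"a"`, `"ana"`, `"anana"`, `"banana"`, `"na"`, `"nana"` in sorted order.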

An Optimal Algorithm to Find Champions of Tournament Graphs

**Abstract:** A tournament graph $T = (V, E)$ is an oriented complete graph, which can be used to model a round-robin tournament between $n$ players. In this short paper, we address the problem of finding a champion of the tournament, also known as a Copeland winner, which is a player that wins the highest number of matches. Our goal is to solve the problem by minimizing the number of arc lookups, i.e., the number of matches played. We prove that finding a champion requires $\Omega(\ell n)$ comparisons, where $\ell$ is the number of matches lost by the champion, and we present a deterministic algorithm that matches this lower bound. Solving this problem has important implications for several Information Retrieval applications, including Web search, conversational AI, machine translation, question answering, and recommender systems.
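The exhaustive baseline the paper improves upon simply inspects every arc and counts wins. In this sketch (ours) the tournament is encoded as a hypothetical boolean matrix `wins`, with `wins[i][j]` true iff player `i` beats player `j`; it performs $n(n-1)$ lookups, versus the paper's $O(\ell n)$.

```python
def copeland_winner(wins):
    # Count every player's wins by looking up all arcs, then return the
    # index of a player with the maximum number of wins.
    n = len(wins)
    counts = [sum(1 for j in range(n) if j != i and wins[i][j])
              for i in range(n)]
    return counts.index(max(counts))
```

For a 3-player tournament where player 0 beats both others and player 1 beats player 2, the champion is player 0.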

A New Linear-Time Algorithm for Centroid Decomposition

**Abstract:** The centroid of a tree is a node that, when removed, breaks the tree into subtrees of size at most half that of the original tree. By recursively applying this procedure to the subtrees, one obtains the centroid decomposition of the tree, also known as the centroid tree. The centroid tree has logarithmic height, and its construction is a powerful pre-processing step in several tree-processing algorithms. The folklore recursive algorithm for computing the centroid tree runs in $O(n \log n)$ time. To the best of our knowledge, the only result claiming $O(n)$ time is unpublished and relies on (dynamic) heavy path decomposition of the original tree. In this short paper, we describe a new simple and practical linear-time algorithm for the problem, based on the idea of applying the folklore algorithm to a suitable decomposition of the original tree.
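The folklore $O(n \log n)$ algorithm mentioned above can be sketched as follows (our implementation, adjacency lists, iterative size computation to avoid deep recursion in the subtree pass):

```python
def centroid_tree(adj):
    # Folklore centroid decomposition: find a centroid, remove it, recurse
    # on the remaining components. Returns parent[] in the centroid tree.
    n = len(adj)
    removed = [False] * n
    parent = [-1] * n

    def subtree_sizes(root):
        # Iterative DFS giving subtree sizes within the current component.
        size, order, par = {}, [], {root: -1}
        stack = [root]
        while stack:
            v = stack.pop()
            order.append(v)
            for u in adj[v]:
                if u != par[v] and not removed[u]:
                    par[u] = v
                    stack.append(u)
        for v in reversed(order):        # children before parents
            size[v] = 1 + sum(size[u] for u in adj[v]
                              if u != par[v] and not removed[u])
        return size, par

    def find_centroid(root):
        # Walk towards the heavy child until no child exceeds half the size.
        size, par = subtree_sizes(root)
        total, v = size[root], root
        while True:
            heavy = next((u for u in adj[v]
                          if u != par[v] and not removed[u]
                          and size[u] > total // 2), None)
            if heavy is None:
                return v
            v = heavy

    def decompose(root, p):
        c = find_centroid(root)
        parent[c] = p
        removed[c] = True
        for u in adj[c]:
            if not removed[u]:
                decompose(u, c)

    decompose(0, -1)
    return parent
```

On the path `0-1-2`, the centroid-tree root is node 1 with nodes 0 and 2 as its children.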

Minimal Absent Words in Rooted and Unrooted Trees

**Abstract:** We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet $\Sigma$ of cardinality $\sigma$. We show that the set $\mathrm{MAW}(T)$ of minimal absent words of a rooted (resp. unrooted) tree $T$ with $n$ nodes has cardinality $O(n\sigma)$ (resp. $O(n^{2}\sigma)$), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time $O(n+|\mathrm{MAW}(T)|)$ (resp. $O(n^{2}+|\mathrm{MAW}(T)|)$).
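For readers new to the notion being lifted to trees: a minimal absent word $w$ of a string $s$ is a word that does not occur in $s$ while both $w[1..]$ and $w[..{-1}]$ do. A brute-force enumerator over strings (ours, not the paper's output-sensitive algorithm) is:

```python
def maw(s, alphabet):
    # Minimal absent words of string s: w is absent from s but both w[1:]
    # and w[:-1] occur in s. Candidates are a+f for factors f of s.
    factors = {s[i:j] for i in range(len(s)) for j in range(i, len(s) + 1)}
    out = set()
    for f in factors:
        for a in alphabet:
            w = a + f
            if w not in factors and w[1:] in factors and w[:-1] in factors:
                out.add(w)
    return sorted(out)
```

For example, `maw("abaab", "ab")` yields `["aaa", "aaba", "bab", "bb"]`.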

Rpair: Scaling up RePair with Rsync

**Abstract:** Data compression is a powerful tool for managing massive but repetitive datasets, especially compression schemes supporting computation on the data without decompression. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be applied to the original dataset quickly and in external memory. In this paper we show how we can preprocess massive repetitive datasets with a variant of the popular tool Rsync, such that afterwards we can apply RePair and other grammar-based compressors using much less time and space. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds for grammar-based compression, and finally give experimental evidence that our approach is competitive in practice.

On the Computation of Longest Previous Non-Overlapping Factors

**Abstract:** The f-factorization of a string is similar to the well-known Lempel-Ziv (LZ) factorization, but differs from it in that the factors must be non-overlapping. There are two linear-time algorithms that compute the f-factorization. Both of them compute the array of longest previous non-overlapping factors (LPnF-array), from which the f-factorization can easily be derived. In this paper, we present a simple algorithm that computes the LPnF-array from the LPF-array and an array prevOcc that stores positions of previous occurrences of LZ-factors. The algorithm has linear worst-case time complexity if prevOcc contains leftmost positions. Moreover, we provide an algorithm that computes the f-factorization directly. Experiments show that our first algorithm (combined with efficient LPF algorithms) is the fastest, and our second algorithm the most space-efficient, method to compute the f-factorization.
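The array in question is easy to define by brute force (our cubic sketch, 0-based; the paper's algorithms are linear): `lpnf[i]` is the largest `l` such that `s[i:i+l]` also occurs at some earlier position `j` with `j + l <= i`, i.e. without overlapping position `i`.

```python
def lpnf(s):
    # Longest previous non-overlapping factor at each position, by trying
    # every earlier start j and extending the match while it stays left of i.
    n = len(s)
    out = [0] * n
    for i in range(n):
        for j in range(i):
            l = 0
            while i + l < n and j + l < i and s[j + l] == s[i + l]:
                l += 1
            out[i] = max(out[i], l)
    return out
```

For example, `lpnf("abaab")` is `[0, 0, 1, 2, 1]`: at position 3, `"ab"` occurs earlier at position 0 without overlap.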

Network-Based Pooling for Topic Modeling on Microblog Content

**Abstract:** This paper investigates a new tweet-pooling method based on network structures associated with Twitter content. Using a standard formulation of the well-known Latent Dirichlet Allocation (LDA) topic model, we trained various models using different tweet-pooling schemes on three diverse Twitter datasets. Tweet-pooling schemes were created based on mention/reply relationships between tweets and Twitter users, with several (non-networked) established methods also tested as a comparison. Results show that pooling tweets using network information gives better topic coherence and clustering performance than other pooling schemes, on the majority of datasets tested. Our findings contribute to an improved methodology for topic modeling with Twitter content.

# Social Programme: Guided visit and Conference Banquet

[Additional info will be included here soon]

On Monday 7th, there will be a guided tour around the city, starting at «Plaza del Azoguejo» (main view of the Aqueduct) at 17:30h and ending at «Plaza Mayor» around 20:30h. We will visit the most representative areas, monuments, and places of the city, including, among others, the Aqueduct, the Jewish quarter, a **guided visit to the Alcázar** (the beautiful fortress on the north-west side of the city), the Cathedral, and Plaza Mayor.

On Tuesday 8th, we will enjoy the conference banquet at «Restaurante Hidalgo» at 21:00h.