A data structure for a sequence of string accesses in external memory

Publisher
Association for Computing Machinery
Copyright
The ACM Portal is published by the Association for Computing Machinery. Copyright © 2010 ACM, Inc.
Subject
Pattern matching
ISSN
1549-6325
DOI
10.1145/1186810.1186816

Abstract

We introduce a new paradigm for querying strings in external memory, suited to the execution of sequences of operations. Formally, given a dictionary of $n$ strings $S_1, \ldots, S_n$, we aim at supporting a search sequence for $m$ not necessarily distinct strings $T_1, T_2, \ldots, T_m$, as well as inserting and deleting individual strings. The dictionary is stored on disk, where each access to a disk page fetches $B$ items, the cost of an operation is the number of pages accessed (I/Os), and efficiency must be attained on entire sequences of string operations rather than on individual ones. Our approach relies on a novel and conceptually simple self-adjusting data structure (SASL) based on skip lists, that is also interesting per se. The search for the whole sequence $T_1, T_2, \ldots, T_m$ can be done in an expected number of I/Os: $O\left(\sum_{j=1}^{m} |T_j|/B + \sum_{i=1}^{n} n_i \log_B (m/n_i)\right)$, where each $T_j$ may or may not be present in the dictionary, and $n_i$ is the number of times $S_i$ is queried (i.e., the number of $T_j$s equal to $S_i$). Moreover, inserting or deleting a string $S_i$ takes an expected amortized number $O(|S_i|/B + \log_B n)$ of I/Os. The term $\sum_{j=1}^{m} |T_j|/B$ in the search formula is a lower bound for reading the input, and the term $\sum_{i=1}^{n} n_i \log_B (m/n_i)$ (entropy of the query sequence) is a standard information-theoretic lower bound. We regard this result as the static optimality theorem for external-memory string access, as compared to Sleator and Tarjan's classical theorem for numerical dictionaries [Sleator and Tarjan 1985]. Finally, we reformulate the search bound if a cache is available, taking advantage of common prefixes among the strings examined in the search.
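To make the two terms of the search bound concrete, the following minimal sketch evaluates them for a given query sequence. It is not the SASL data structure from the paper, only arithmetic on the formula stated in the abstract, with the constant hidden by the O-notation ignored. The function name `search_bound_terms` and its parameters are hypothetical, and string length is measured in characters as a stand-in for "items".

```python
import math
from collections import Counter


def search_bound_terms(queries, dictionary, B):
    """Evaluate the two terms of the search bound, ignoring hidden constants.

    queries    -- the search sequence T_1, ..., T_m (list of str)
    dictionary -- the stored strings S_1, ..., S_n (iterable of str)
    B          -- number of items fetched per disk page (assumed >= 2)

    Returns (scan_term, entropy_term), where
      scan_term    = sum_j |T_j| / B
      entropy_term = sum_i n_i * log_B(m / n_i)
    """
    if B < 2:
        raise ValueError("B must be at least 2 so that log base B is defined")

    m = len(queries)

    # First term: cost of reading every query string once.
    scan_term = sum(len(t) for t in queries) / B

    # n_i = number of queries equal to dictionary string S_i.
    # Queries absent from the dictionary only count toward m and the scan term.
    members = set(dictionary)
    counts = Counter(t for t in queries if t in members)

    # Second term: strings that are never queried (n_i = 0) contribute nothing.
    entropy_term = sum(n_i * math.log(m / n_i, B) for n_i in counts.values())

    return scan_term, entropy_term


if __name__ == "__main__":
    dictionary = ["aa", "bb", "cc", "dd"]
    uniform = ["aa", "bb", "cc", "dd"]   # every string queried once
    skewed = ["aa", "aa", "aa", "aa"]    # one string queried every time
    print(search_bound_terms(uniform, dictionary, B=2))
    print(search_bound_terms(skewed, dictionary, B=2))
```

In this sketch a sequence that repeats a single string has an entropy term of zero, because $m/n_i = 1$ for that string, while a uniform sequence of the same length gives a strictly larger entropy term. This matches the intent of a static optimality bound: skewed access sequences are cheaper than uniform ones.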

Journal

ACM Transactions on Algorithms (TALG), Association for Computing Machinery

Published: Feb 1, 2007
