The Sequence Memoizer

By Frank Wood, Jan Gasthaus, Cédric Archambeau, Lancelot James, and Yee Whye Teh

Abstract

Probabilistic models of sequences play a central role in most machine translation, automated speech recognition, lossless compression, spell-checking, and gene identification applications, to name but a few. Unfortunately, real-world sequence data often exhibit long-range dependencies which can only be captured by computationally challenging, complex models. Sequence data arising from natural processes also often exhibit power-law properties, yet common sequence models do not capture such properties. The sequence memoizer is a new hierarchical Bayesian model for discrete sequence data that captures long-range dependencies and power-law characteristics, while remaining computationally attractive. Its utility as a language model and general-purpose lossless compressor is demonstrated.

1. Introduction

It is an age-old quest to predict what comes next in sequences. Fortunes have been made and lost on the success and failure of such predictions. Heads or tails? Will the stock market go up by 5% tomorrow? Is the next card drawn from the deck going to be an ace? Does a particular sequence of nucleotides appear more often than usual in a DNA sequence? In a sentence, is the word that follows the