The number of biological sequences in genomic databases such as GenBank has increased exponentially during the past decade. Sequence retrieval systems are required to quickly and efficiently find sequences that are related to a query sequence. Several comparison algorithms, which generally rely on the existence of local string similarities between the query and the database sequences, have been widely used and accepted as the basis for bio-sequence retrieval from DNA sequence databases. In this paper we describe a new method for sequence comparison based on k-mer word frequency profiles. In this algorithm, the distributions of the k-mer words found in the two sequences, defined as the sequences' profiles, are treated as the signatures of the sequences. This representation enables us to compare sequence similarity using divergence measures based on Shannon's entropy. A profile-based search of the primate section of GenBank (GB-PRI, comprising approximately 114,000 DNA sequences) was performed using this approach. The results obtained establish the significance and validity of a profile-based genomic sequence retrieval algorithm.
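
To make the idea of a profile comparison concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes overlapping k-mers, relative frequencies as the profile, and the Jensen-Shannon divergence as one example of a Shannon-entropy-based divergence measure; the abstract does not say which measure or which value of k was used. The function names (`kmer_profile`, `js_divergence`, `rank_database`) and the default `k=3` are hypothetical choices made for illustration.

```python
from collections import Counter
from math import log2


def kmer_profile(seq, k=3):
    """Relative frequency of every overlapping k-mer word in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return {word: n / total for word, n in counts.items()}


def shannon_entropy(profile):
    """Shannon entropy (in bits) of a frequency profile."""
    return -sum(p * log2(p) for p in profile.values() if p > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two profiles (assumed measure).

    Computed as H(M) - (H(P) + H(Q)) / 2, where M is the equal-weight
    mixture of P and Q. With base-2 logarithms the value lies in [0, 1].
    """
    words = set(p) | set(q)
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}
    return shannon_entropy(mix) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))


def rank_database(query, database, k=3):
    """Sort database entries by increasing divergence from the query.

    database is a hypothetical mapping of sequence name to DNA string.
    """
    query_profile = kmer_profile(query, k)
    scored = [
        (js_divergence(query_profile, kmer_profile(seq, k)), name)
        for name, seq in database.items()
    ]
    return sorted(scored)
```

In this sketch a smaller divergence means more similar k-mer usage, so the first entries returned by `rank_database` are the closest candidates. Because a profile discards the positions of the words, two sequences can score as similar without sharing a long local alignment, which is the main difference from the local string similarity methods mentioned above.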