kache-hash: A dynamic, concurrent, and cache-efficient hash table for streaming k-mer operations

Avatar
Poster
Voice is AI-generated
Connected to paperThis paper is a preprint and has not been certified by peer review

kache-hash: A dynamic, concurrent, and cache-efficient hash table for streaming k-mer operations

Authors

Khan, J.; Patro, R.; Pandey, P.

Abstract

Motivation: Hash tables are fundamental to computational genomics, where keys are often k-mers---fixed-length substrings that exhibit a ''streaming'' property: consecutive k-mers share k - 1 nucleotides and are processed in order. Existing static data structures exploit this locality but cannot support dynamic updates, while state-of-the-art concurrent hash tables support dynamic operations but ignore k-mer locality. Results: We introduce kache-hash, the first dynamic, concurrent, and resizable hash table that exploits k-mer locality. kache-hash builds on Iceberg hashing---a multi-level design achieving stability and low associativity---but replaces generic hashing with minimizer-based hashing, ensuring that consecutive k-mers map to the same buckets. This keeps frequently accessed buckets cache-resident during streaming operations. On the human genome, kache-hash achieves 1.58---2.62x higher insertion throughput than IcebergHT and up to 6.1x higher query throughput, while incurring 7.39x fewer cache misses. kache-hash scales near-linearly to 16 threads and supports dynamic resizing without sacrificing locality. Our theoretical analysis proves that streaming k-mer operations achieve O(1/r) amortized cache misses per operation, where r is the minimizer run length, explaining the substantial performance gains over general-purpose hash tables. Availability: kache-hash is implemented in C++20 and is available at https://github.com/jamshed/kache-hash. Contact: [email protected] Supplementary information: Supplementary material are available for this manuscript.

Follow Us on

0 comments

Add comment