Computer Science and Information Engineering, World Congress on (2009)

Los Angeles, California USA

Mar. 31, 2009 to Apr. 2, 2009

ISBN: 978-0-7695-3507-4

pp: 106-111

ABSTRACT

Hashing long strings is difficult, especially when the alphabet is small. Chess and Go board hashing has almost always been accomplished by using (letter, position) pairs to index into a table of random numbers, which are exclusive-ORed together to create the hash value. Because varying any bit of any random number in the table yields a different hash function, the table is a huge source of hash functions. Algorithms are developed here that can find hashes that are perfect, minimal, and even ordered for very large cases. The human genome is a rich source of long strings over a small alphabet, so it is used as a test case here. An algorithm is presented that can solve for an ordered minimal perfect hash for the genome. It can also solve for the weaker cases of minimal perfect hash and perfect hash at higher speed. A statistical criterion is derived for obtaining the ordered minimal perfect hash with high probability. The algorithm and the statistical criterion lead to a duplicate-finding algorithm that might prove to be the fastest for important cases.
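The table-lookup hashing described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the four-letter DNA alphabet, the 32-bit width, the seed, and the names `make_table`, `zobrist_hash`, and `is_perfect` are all assumptions made for the example. It only builds the random table, XORs the entries selected by each (letter, position) pair, and checks a key set for collisions; it does not attempt the search for minimal or ordered hashes.

```python
import random

ALPHABET = "ACGT"  # assumed small alphabet (DNA bases), for illustration only


def make_table(length, bits=32, seed=0):
    """Build one random number per (position, letter) pair."""
    rng = random.Random(seed)
    return [{ch: rng.getrandbits(bits) for ch in ALPHABET} for _ in range(length)]


def zobrist_hash(s, table):
    """XOR together the table entries selected by each (letter, position) pair."""
    h = 0
    for pos, ch in enumerate(s):
        h ^= table[pos][ch]
    return h


def is_perfect(keys, table):
    """Return True if no two keys in the list hash to the same value."""
    seen = set()
    for key in keys:
        h = zobrist_hash(key, table)
        if h in seen:
            return False
        seen.add(h)
    return True


table = make_table(length=4)
h_before = zobrist_hash("AAAA", table)
# Changing one letter only requires XORing out the old entry and XORing in the new one.
h_after = h_before ^ table[0]["A"] ^ table[0]["C"]
```

Here `h_after` equals `zobrist_hash("CAAA", table)`, which is the incremental-update property that makes this scheme attractive for game boards. A hash is perfect for a given key set when `is_perfect` returns True; changing the seed, or any bit of any table entry, gives a different candidate hash function to test.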

CITATION

A. L. Zobrist, "Ordered Minimal Perfect Hash of the Human Genome and Implications for Duplicate Finding," *2009 WRI World Congress on Computer Science and Information Engineering (CSIE)*, Los Angeles, CA, 2009, pp. 106-111.

doi:10.1109/CSIE.2009.1070
