TY - GEN
T1 - How caching affects hashing
AU - Heileman, Gregory L.
AU - Luo, Wenbin
PY - 2005
Y1 - 2005
N2 - A number of recent papers have considered the influence of modern computer memory hierarchies on the performance of hashing algorithms [1, 2, 3]. Motivation for these papers is drawn from recent technology trends that have produced an ever-widening gap between the speed of CPUs and the latency of dynamic random access memories. The result is an emerging computing folklore which contends that inferior hash functions, in terms of the number of collisions they produce, may in fact lead to superior performance because these collisions mainly occur in cache rather than main memory. This line of reasoning is the antithesis of that used to justify most of the improvements that have been proposed for open address hashing over the past forty years. Such improvements have generally sought to minimize collisions by spreading data elements more randomly through the hash table. Indeed, the name "hashing" itself is meant to convey this notion [12]. However, the very act of spreading the data elements throughout the table negatively impacts their degree of spatial locality in computer memory, thereby increasing the likelihood of cache misses during long probe sequences. In this paper we study the performance trade-offs that exist when implementing open address hash functions on contemporary computers. Experimental analyses are reported that make use of a variety of different hash functions, ranging from linear probing to highly "chaotic" forms of double hashing, using data sets that are justified through information-theoretic analyses. Our results, which clarify and extend those in a number of recently published papers, show that the savings gained by reducing collisions (and therefore probe sequence lengths) usually compensate for any increase in cache misses. That is, due to primary clustering, the variability in the performance of linear probing can be extreme, in the worst case producing running times orders of magnitude worse than double hashing, and in the best case yielding running times only slightly better than double hashing.
AB - A number of recent papers have considered the influence of modern computer memory hierarchies on the performance of hashing algorithms [1, 2, 3]. Motivation for these papers is drawn from recent technology trends that have produced an ever-widening gap between the speed of CPUs and the latency of dynamic random access memories. The result is an emerging computing folklore which contends that inferior hash functions, in terms of the number of collisions they produce, may in fact lead to superior performance because these collisions mainly occur in cache rather than main memory. This line of reasoning is the antithesis of that used to justify most of the improvements that have been proposed for open address hashing over the past forty years. Such improvements have generally sought to minimize collisions by spreading data elements more randomly through the hash table. Indeed, the name "hashing" itself is meant to convey this notion [12]. However, the very act of spreading the data elements throughout the table negatively impacts their degree of spatial locality in computer memory, thereby increasing the likelihood of cache misses during long probe sequences. In this paper we study the performance trade-offs that exist when implementing open address hash functions on contemporary computers. Experimental analyses are reported that make use of a variety of different hash functions, ranging from linear probing to highly "chaotic" forms of double hashing, using data sets that are justified through information-theoretic analyses. Our results, which clarify and extend those in a number of recently published papers, show that the savings gained by reducing collisions (and therefore probe sequence lengths) usually compensate for any increase in cache misses. That is, due to primary clustering, the variability in the performance of linear probing can be extreme, in the worst case producing running times orders of magnitude worse than double hashing, and in the best case yielding running times only slightly better than double hashing.
UR - http://www.scopus.com/inward/record.url?scp=32144436736&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=32144436736&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:32144436736
SN - 0898715962
SN - 9780898715965
T3 - Proceedings of the Seventh Workshop on Algorithm Engineering and Experiments and the Second Workshop on Analytic Algorithms and Combinatorics
SP - 141
EP - 154
BT - Proceedings of the Seventh Workshop on Algorithm Engineering and Experiments and the Second Workshop on Analytic Algorithms and Combinatorics
A2 - Demetrescu, C.
A2 - Sedgewick, R.
A2 - Tamassia, R.
T2 - Seventh Workshop on Algorithm Engineering and Experiments and the Second Workshop on Analytic Algorithms and Combinatorics
Y2 - 22 January 2005 through 22 January 2005
ER -