Why I Avoid Using Hash Tables
Most of the programming languages I have used provide two dictionary types: one implemented with a hash table, and another implemented with a search tree. In C#, for example, those are Dictionary and SortedDictionary.
It’s well established that hash tables are faster, and all the evidence I’ve seen agrees. So you may be surprised to learn that when I write code, I avoid hash tables. Instead, I use search trees and I only switch to hash tables if I have evidence (a profiler’s report) that switching is necessary. My reasons follow.
Hash tables summon chaos into your program.
Have you ever tried debugging an issue that happens 1 out of 100 tries? Or only happens in production, but not test environments? Or only happens on one machine that you don’t have access to? If you haven’t, then you must be new here; welcome to software engineering.
These rarely occurring issues are difficult and frustrating to track down and debug. They’re the result of non-deterministic behavior, which I’m calling chaos because it’s more sensational and will hopefully attract more readers.
There are lots of sources of chaos. Some examples are the computer’s clock, input and output, memory allocation patterns, and hash tables.
How do hash tables summon chaos? Consider the following code which populates two hash tables with the same Products, and then tests the hash tables for equality:
var a = new Dictionary<Product, decimal>
{
    { new Product { ProductId = 1, VendorName = "Contoso" }, 100m },
    { new Product { ProductId = 2, VendorName = "Fabrikam" }, 150m }
};
var b = new Dictionary<Product, decimal>
{
    { new Product { ProductId = 2, VendorName = "Fabrikam" }, 150m },
    { new Product { ProductId = 1, VendorName = "Contoso" }, 100m }
};

Console.WriteLine(".Equals() {0}", a.Equals(b));
Console.WriteLine(".SequenceEqual() {0}", a.SequenceEqual(b));

// Type declarations must come after the top-level statements.
record struct Product(int ProductId, string VendorName);
What will this code print?
The correct answers are Line 1: False and Line 2: impossible to know.
.Equals() always returns False because C# is C#: Dictionary doesn’t override Equals, so the comparison falls back to reference equality, and two distinct instances are never equal. It’s not what I hoped for, but at least it’s consistent.
.SequenceEqual() enumerates all the items in the two dictionaries and compares them pairwise. They may be enumerated in the same order, or they may not be. When they’re enumerated in the same order, the code will print True. When they’re not, the code will print False. The docs say:
For purposes of enumeration, each item in the dictionary is treated as a KeyValuePair<TKey,TValue> structure representing a value and its key. The order in which the items are returned is undefined.
Undefined behavior is chaos. I don’t want chaos in my software. The better I can predict how my code will behave, the less time I’ll spend debugging it.
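For contrast, here’s a minimal sketch (mine, reusing the Product record from above) of the same comparison done with a SortedDictionary. A search tree needs an ordering for its keys, so I have to supply a comparer; in exchange, both dictionaries always enumerate in that order and SequenceEqual becomes deterministic:

// This comparer is my own addition; Product has no built-in ordering.
// Order by ProductId, then by VendorName.
var comparer = Comparer<Product>.Create((x, y) =>
    x.ProductId != y.ProductId
        ? x.ProductId.CompareTo(y.ProductId)
        : string.Compare(x.VendorName, y.VendorName, StringComparison.Ordinal));

var c = new SortedDictionary<Product, decimal>(comparer)
{
    { new Product { ProductId = 1, VendorName = "Contoso" }, 100m },
    { new Product { ProductId = 2, VendorName = "Fabrikam" }, 150m }
};
var d = new SortedDictionary<Product, decimal>(comparer)
{
    { new Product { ProductId = 2, VendorName = "Fabrikam" }, 150m },
    { new Product { ProductId = 1, VendorName = "Contoso" }, 100m }
};

// Both enumerate in key order, so this always prints True.
Console.WriteLine(".SequenceEqual() {0}", c.SequenceEqual(d));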
In practice, how does this chaos cause harm? Imagine I write a function and a test. The function dumps a hash table to a CSV file, and the test confirms the resulting CSV file looks as expected.
static void WritePricesToCsv(Dictionary<Product, decimal> prices,
                             string filename)
{
    using (var writer = new StreamWriter(filename))
    {
        foreach (var price in prices)
        {
            writer.WriteLine("{0},{1},{2}", price.Key.ProductId,
                price.Key.VendorName, price.Value);
        }
    }
}
[Test]
public void TestWritePricesToCsv()
{
    // Arrange
    var prices = new Dictionary<Product, decimal>
    {
        { new Product { ProductId = 1, VendorName = "Vendor1" }, 100m },
        { new Product { ProductId = 2, VendorName = "Vendor2" }, 200m }
    };
    var filename = "test.csv";
    var expectedContent = "1,Vendor1,100\r\n2,Vendor2,200\r\n";

    // Act
    WritePricesToCsv(prices, filename);

    // Assert
    var actualContent = File.ReadAllText(filename);
    Assert.AreEqual(expectedContent, actualContent);

    // Cleanup
    File.Delete(filename);
}
I merge the code and the test passes for the next 10 years. Then I leave the project, some poor sap takes ownership of my code, and the test starts failing because the foreach loop spontaneously starts listing the products in reverse order. It’s 100% allowed to do that because the documentation says the order of enumeration is undefined. To make matters worse, the test happens to fail only when it runs on the CI/CD machine; it always passes on the poor sap’s machine. It’ll probably take the poor sap a day to find the issue. What a frustrating waste of time.
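Had the test been written against a search tree from the start, the enumeration order would have been defined and the poor sap would have been spared. Here’s a sketch of that version (mine, assuming WritePricesToCsv is loosened to accept IDictionary<Product, decimal>):

// Arrange with a SortedDictionary so the enumeration order, and therefore the
// CSV line order, is defined by the key comparer rather than by hashing.
// The ProductId-based comparer is my own addition.
var byProductId = Comparer<Product>.Create((x, y) => x.ProductId.CompareTo(y.ProductId));
var prices = new SortedDictionary<Product, decimal>(byProductId)
{
    { new Product { ProductId = 1, VendorName = "Vendor1" }, 100m },
    { new Product { ProductId = 2, VendorName = "Vendor2" }, 200m }
};

// Now "1,Vendor1,100\r\n2,Vendor2,200\r\n" is guaranteed rather than accidental.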
So sometimes some tests fail. That’s not the end of the world. But it can get much worse. The next example is a lesson I learned the hard way.
Imagine I’m building a microservice, and my microservice calls a backend Price Server periodically to fetch the latest prices.
The code looks something like:
void RefreshPrices(Dictionary<Product, decimal> prices) {
    var request = new FetchPricesRequest();
    foreach (var price in prices) {
        request.Add(price.Key);
    }
    var response = FetchPrices(request);
    ...
Now imagine there are a million instances of my microservice, all fetching prices from the same Price Server. Because the load is so heavy, I add a caching proxy between my microservice instances and the Price Server.
The proxy serves 99% of the requests from its cache. If all the requests missed the cache and hit the Price Server directly, they’d overwhelm it.
Everything is running smoothly in production. Then, one day, for no apparent reason, the caching proxy which was serving 99% of requests from its cache instead starts serving 2% of requests from its cache and forwarding the other 98% of requests to the Price Server. The Price Server is overwhelmed and crashes.
Why did the caching proxy suddenly fail to find requests in its cache? Because even though all the FetchPricesRequests contained the same Products, they contained them in different orders. To the proxy, requests with the Products in different orders are different requests. A request with products [a, b] can’t be served from a cache entry with products [b, a]. The proxy must forward the request to the Price Server.
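To make the failure concrete, here’s a toy sketch (mine; CacheKey is a hypothetical stand-in for whatever the proxy really keys its cache on) showing that the same set of products, enumerated in a different order, produces a different cache key:

// Build a cache key from the request's product list, in the order the products
// were added.
string CacheKey(IEnumerable<Product> products) =>
    string.Join(";", products.Select(p => $"{p.ProductId}:{p.VendorName}"));

var fromInstanceA = new[] { new Product(1, "Contoso"), new Product(2, "Fabrikam") };
var fromInstanceB = new[] { new Product(2, "Fabrikam"), new Product(1, "Contoso") };

// Same products, different enumeration order, different keys: a cache miss.
Console.WriteLine(CacheKey(fromInstanceA) == CacheKey(fromInstanceB)); // False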
Had I used a SortedDictionary (search tree) instead of a Dictionary (hash table), the failures above would never have happened. Requests with the same lists of products would always carry those products in the same order, and therefore be served from the cache.
Hash tables are vulnerable to denial of service (DoS) attacks.
One type of DoS attack on hash tables is known as a “collision attack.”
Hash tables use a hash function to map keys to indices in an array (buckets). Ideally, each unique key should map to a unique index, but due to the finite size of the array and the practically infinite number of possible keys, collisions happen.
A collision happens when two or more keys are hashed to the same index. The hash table in the diagram above has a single collision at index 4 between the items C and D.
In the worst-case scenario, all the keys hash to the same index. The performance of a hash table lookup degrades from O(1) constant time to O(n) linear time, and the hash table performs no better than a linear search through an array. The diagram below shows a hash table in this bad state.
An attacker attempting a DoS attack tries to get my hash table into this worst case scenario.
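You can reproduce that worst case locally with a deliberately terrible comparer (a toy sketch of mine, not an actual attack):

var table = new Dictionary<string, int>(new WorstCaseComparer());
for (int i = 0; i < 20_000; i++)
{
    // Every key lands in the same bucket, so each insert scans that one
    // ever-growing bucket: roughly O(n) per operation. Even this modest loop
    // is noticeably slower than a normal fill.
    table[$"product-{i}"] = i;
}

// An equality comparer whose GetHashCode always returns 0 forces every key to collide.
class WorstCaseComparer : IEqualityComparer<string>
{
    public bool Equals(string? x, string? y) => string.Equals(x, y, StringComparison.Ordinal);
    public int GetHashCode(string s) => 0;
}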
Imagine my web service allows customers to define new products. They provide the product name, and I keep the product information in a hash table with the product name as the key. An attacker can contrive a list of one million product names that all yield the same index when passed to my hash function. It takes some guesswork for the attacker to figure out which hashing algorithm I use for strings. Lucky for the attacker, I used the default string hash function built into C#, which the attacker figures out very quickly. The attacker then begins adding one million products with colliding names to my web service. This causes my web service to burn 100% CPU doing linear searches and insertions into the hash table. My web service is too busy to serve my other customers’ requests. The DoS attack succeeds.
Had I chosen a search tree instead of a hash table, this attack would have had minimal effect on my service. A lookup in a balanced search tree never costs more than O(log n), no matter what keys the attacker chooses. My web service would have continued serving other customers while the attacker defined a million new products.
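A sketch of what that looks like (mine, with hypothetical data): key the catalog by product name in a SortedDictionary, where every operation is comparison-based and the attacker’s choice of names can’t change the asymptotics:

// Lookups and inserts walk a balanced tree, so they cost O(log n) no matter
// what names the attacker submits.
var productsByName = new SortedDictionary<string, Product>(StringComparer.Ordinal);
productsByName["widget"] = new Product(1, "Contoso");
bool known = productsByName.ContainsKey("widget");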
Hash tables leak memory.
Hash tables leak memory when you remove items from them. To demonstrate, I wrote some code that fills a hash table with 100,000 items, then clears the table so it contains 0 items, and prints the size of heap memory after each step.
void FillPrices(IDictionary<Product, decimal> prices)
{
    // Print the initial heap size.
    var startingMemory = GC.GetTotalMemory(true) / 1024;
    Console.WriteLine("initial memory: {0}", startingMemory);

    var now = DateTime.Now;

    // Fill the prices dictionary with 100,000 products.
    for (int i = 0; i < 100000; i++)
    {
        prices.Add(new Product { ProductId = i, VendorName = RandomName() }, i);
    }

    // Print the heap size after filling the dictionary.
    var memory = GC.GetTotalMemory(true) / 1024;
    Console.WriteLine("memory for {0} items: +{1}", prices.Count,
        memory - startingMemory);

    // Clear the dictionary and print the heap size.
    prices.Clear();
    var elapsed = DateTime.Now - now;
    memory = GC.GetTotalMemory(true) / 1024;
    Console.WriteLine("memory for {0} items: +{1}",
        prices.Count, memory - startingMemory);

    // Refill it one more time and print the heap size.
    for (int i = 0; i < 100000; i++)
    {
        prices.Add(new Product { ProductId = i, VendorName = RandomName() }, i);
    }
    memory = GC.GetTotalMemory(true) / 1024;
    Console.WriteLine("memory for {0} items: +{1}", prices.Count,
        memory - startingMemory);

    // Print the time it took to fill and clear the dictionary.
    Console.WriteLine("elapsed: {0}", elapsed);
}
I ran the code and here’s the output:
initial memory: 73
memory for 100000 items: +10658
memory for 0 items: +6752
memory for 100000 items: +10659
elapsed: 00:00:00.1015928
I’m not surprised to see the hash table consumes 10 MB when it contains 100,000 items. I am surprised to see the hash table still consuming 6 MB when it contains 0 items. To be fair, the memory isn’t lost forever. It does get reused when more items are added to the Dictionary. But will my application actually refill the dictionary or just waste space with an empty dictionary? Answering that question requires a thorough investigation of all the code paths that touch the dictionary, and I’m lazy and I don’t want to do that.
Compare this to performing the same operations on a SortedDictionary (search tree) instead of a Dictionary:
initial memory: 10732
memory for 100000 items: +10937
memory for 0 items: +0
memory for 100000 items: +10937
elapsed: 00:00:00.2382486
After clearing the SortedDictionary, the garbage collector was able to reclaim all of the memory. I frequently work on WebAssembly applications where memory is constrained, so SortedDictionary’s tidy memory habits are much more attractive to me.
Why do hash tables that contain 0 items consume so much memory? Earlier I described how hash tables use a hash function to map keys to indices in an array (buckets). As items are added to the hash table, buckets are added and the array grows. As items are removed from the hash table, buckets are dropped but the array never shrinks. In theory, the hash table could shrink the array, but in practice, I haven’t seen that happen. 100 Go Mistakes and How to Avoid Them specifically calls out this issue as #28. Have you seen a hash table actually shrink its array? Please leave a comment and tell me where.
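For completeness: .NET’s Dictionary won’t shrink on its own, but it does let me shrink it by hand. A sketch of that mitigation (assuming .NET Core 2.1 or later, where Dictionary<TKey,TValue>.TrimExcess is available):

var prices = new Dictionary<Product, decimal>();
// ... fill it with 100,000 items and use it ...
prices.Clear();
// Releases the oversized bucket and entry arrays, but only if I remember to
// call it on every code path that empties the dictionary.
prices.TrimExcess();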
But hash tables are faster!
Yes, they are, and the elapsed lines in the printouts above support that claim.
In the applications I’ve worked on, the slightly slower performance of a search tree has never been visible in a profiler. Something else, a compression algorithm for example, has always been burning up so much CPU that the performance of the search tree had no impact. Your mileage may vary.
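If you want to see the gap for yourself, here’s a rough spot check (my own sketch; for anything serious I’d reach for a proper benchmarking harness):

using System.Diagnostics;

// Time the same fill-then-lookup workload against both structures.
long TimeLookups(IDictionary<int, int> dict)
{
    for (int i = 0; i < 100_000; i++) dict[i] = i;

    var sw = Stopwatch.StartNew();
    long sum = 0;
    for (int i = 0; i < 100_000; i++) sum += dict[i];
    sw.Stop();

    Console.WriteLine("checksum {0}", sum); // keeps the loop from being optimized away
    return sw.ElapsedMilliseconds;
}

Console.WriteLine("Dictionary:       {0} ms", TimeLookups(new Dictionary<int, int>()));
Console.WriteLine("SortedDictionary: {0} ms", TimeLookups(new SortedDictionary<int, int>()));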
Conclusion
If I’m trying to win a contest on leetcode.com, I’ll choose a hash table because it’s faster. But if I’m writing code in any other context, I’ll choose a search tree so I can avoid the problems discussed above. A search tree always enumerates its items in sorted order, isn’t vulnerable to hash-collision DoS attacks, and releases its memory when items are removed.
There are ways to mitigate these problems when using a hash table, but those mitigations require diligence. I have a limited supply of diligence in a day, and I’d rather spend my diligence solving my customers’ problems than spend it solving problems created by my chosen data structure.
I hope this article helps you make more informed decisions when choosing between hash tables and search trees. If you think it will, please click the clap icon below.
The source code I used to measure memory and performance is available on GitHub.