There are some misconceptions about Object.GetHashCode(). Two of them are quite prevalent:
- The implementation of Object.GetHashCode() derives the hash value from the reference (as a memory pointer value) of the instance
- The return value of Object.GetHashCode() is unique to the instance. No other instances will have the same hash value.
These two claims may look to be acceptable at first sight. Let’s examine deeper.
Reference as hash value
This appears to be a good idea. Only one instance can exist on a given memory address, so different instances get different hash codes. What makes this idea inappropriate under .NET is the garbage collector. It defragments the memory by moving instances around – making our imaginary hash value volatile.
GetHashCode() as unique identifier
Why can’t everybody have a different birthday? Because there are only 365 (366) days in a year but more people live on Earth. GetHashCode() returns a 32 bit value, which means 4 billion different hash codes. It is a big number, but software can easily generate more instances. This means that 32 bit is not enough to ensure uniqueness. But it is not even required. Hash value based algorithms are designed to handle hash collisions – or at least they should be. For example the algorithm behind Dictionary or HashSet calls the Equals() method in case of a collision to distinguish instances.
The Truth
So what is the result of the default implementation of GetHashCode()? The short answer is: a pseudo random number generated by a simple function. When Object.GetHashCode() is called on an instance, this function generates a number. Then the number gets stored in a special location associated with that instance. This ensures that every call on the same instance returns the same number.
The Proof – predict next hash code
To show that it is not just a new myth, we are going to write a small program predicting the result of the next Object.GetHashCode() call. To achieve this, the same pseudo random algorithm must be implemented as .NET framework has built in.
Breaking the Random Generator
Pseudo random generators are targets of cryptographical attacks because ephemeral keys are generated using these functions. If one knows the state and the algorithm of a pseudo random function, he will be able to calculate the key material of a TLS communication making possible to decipher it real time.
The importance of this field was the driver for the development of sophisticated methods to reverse engineer a random function with its state. We could use such a method to decipher the random function behind GetHashCode(), too.
Before someone gets excited, I have to admit we will not play Neo. We can simply check the SSCLI source code for the implementation. The important parts are around sscli20\clr\src\vm\threads.cpp 1752. line and threads.h 942. line.
The following code implements the same algorithm:
public static class HashCodeOracle { // More independent generators work in the code; every thread // has its own. The following number is the common seed value: private const uint coreSeed = 123456789; // When we start using the random generator of GetHashCode(), // we do not know whether CLR has already used it or not. // Therefore our solution must be synchronized. First, get a // random from GetHasCode(), then try to find that value // iterating our random generator. The following value gives a // limit to the iteration to avoid infinite loop: private const uint maxSyncTrials = 1000; // CLR stores hash values in a special area along with other values. // These other values require six bits private const int syncFlagCount = 6; // Every thread has its own state for random generation, so // thread local store must be used. ThreadLocal<T> calls // the provided method to calculate initial random state. private static ThreadLocal<uint> state = new ThreadLocal<uint>( () => GetInitStateForThread()); private static uint GetInitStateForThread() { var initState = CalculateThreadSeed(); // Synchronize with the state of the current thread. // The following "algorithm" is not adequate in real // environment but ok for our test. var syncObject = new Object(); var syncValue = syncObject.GetHashCode(); var syncIterations = 0; while (++syncIterations < maxSyncTrials) { initState = IterateState(initState); // We've got the state!! if ((initState >> syncFlagCount) == syncValue) { return initState; } // if } // while throw new NotSupportedException("Could not sync on current thread"); } // GetInitStateForThread() private static uint CalculateThreadSeed() { // Every thread starts from the same value var threadSeed = coreSeed; // The following loop works a bit different way than the // original in CLR, so it might happen to get a wrong seed. var managedThreadId = Thread.CurrentThread.ManagedThreadId; for (int i = 0; i < managedThreadId; i++) { threadSeed = threadSeed * 1566083941 + 1; } // for i return threadSeed; } // CalculateThreadSeed() // The pseudo random generator function. private static uint IterateState(uint currentState) { var managedThreadId = (uint)Thread.CurrentThread.ManagedThreadId; return currentState * (managedThreadId * 4u + 5u) + 1u; } // IterateState() // This gives the next hash code. It calls the // random function to provide a new value, then // clears the upper bits by a bit shift. public static int Next() { int result = 0; do { var currentState = state.Value; var newState = IterateState(currentState); state.Value = newState; result = (int)(newState >> syncFlagCount); } while (result == 0); return result; } // Next() } // class HashCodeOracle
The code above uses a thread dependent generator. So the implementation uses a thread local variable to calculate and maintain a seed value. The class can be tried out using the following test code:
static void Main(string[] args) { for (var i = 0; i < 1000; i++) { Console.WriteLine("pred: " + HashCodeOracle.Next()); var o = new object(); Console.WriteLine("fact: " + o.GetHashCode()); } // for i } // Main()
A snippet from the result:
pred: 3863873 fact: 3863873 pred: 34774863 fact: 34774863 pred: 44538317 fact: 44538317 pred: 65300541 fact: 65300541 pred: 50833958 fact: 50833958 pred: 54852443 fact: 54852443
So the hash value can be predicted by the random function.
How unique is the hash code?
We’ve mentioned that the result of the GetHashCode() call is 32 bit wide. In fact it is only 26 bit – the few bits on the higher part are reserved for other purposes. It means that 2^26=67 million different values can be resulted. That’s many, isn’t it? Not really.
Birthday Attack
Some algorithms based on weak hash functions can be forged applying the “Birthday Problem”. To understand the Birthday Problem, answer the following questions: how many people are needed in a room to reach 50% probability that two people have the same birthday (year ignored)?Most people – including me – would say a big number, maybe around 300 or more as 366 different birth days exist.
The correct answer, however, is 23. It is a big surprise and it is also interesting that 57 people are enough to reach 99%. So if there are 55-60 people in a room, it is almost certain that two of them have the same birthday.
Back to our problem, how many hash codes are needed to find two that are identical with 50% probability? Applying the formula from Wikipedia, using 26 bit hash code, 9600 hash codes are sufficient to have 50% chance for a collision. If 25000 hash codes are generated, the probability of collision is 99%. This is not a big number, a small application can reach it easily.
Hash Collision in practice
It is easy to check the magnitude of the calculations above. Let’s write a small program which generates hash values until a collision happens.
var hashCodes = new HashSet<int>(); int i = 0; while (hashCodes.Add(HashCodeOracle.Next())) { i++; } // while Console.WriteLine(i);
The result of the program (which depends on many factors) on my computer is 5310.
Where does hash value get stored?
After Object.GetHashCode() generated a random value, this value must be stored and associated with the instance. Otherwise every GetHashCode() call would return a different value for the same instance. So where does this value get stored? In a hidden field of Object? Well, the answer is yes and no.
SyncBlockIndex aka sink
When we create an instance and memory is being reserved to hold instance fields, some additional space also gets reserved. Many people have already heard about the first field of the instance footprint. Sometimes it is called type handle, sometimes method table and sometimes type object pointer – and in my opinion the latter is a bit misleading. This field is the modern version of the C++ virtual table pointer. It helps to resolve virtual and interface method addresses, to find static member addresses and and has other purposes. The method table referenced by the type handle is type specific – not instance specific – so generated hash values cannot be stored in the method table.
Everyone who read the excellent book “CLR via C#” may remember that there is another hidden field reserved in memory beside instance fields – called SyncBlockIndex. Where is this field? Savvy readers and those with C++ background may know that the memory pointed by a reference starts with 4 (or 8 on 64 bit systems) bytes type handle, then followed by instance fields. There is no other field here. Now, the location of the SyncBlockIndex is prior to the reference at the -4 (or -8) offset. To help visualize it, look at the picture below:
The funny thing about SyncBlockIndex is that the SSCLI source code refers it as sink – it is piled on by many unrelated things. What can be found here? The original purpose of it is to support object locking – using the lock keyword or the Monitor class on an instance. What else? If the instance is a String, there is some additional information here that can be used for optimization by the CLR. All uppercase, all ASCII, things like that, helping string comparsions/conversions. The Garbage Collector also has a bit here – this is how it knows if an instance has already been visited, avoiding circular walking.
The main purpose of the SyncBlockIndex is to support locking. Most of the instances will never be locked, however, so the majority of the bits in SyncBlockIndex will never be used. Apart from the special flags, it means 26 rarely used bits. Does this number look familiar? Yes, this is the effective bit length of the hash value returned by GetHashCode() – and this is not a coincidence.
Let’s check whether the hash code is really being stored in SyncBlockIndex:
The figure is a bit complex at first sight. First, there is a simple program code that creates a System.Object instance, and then obtains a hash code for it (1). At the bottom of the screen in the black console window the hash code is printed in hexadecimal representation.
Around (2) in the Immediate window the SOS extension has been already loaded. We need this to trace local variables, which helps us to find the Object instance. Using !CLRStack -l shows the locals in Main(), one of them is a reference to the Object instance. The address of this instance in memory is 0x0278b2ec. We mentioned that the SyncBlockIndex is at offset -4, so if we want to dump the content of the SyncBlockIndex, we need to use address 0x0278b3e8 (3). The first 4 bytes of the memory dump is 98 80 bf 0e. It does not look like the hash value. What is misleading here is that the program is hosted by a processor using Little Endian representation – so the bytes of the 32 bit value are reversed. Now we are almost there, the only subtle difference is in the first digit, which is 0x0e instead of 0x02. This is because the upper 6 bits are flags. The upper 4 bits happen to be zeros, but the next 2 bits are switched on – this gives the 0x0E value. If we ignore these two bits, we get the expected 0x02.
Now we know that the hash value is stored in the SynvBlockIndex. But what about the original purpose of the SyncBlockIndex? What happens if we lock that Object instance?
The C# lock keyword is built on the Monitor class which is built on a Critical Section object. The Windows operating system has a Critical Section construct but the .NET CLR has its own implementation. A Critical Object is a quite simple structure describing things like the lock owner thread id, flags, etc. When an object gets locked in .NET, the CLR creates a new Critical Section structure. Critical Section structures are arranged in an array. The association between Critical Section and the locked instance is expressed by an array index held by the SyncBlockIndex. Again, when an instance is used in lock, its SyncBlockIndex gets an index to the Critical Section array.
But what happens with the stored hash value in this case? Will it be lost? No, of course not. It will be saved in a different location. The above mentioned array is not a simple Critical Section array. It is an array of broader structures called SyncBlocks. The name SyncBlock is not really descriptive, because it is not only about synchronization. This structure has more bins, one for the Critical Section structure and there is another one for the hash value.
Depending whether an object is used in lock or not, the CLR stores hash in the SyncBlock structure or in the SyncBlokcIndex. Look at the following picture:
At first this picture is even fuzzier than the previous one, but following the numbers it can be understood. The program asks for a hash code. We saw it previously that this number gets stored in SyncBlockIndex. Then the program locks on the Object instance (1). It means SyncBlockIndex must be used as an index. With the help of SOS (2) we get the address of the Object instance in memory. Looking at the SyncBlockIndex area at -4 offset, we can see that its structure is different now. Apart from the flags in the most significant byte, it contains a simple one as a value (3). So the SyncBlockIndex now contains the index value one. If we list the SyncBlock array with SOS (4), we can find the memory area where SyncBlock structure is stored. Checking this memory area we can see the hash value (Little Endian encoded).
Conclusion
The internal mechanism of the default GetHashCode() implementation is not important to know for a developer. False assumptions might cause problems, however. If a developer assumes that default hash code is a good id for his instances or that the default implementation gives the same value for the same instance state (because it is calculated from instance fields) then his software will be incorrect. (ok, default implementation of hash code calculation for value types uses field content, not random, but this is a different story)
Original article in Hungarian: Mit ad vissza az Object.GetHashCode()?