Hashcode Of A String In Java

Posted on October 5, 2009


Many of the Java programmers know what 'Hashcode' means, but don't really know how exactly it is calculated and why 31 is used to calculate the hashcode. Below is the code snippet from Java 1.6, which calculates the hashcode for a string:

public int hashCode() {
int h = hash;
if (h == 0) {
int off = offset;
char val[] = value;
int len = count;

for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}

return h;
}
Even if someone knows why 31 is used, there is a lot of stuff to know about 'Hashing', 'Hash Collisions' and multiple algorithms related to calculating hash values. First off, its a known fact that there is no perfect hashing algorithm, for which there are no collisions. But there are several algorithms, which minimize the collisions and are good enough to use. Now, coming to why 31 is used in calculating hashcode, this is the reason given by Joshua Bloch, in the book 'Effective Java':
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
This wasn't sufficient for me, to understand why 31 is used. I did a bit of research and found some good links, providing some info about why 31 is used. Here are some links with very good info: Stack Overflow - Why does java hashcode use 31 as a multiplier Apart from the above link, there are a couple of other links with pretty good info about hashing, hash collisions and performance of hashing algorithms: Consistency Of Hashcode On A Java String Best Hashing Algos, In Terms Of Collisions and Performance Java Array Hashcode Implementation

After reading up a bit, I wrote a sample test Java program, to find the hashcode of a string by multiplying by 31 (which is the same as shifting left (bitwise) by 5 times and subtracting, as in (i << 5) - i). Below is the sample test program:

public class TestHash {
public static void main(String[] args) {
String str1 = "What the heck?";

int hashcode1 = 0;
int hashcode2 = 0;

for(int i=0;i<str1.length();i++) {
hashcode1 = 31*hashcode1 + str1.charAt(i);
hashcode2 = (hashcode2 << 5) - hashcode2 + str1.charAt(i);
}

System.out.println("Hashcode1 : " + hashcode1);
System.out.println("Hashcode2 : " + hashcode2);
}
}


The output for this program is:

Hashcode1 : 277800975
Hashcode2 : 277800975


Apart from the above info, I want to share some info from the recent article on java.sun.com, written by Joseph Darcy. It was an interesting case of 'Unhashing' - Reverse Engineering Hashcode, To Find A String That Collides With The Actual String. This was an interesting way of finding 'Hash Collisions'. I then tested the code from Joseph Darcy, by writing a sample program, as below:

public class TestHash {
/**
* @author - Babji P, Chetty
*/
public static void main(String[] args) {
String str1 = "what the heck?";

int hashcode1 = 0;
int hashcode2 = 0;

for(int i=0;i<str1.length();i++) {
hashcode1 = 31*hashcode1 + str1.charAt(i);
hashcode2 = (hashcode2 << 5) - hashcode2 + str1.charAt(i);
}

System.out.println("Hashcode1 : " + hashcode1);
System.out.println("Hashcode2 : " + hashcode2);

String str2 = unhash(hashcode1);
System.out.println("Unhashed String From Hashcode : " + str2);

int hashcode3 = 0;
int hashcode4 = 0;

for(int i=0;i<str2.length();i++) {
hashcode3 = 31*hashcode3 + str2.charAt(i);
hashcode4 = (hashcode4 << 5) - hashcode4 + str2.charAt(i);
}

System.out.println("Hashcode3 : " + hashcode3);
System.out.println("Hashcode4 : " + hashcode4);

}

/**
* Returns a string with a hash equal to the argument.
* @return string with a hash equal to the argument.
* @author - Joseph Darcy
*/
public static String unhash(int target) {
StringBuilder answer = new StringBuilder();
if (target < 0) {
// String with hash of Integer.MIN_VALUE, 0x80000000
answer.append("\u0915\u0009\u001e\u000c\u0002");

if (target == Integer.MIN_VALUE)
return answer.toString();
// Find target without sign bit set
target = target & Integer.MAX_VALUE;
}

unhash0(answer, target);
return answer.toString();
}

/**
*
* @author - Joseph Darcy
*/
private static void unhash0(StringBuilder partial, int target) {
int div = target / 31;
int rem = target % 31;

if (div <= Character.MAX_VALUE) {
if (div != 0)
partial.append((char)div);
partial.append((char)rem);
} else {
unhash0(partial, div);
partial.append((char)rem);
}
}
}



The output for the above program is:

Hashcode1 : 1279794159
Hashcode2 : 1279794159
Unhashed String From Hashcode : ?☻§◄
Hashcode3 : 1279794159
Hashcode4 : 1279794159


So the strings "what the heck?" and "?☻§◄" have the same hashcode. What the heck?!?!

Blog Categories
Disclaimer
The views expressed on this blog are my personal views and do not reflect the views of my employer or campaigns I am supporting.

All sample code is provided for illustrative purposes only. These examples have not been thoroughly tested under all conditions. The writer therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

All programs contained herein are provided to you "AS IS" without any warranties of any kind. The implied warranties of non-infringement, merchantability and fitness for a particular purpose are expressly disclaimed.