Thursday, April 9, 2009

Interned Arrays?

Intro to String.intern()


In Java, string comparison is done with the String.equals() method as opposed to using the == operator. This is because the == operator only compares references.

String s1 = "first ";
s1 += "second"; // All these lines are to prevent Java from optimizing
String s2 = "first"; // and concatenating the literals at compile time and
s2 += " second"; // defeating the purpose of this snippet.
assert s1.equals(s2) && !(s1 == s2);

That is to say, s1 and s2 were constructed in different ways but now represent equal strings (they're both "first second"). The .equals() method knows this, but the strings are at different locations in memory.
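The comments in the snippet above refer to compile-time constant folding: a concatenation made up only of literals is a constant expression, so the compiler replaces it with a single literal. Here is a minimal sketch of the difference; the class and method names are made up for illustration.

```java
class ConstantFoldingDemo {
    // Returns true when both references point to the same String object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    static boolean foldedAtCompileTime() {
        // "first " + "second" is a constant expression, so the compiler
        // emits the single literal "first second" in its place.
        String folded = "first " + "second";
        return sameReference(folded, "first second");
    }

    static boolean builtAtRuntime() {
        String s = "first ";
        s += "second"; // s is an ordinary variable, so this concatenates at run time
        return sameReference(s, "first second");
    }

    public static void main(String[] args) {
        System.out.println(foldedAtCompileTime()); // prints true
        System.out.println(builtAtRuntime());      // prints false
    }
}
```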

Strings in Java (unlike in C) are immutable. That is to say, once I have a reference to a String, I know that its value will never change, because it is downright impossible for me or anyone else to change it (without using reflection, and that's cheating). What does this imply?

When you spell out two different literal strings in your source code that actually have the same value, Java gets smart and really only makes one string and points both references to it.

String s1 = "hello", s2 = "hello";
assert s1 == s2;

It's harmless to do so because Strings are immutable. This does a few good things: it takes up less space (which many people might not care about) and it speeds up comparisons. You can see this optimization in Java's String class source code (which Sun has so generously opened):

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = count;
        if (n == anotherString.count) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = offset;
            int j = anotherString.offset;
            while (n-- != 0) {
                if (v1[i++] != v2[j++])
                    return false;
            }
            return true;
        }
    }
    return false;
}

You can see that a call to .equals() with the object as its own parameter will return after one comparison. But is this optimization only applicable to literal Strings?

No, you can create this effect at runtime with the String.intern() method. It returns a canonical reference (possibly the receiver itself) to a String with the same value. If two Strings have the same value, then their .intern() methods will return the same reference.

String s1 = "first ";
s1 += "second";
String s2 = "first";
s2 += " second";
assert s1 != s2;
s1 = s1.intern();
s2 = s2.intern();
assert s1 == s2;

This is done by Java managing a pool of Strings that are the official intern representatives, so to speak. When you call .intern(), a match is found in the pool and returned, or, if there isn't one yet, your string goes into the pool and becomes the official intern.

The call to .intern() is potentially expensive, but the idea is to save time on later comparisons and to reduce memory usage.
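To make the pool idea concrete, here is a rough user-level sketch of the same lookup-or-register logic. SimpleInternPool is a hypothetical class for illustration only; the JVM's real string pool is internal and implemented differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical user-level pool, not the JVM's actual mechanism.
class SimpleInternPool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the canonical instance for the given value. If no equal
    // string is pooled yet, the argument itself becomes the canonical one.
    String intern(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        SimpleInternPool pool = new SimpleInternPool();
        String a = new String("first second");
        String b = new String("first second");
        System.out.println(a == b);                           // prints false
        System.out.println(pool.intern(a) == pool.intern(b)); // prints true
    }
}
```

The lookup costs a hash computation and possibly an .equals() call, which is why interning only pays off when the result is compared many times afterwards.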

Why use it?


All literal Strings go into the pool:

String s1 = "first ";
s1 += "second";
s1 = s1.intern();
String literal = "first second";
assert s1 == literal;

This means that if you are using some arbitrary String constants as property keys, you can just use the equality operator instead of the .equals() method.

public static final String HEIGHT_PROPERTY = "HEIGHT_PROPERTY";

public void setProperty(String key, int value) {
    if (key == HEIGHT_PROPERTY)
        this.height = value;
}
public void example() {
    setProperty(HEIGHT_PROPERTY, 5);
}

When is this not the case? If you're reading property key/value pairs out of a file or from a stream, then the Strings are constructed from scratch, and the newly created keys don't share memory with your literal.

Property File:

HEIGHT_PROPERTY=5

Parsing Code:

String line = scanner.nextLine();
String[] keyAndValue = line.split("=");
String key = keyAndValue[0];
String value = keyAndValue[1];

Here, the key is equal to HEIGHT_PROPERTY according to the .equals() method, but not the == operator. Instead of changing all your code to use .equals() instead of the == operator, you can just intern the key.

String key = keyAndValue[0].intern();

Now you can still use the == operator throughout your code, and, as long as there are enough comparisons performed after the value is read from the file, you've increased performance.
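Putting the pieces together, a self-contained sketch might look like the following. The class name PropertyKeyDemo and the method applyLine are invented for this example, and new String(...) stands in for a line that would normally come from a file or stream.

```java
class PropertyKeyDemo {
    static final String HEIGHT_PROPERTY = "HEIGHT_PROPERTY";

    private int height;

    // Splits a "KEY=value" line and interns the key so that it can be
    // compared against literal constants with ==.
    void applyLine(String line) {
        String[] keyAndValue = line.split("=", 2);
        String key = keyAndValue[0].intern();
        int value = Integer.parseInt(keyAndValue[1].trim());
        setProperty(key, value);
    }

    void setProperty(String key, int value) {
        if (key == HEIGHT_PROPERTY) { // reference comparison; relies on interned keys
            this.height = value;
        }
    }

    int getHeight() {
        return height;
    }

    public static void main(String[] args) {
        PropertyKeyDemo demo = new PropertyKeyDemo();
        // Simulates a line read at run time rather than a literal key.
        demo.applyLine(new String("HEIGHT_PROPERTY=5"));
        System.out.println(demo.getHeight()); // prints 5
    }
}
```

The catch is that every code path that produces a key has to remember to intern it; a single key that skips .intern() silently fails the == comparison.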

Interned Arrays?



In Java, the size of an array is immutable. In order to increase the size of an array, you have to create a new one, then copy the contents into it.


Object[] objects = new Object[10];
fillThisArray(objects);
// now we need it to be bigger
Object[] newObjects = new Object[20];
System.arraycopy(objects, 0, newObjects, 0, 10);
objects = newObjects;
fillMore(objects);


This doesn't mean that arrays are themselves immutable in the same way that Strings are, because you can change the contents of an array. The only exception to this is, of course, zero-length arrays, which don't have any contents to change. This means that one zero-length array of a type is indistinguishable from another, which leads programmers to get smart and keep just one zero-length array around, using it instead of creating new ones all the time.

Instead of writing this:

public Result[] getResults() {
    if (resultGenerator == null)
        return new Result[0];
    else
        return resultGenerator.getResults();
}

they write this:

private static final Result[] NO_RESULTS = new Result[0];

public Result[] getResults() {
    if (resultGenerator == null)
        return NO_RESULTS;
    else
        return resultGenerator.getResults();
}


This is a good idea, in my opinion, but it also creates clutter just for the sake of efficiency. And what if another class were also storing a zero-length array of Results? Then the purpose is partially defeated because there isn't just one instance. Is there a solution?

If the Java language 'interned' zero-length arrays, then this problem would be solved (and look much nicer).


Object[] a1 = new Object[0];
Object[] a2 = new Object[0];
assert a1 == a2; // fails :(


Currently, Java does not intern literal zero-length arrays, like the ones above. I agree that the "new" keyword looks like it should actually create something new. If, however, Java kept a special copy of zero-length arrays around for each type, it would take care of some optimizations that are ugly to write in code. The reference to the zero-length array would be inserted into the bytecode instead of the call to create a new one of length zero. It would only work for explicitly zero-length arrays. It wouldn't work for the following code.


private Object[] makeArray(int size) {
    return new Object[size];
}
private void example() {
    Object[] a1 = new Object[0];
    Object[] a2 = makeArray(0);
    assert a1 == a2; // hopeless
}


In this situation, there's no way around having two instances of zero-length arrays. Having a .intern() function on arrays would be more hassle than it's worth, because it only applies to zero-length arrays, which is a property that's usually only known at runtime.
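For comparison, here is a rough sketch of what a library-level version of that idea could look like. EmptyArrays is a hypothetical helper, not part of the JDK, and it only covers arrays of reference types. It also illustrates the hassle: every caller has to remember to go through the helper.

```java
import java.lang.reflect.Array;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not part of the JDK.
final class EmptyArrays {
    // One shared zero-length array per component type.
    private static final ConcurrentMap<Class<?>, Object> CACHE = new ConcurrentHashMap<>();

    private EmptyArrays() {}

    // Returns the shared zero-length array for the given component type.
    @SuppressWarnings("unchecked")
    static <T> T[] of(Class<T> componentType) {
        if (componentType.isPrimitive()) {
            throw new IllegalArgumentException("primitive component types are not supported");
        }
        return (T[]) CACHE.computeIfAbsent(componentType, c -> Array.newInstance(c, 0));
    }

    // Returns the shared instance for a zero-length array, and the
    // argument itself for any array that has contents.
    @SuppressWarnings("unchecked")
    static <T> T[] intern(T[] array) {
        if (array.length != 0) {
            return array;
        }
        Class<T> componentType = (Class<T>) array.getClass().getComponentType();
        return of(componentType);
    }

    public static void main(String[] args) {
        Object[] a1 = EmptyArrays.intern(new Object[0]);
        Object[] a2 = EmptyArrays.intern(new Object[0]);
        System.out.println(a1 == a2); // prints true
    }
}
```

Note that each call to intern(new Object[0]) still allocates a throwaway array before the shared one is returned, so this saves retained memory but not the allocation itself.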

What I'm proposing is that the code

new SomeType[0]

will never allocate memory, but will return a reference to an already existing zero-length array of type SomeType.

What do you think?
