Blog

You are viewing entries tagged java and may want to check out the most recent entries.

I want to combine the elements of multiple Stream instances into a single Stream. What's the best way to do this?

This article compares a few different solutions.

Stream.concat(a, b)

The JDK provides Stream.concat(a, b) for concatenating two streams.

void exampleConcatTwo() {
  Stream<String> a = Stream.of("one", "two");
  Stream<String> b = Stream.of("three", "four");
  Stream<String> out = Stream.concat(a, b);
  out.forEach(System.out::println);
  // Output:
  // one
  // two
  // three
  // four
}

What if we have more than two streams?

We could use Stream.concat(a, b) multiple times. With three streams we could write Stream.concat(Stream.concat(a, b), c).

To me that approach is depressing at three streams, and it rapidly gets worse as we add more streams.

Reduce

Alternatively, we can use reduce to perform the multiple incantations of Stream.concat(a, b) for us. The code adapts elegantly to handle any number of input streams.

void exampleReduce() {
  Stream<String> a = Stream.of("one", "two");
  Stream<String> b = Stream.of("three", "four");
  Stream<String> c = Stream.of("five", "six");
  Stream<String> out = Stream.of(a, b, c)
      .reduce(Stream::concat)
      .orElseGet(Stream::empty);
  out.forEach(System.out::println);
  // Output:
  // one
  // two
  // three
  // four
  // five
  // six
}

Be careful using this pattern! Note the warning in the documentation of Stream.concat(a, b):

Use caution when constructing streams from repeated concatenation. Accessing an element of a deeply concatenated stream can result in deep call chains, or even StackOverflowError.

It takes quite a few input streams to trigger this problem, but it is trivial to demonstrate:

void exampleStackOverflow() {
  List<Stream<String>> inputs = new AbstractList<Stream<String>>() {
    @Override
    public Stream<String> get(int index) {
      return Stream.of("one", "two");
    }

    @Override
    public int size() {
      return 1_000_000; // try changing this number
    }
  };
  Stream<String> out = inputs.stream()
      .reduce(Stream::concat)
      .orElseGet(Stream::empty);
  long count = out.count(); // probably throws
  System.out.println("count: " + count); // probably never reached
}

On my workstation, this method throws StackOverflowError after several seconds of churning.

What's going on here?

We can think of the calls to Stream.concat(a, b) as forming a binary tree. At the root is the concatenation of all the input streams. At the leaves are the individual input streams. Let's look at the trees for up to five input streams as formed by our reduce operation.

Two streams:
concat(a,b)ab
Three streams:
concat(concat(a,b),c)concat(a,b)cab
Four streams:
concat(concat(concat(a,b),c),d)concat(concat(a,b),c)dconcat(a,b)cab
Five streams:
concat(concat(concat(concat(a,b),c),d),e)concat(concat(econcat(a,b),c),d)concat(concat(a,b),c)dconcat(a,b)cab

The trees are perfectly unbalanced! Each additional input stream adds one layer of depth to the tree and one layer of indirection to reach all the other streams. This can have a noticeable negative impact on performance. With enough layers of indirection we'll see a StackOverflowError.

Balance

If we're worried that we'll concatenate a large number of streams and run into the aforementioned problems, we can balance the tree. This is as if we're optimizing a O(n) algorithm into a O(logn) one. We won't totally eliminate the possibility of StackOverflowError, and there may be other approaches that perform even better, but this should be quite an improvement over the previous solution.

void exampleBalance() {
  Stream<String> a = Stream.of("one", "two");
  Stream<String> b = Stream.of("three", "four");
  Stream<String> c = Stream.of("five", "six");
  Stream<String> out = concat(a, b, c);
  out.forEach(System.out::println);
  // Output:
  // one
  // two
  // three
  // four
  // five
  // six
}

@SafeVarargs
static <T> Stream<T> concat(Stream<T>... in) {
  return concat(in, 0, in.length);
}

static <T> Stream<T> concat(Stream<T>[] in, int low, int high) {
  switch (high - low) {
    case 0: return Stream.empty();
    case 1: return in[low];
    default:
      int mid = (low + high) >>> 1;
      Stream<T> left = concat(in, low, mid);
      Stream<T> right = concat(in, mid, high);
      return Stream.concat(left, right);
  }
}

Flatmap

There is another way to concatenate streams that is built into the JDK, and it does not involve Stream.concat(a, b) at all. It is flatMap.

void exampleFlatMap() {
  Stream<String> a = Stream.of("one", "two");
  Stream<String> b = Stream.of("three", "four");
  Stream<String> c = Stream.of("five", "six");
  Stream<String> out = Stream.of(a, b, c).flatMap(s -> s);
  out.forEach(System.out::println);
  // Output:
  // one
  // two
  // three
  // four
  // five
  // six
}

This generally outperforms the solutions based on Stream.concat(a, b) when each input stream contains fewer than 32 elements. As we increase the element count past 32, flatMap performs comparatively worse and worse as the element count rises.

flatMap avoids the StackOverflowError issue but it comes with its own set of quirks. For example, it interacts poorly with infinite streams. Calling findAny on the concatenated stream may cause the program to enter an infinite loop, whereas the other solutions would terminate almost immediately.

void exampleInfiniteLoop() {
  Stream<String> a = Stream.generate(() -> "one");
  Stream<String> b = Stream.generate(() -> "two");
  Stream<String> c = Stream.generate(() -> "three");
  Stream<String> out = Stream.of(a, b, c).flatMap(s -> s);
  Optional<String> any = combined.findAny(); // infinite loop
  System.out.println(any); // never reached
}

(The infinite loop is an implementation detail. This could be fixed in the JDK without changing the contract of flatMap.)

Also, flatMap forces its input streams into sequential mode even if they were originally parallel. The outermost concatenated stream can still be made parallel, and we will be able to process elements from distinct input streams in parallel, but the elements of each individual input stream must all be processed sequentially.

Analysis

Let me share a few trends that I've noticed when dealing with streams and stream concatenation in general, having written a fair amount of code in Java 8 by now.

  • There have been maybe one dozen cases where I've needed to concatenate streams. That's not all that many, so no matter how good the solution is, it's not going to have much of an impact for me.
  • In all but one of those one dozen cases, I needed to concatenate exactly two streams, so Stream.concat(a, b) was sufficient.
  • In the remaining case, I needed to concatenate exactly three streams. I was not even close to the point where StackOverflowError would become an issue. Stream.concat(Stream.concat(a, b), c) would have worked just fine, although I went with flatMap because I felt that it was easier to read.
  • I have never needed to concatenate streams in performance-critical sections of code.
  • I use infinite streams very rarely. When I do use them, it is obvious in context that they are infinite. And so concatenating infinite streams together and then asking a question like findAny on the result is just not something that I would be tempted to do. That particular issue with flatMap seems like one that I'll never come across.
  • I use parallel streams very rarely. I think I've only used them twice in production code. It is almost never the case that going parallel improves performance, and even when it might improve performance, it is unlikely that processing them in the singleton ForkJoinPool.commonPool() is how I will want to manage that work. The issue with flatMap forcing the input streams to be sequential seems very unlikely to be a real problem for me.
  • Let's suppose that I do want to concatenate parallel streams and have them processed in parallel. If I have eight input streams on an eight core machine, and each stream has roughly the same number of elements, the fact that flatMap forces the individual streams to be sequential will not degrade performance for me at all. All eight cores will be fully utilized, each core processing one of the eight input streams. If I have seven input streams on that same machine, I will see only slightly degraded performance. With six, slightly more degraded, and so on.

What's the takeaway from all this? Here is my advice:

For two input streams, use:
Stream.concat(a, b)

For more than two input streams, use:
Stream.of(a, b, c, ...).flatMap(s -> s)

That solution is good enough...

Overboard

...but what if we're not satisfied with "good enough"? What if we want a solution that's really fast no matter the size and shape of the input and doesn't have any of the quirks of the other solutions?

It is a bit much to inline in a blog article, so take a look at StreamConcatenation.java for the source code.

This implementation is similar to Stream.concat(a, b) in that it uses a custom Spliterator, except this implementation handles any number of input streams.

It performs quite well. It does not outperform every other solution in every scenario (flatMap is generally better for very small input streams), but it never performs much worse and it scales nicely with the number and size of the input streams.

Benchmark

I wrote a JMH benchmark to compare the four solutions discussed in this article. The benchmark uses each solution to concatenate a variable number of input streams with a variable number of elements per stream, then iterates over the elements of the concatenated stream. Here is the raw JMH output from my workstation and a prettier visualization of the benchmark results.

July 5, 2016

Mangling JSON numbers

If we have a long (64-bit integer) that we serialize into JSON, we might be in trouble if JavaScript consumes that JSON. JavaScript has the equivalent of double (64-bit floating point) for its numbers, and double cannot represent the same set of numbers as long. If we are not careful, our long is mangled in transit.

Consider 253 + 1. We can store that number in a long but not a double. Above 253, double does not have the bits required to represent every integer, creating gaps between the integers it can represent. 253 + 1 is the first integer to fall in one of these gaps. We can store 253 or 253 + 2 in a double, but 253 + 1 does not fit.

If we store 253 + 1 in a long and that number is meant to be precise, then we should avoid encoding it as a JSON number and sending it to a JavaScript client. The instant that client invokes JSON.parse they are doomed — they see a different number.

The JSON format does not mandate a particular number precision, but the application code on either side usually does. See also: Re: [Json] Limitations on number size?

This problem only occurs with very large numbers. Perhaps all the numbers we use are safe. Are we actually mangling our numbers? Probably not...

...but will we know? Will anything blow up, or will our application be silently, subtly wrong?

I suspect that when this problem does occur, it goes undetected for longer than it should. In the remainder of this article, we examine potential improvements to our handling of long.

Failing fast

We can change the way we serialize long into JSON.

When we encounter a long, we can require that the number fits into a double without losing information. If no information would be lost, we serialize the long as usual and move on. If information would be lost, we throw an exception and cause serialization to fail. We detonate immediately at the source of the error rather than letting it propagate around, doing who knows what.

Here is a utility method that can be used for this purpose:

public static void verifyLongFitsInDouble(long x) {
  double result = x;
  if (x != (long) result || x == Long.MAX_VALUE) {
    throw new IllegalArgumentException("Overflow: " + x);
  }
}

This approach appeals to me because it is unobtrusive. The check can be made in one central location, no changes to our view classes or client-side code are required, and it only throws exceptions in the specific cases where our default behavior is wrong.

A number that should be safe

Consider the number 262, which spelled out in base ten is 4611686018427387904. This number fits in both a long and a double. It passes our verifyLongFitsInDouble check. Theoretically we can send it from a Java server to a JavaScript client via JSON and both sides see exactly the same number.

To convince ourselves that this number is safe, we examine various representations of this number in Java and JavaScript:

// In Java
long x = 1L << 62;
System.out.println(Long.toString(x));    // 4611686018427387904
System.out.println(Double.toString(x));  // 4.6116860184273879E18
// 100000000000000000000000000000000000000000000000000000000000000
System.out.println(Long.toString(x, 2)); 
  
// In JavaScript
var x = Math.pow(2, 62);
console.log(x.toString());               // 4611686018427388000
console.log(x.toExponential());          // 4.611686018427388e+18
console.log(x.toFixed());                // 4611686018427387904
// 100000000000000000000000000000000000000000000000000000000000000
console.log(x.toString(2));

The output of x.toString() in JavaScript is suspicious. Do we really have the right number? We do, but we print it lazily.

x.toString() is similar in spirit to x.toExponential() and Double.toString(double) from Java. These algorithms essentially print significant digits, from most significant to least, until the output is unambiguously closer to this floating point number than any other floating point number. (And that is true here. The next lowest floating point number is 262 - 512, the next highest is 262 + 1024, and 4611686018427388000 is closer to 262 than either of those two nearby numbers.) See also: ES6 specification for ToString(Number)

x.toFixed() and the base two string give us more confidence that we have the correct number.

Verifying our assumptions with code

If 262 really is a safe number, we should be able to send it from the server to the client and back again. To verify that this number survives a round trip, we create an HTTP server with two kinds of endpoints:

  • GET endpoints that serialize a Java object into a JSON string like {"x":number}, where the number is a known constant (262). The number and the JSON string are printed to stdout. The response is that JSON string.
  • POST endpoints that deserialize a client-provided JSON string like {"x":number} into a Java object. The number and JSON string are printed to stdout. We hope that the number printed here is the same as the known constant (262) used in our GET endpoints.

Any server-side web framework or HTTP server will do. We happen to use JAX-RS in our example code.

Behavior may differ between JSON (de)serialization libraries, so we test two:

In total the server provides four endpoints, each named after the JSON serialization library used by that endpoint:

GET   /gson
POST  /gson
GET   /jackson
POST  /jackson

In the JavaScript client, we:

  • Loop through each library-specific pair of GET/POST endpoints.
  • Make a request to the GET endpoint.
  • Use JSON.parse to deserialize the response text (a JSON string) into a JavaScript object.
  • Use JSON.stringify to serialize that JavaScript object back into a JSON string.
  • Print each of the following to the console:
    • the incoming JSON string
    • the number contained in the JavaScript object, using x.toString()
    • the number contained in the JavaScript object, using x.toFixed()
    • the outgoing JSON string
  • Make a request to the POST endpoint, providing the (re)serialized JSON string as the request body.

Here is the server-side Java code:

package test;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.gson.Gson;

import javax.ws.rs.Consumes;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import java.io.IOException;

@Path("/")
public final class JsonResource {

  public static final class Payload {
    public long x;
  }

  private static final long EXPECTED_NUMBER = 1L << 62;

  @GET
  @Path("gson")
  @Produces("application/json")
  public String getGson() {
    Payload object = new Payload();
    object.x = EXPECTED_NUMBER;
    String json = new Gson().toJson(object);
    System.out.println("GET   /gson     outgoing number:  "
        + object.x);
    System.out.println("GET   /gson     outgoing JSON:    " 
        + json);
    return json;
  }

  @POST
  @Path("gson")
  @Consumes("application/json")
  public void postGson(String json) {
    Payload object = new Gson().fromJson(json, Payload.class);
    System.out.println("POST  /gson     incoming JSON:    " 
        + json);
    System.out.println("POST  /gson     incoming number:  "
        + object.x);
  }

  @GET
  @Path("jackson")
  @Produces("application/json")
  public String getJackson() throws IOException {
    Payload object = new Payload();
    object.x = EXPECTED_NUMBER;
    String json = new ObjectMapper().writeValueAsString(object);
    System.out.println("GET   /jackson  outgoing number:  "
        + object.x);
    System.out.println("GET   /jackson  outgoing JSON:    "
        + json);
    return json;
  }

  @POST
  @Path("jackson")
  @Consumes("application/json")
  public void postJackson(String json) throws IOException {
    Payload object = new ObjectMapper().readValue(json, Payload.class);
    System.out.println("POST  /jackson  incoming JSON:    "
        + json);
    System.out.println("POST  /jackson  incoming number:  "
        + object.x);
  }
}

Here is the client-side JavaScript code:

[ "/gson", "/jackson" ].forEach(function(endpoint) {
  function handleResponse() {
    var incomingJson = this.responseText;
    var object = JSON.parse(incomingJson);
    var outgoingJson = JSON.stringify(object);
    console.log(endpoint + " incoming JSON: " + incomingJson);
    console.log(endpoint + " number toString: " + object.x);
    console.log(endpoint + " number toFixed: " + object.x.toFixed());
    console.log(endpoint + " outgoing JSON: " + outgoingJson);
    var post = new XMLHttpRequest();
    post.open("POST", endpoint);
    post.setRequestHeader("Content-Type", "application/json");
    post.send(outgoingJson);
  };
  var get = new XMLHttpRequest();
  get.addEventListener("load", handleResponse);
  get.open("GET", endpoint);
  get.send();
});

The results are disappointing

Here is the server-side output:

GET   /gson     outgoing number:  4611686018427387904
GET   /gson     outgoing JSON:    {"x":4611686018427387904}
POST  /gson     incoming JSON:    {"x":4611686018427388000}
POST  /gson     incoming number:  4611686018427388000
GET   /jackson  outgoing number:  4611686018427387904
GET   /jackson  outgoing JSON:    {"x":4611686018427387904}
POST  /jackson  incoming JSON:    {"x":4611686018427388000}
POST  /jackson  incoming number:  4611686018427388000

Here is the client-side output:

/gson incoming JSON: {"x":4611686018427387904}
/gson number toString: 4611686018427388000
/gson number toFixed: 4611686018427387904
/gson outgoing JSON: {"x":4611686018427388000}
/jackson incoming JSON: {"x":4611686018427387904}
/jackson number toString: 4611686018427388000
/jackson number toFixed: 4611686018427387904
/jackson outgoing JSON: {"x":4611686018427388000}

Both of our POST endpoints print the wrong number. Yuck!

We do send the correct number to JavaScript, which we can verify by looking at the output of x.toFixed() in the console. Something bad happens between when we print x.toFixed() and when we print the number out on the server.

Why is our code wrong?

Maybe there is a particular line of our own code where we can point our finger and say, "Aha! You are wrong!" Maybe it is an issue with our architecture.

There are many ways we could choose to address this problem (or not), and what follows is certainly not an exhaustive list.

“We call JSON.parse then JSON.stringify. We should echo back the original JSON string.”

This avoids the problem but is nothing like a real application. The test code is standing in for an application that gets the payload object from the server, uses it as an object throughout, then later/maybe makes a request back to the server containing some or all of the data from that object.

In practice, most applications will not even see the JSON.parse call. The call will be hidden. The front-end framework will do it, $.getJSON will do it, etc.

“We use JSON.stringify. We should write an alternative to JSON.stringify that produces an exact representation of our number.”

JSON.stringify delegates to x.toString(). If we never use JSON.stringify, and instead we use something like x.toFixed() to print numbers like this, we can avoid this problem.

This is probably infeasible in practice.

If we need to produce JSON from JavaScript, of course we expect that JSON.stringify will be involved. As with JSON.parse, most calls happen at a distance in a library rather than our own application code.

Besides, if we really plan to avoid x.toString(), we must do so everywhere. This is hopeless.

Suppose we commit to avoiding x.toString() and we have user objects that each have a numeric id field. We can no longer write Mustache or Handlebars templates like this:

<div id="user{{id}}">    {{! functionally wrong }}
  <p>ID: {{id}}</p>      {{! visually wrong }}
  <p>Name: {{name}}</p>
</div>

We can no longer write functions like this:

function updateEmailAddress(user, newEmail) {
  // Oops, we failed for user #2^62!
  var url = "/user/" + user.id + "/email";

  // Tries to update the wrong user (and fails, hopefully)
  $.post(url, { email: newEmail });
}

It is extremely unlikely that we will remember to avoid x.toString() everywhere. It is much more likely that we will forget and end up with incorrect behavior all over the place.

“We treat the number as a long literal in the POST handlers. We should treat the number as a double literal.”

If we parse the number as a double and cast it to a long, we produce the correct result in all test cases.

Such a cast should be guarded with a check similar to our verifyLongFitsInDouble(long) code from earlier. Here is a utility method that can be used for this purpose:

public static void verifyDoubleFitsInLong(double x) {
  long result = (long) x;
  if (Double.compare(x, result) != 0 || result == Long.MAX_VALUE) {
    throw new IllegalArgumentException("Overflow: " + x);
  }
}

What if the client really does mean to send us precisely the integer 4611686018427388000? If we parse it as a double then cast it to a long, we mangle the intended number!

Here it is worth considering who we actually talk to as we design our APIs. If we only talk to JavaScript clients, then we only receive numbers that fit in double because that is all our clients have. Often times these APIs are internal and the API authors are the same as the client code authors. It is reasonable in cases like that to make assumptions about who is calling us, even if technically some other caller could use our API, because we make no claim to support other callers.

If our API is designed to be public and usable by any client, we should document our behavior with respect to number precision. verifyLongFitsInDouble(long) and verifyDoubleFitsInLong(double) are tricky to communicate, so we may prefer a simpler rule...

“We permit some values of long outside of the range -253 < x < 253. We should reject values outside of that range even when they fit in double.”

In other words, perform a bounds check on every long number that we (de)serialize. If the absolute value of that number is less than 253 then we (de)serialize that number as usual, otherwise we throw an exception.

JavaScript clients may find this range familiar, with built-in constants to express its bounds: Number.MIN_SAFE_INTEGER and Number.MAX_SAFE_INTEGER.

This approach is less permissive than our verifyLongFitsInDouble(long) and verifyDoubleFitsInLong(double) utility methods from earlier. Those methods permit every number in this range and then more. Those methods permit numbers whose adjacent values are invalid, meaning the range of valid inputs is not contiguous.

Advantages of the less permissive approach include:

  • It is easier to express in documentation. verifyLongFitsInDouble(long) and verifyDoubleFitsInLong(double) would permit 255 + 8 but not 255 + 4. Understanding the reason for that is more difficult than understanding that neither of those numbers are permitted with the |x| < 253 approach.
  • If we are actually serializing numbers like 255 + 8, it is likely that we are trying serialize nearby numbers that cannot be stored in double. Permitting the extra numbers may only mask the underlying problem: this data should not be serialized into JSON numbers.

“We encode a long as a JSON number. We should encode it as a JSON string.”

Encoding the number as a string avoids this problem.

Twitter provides string representations of its numeric ids for this reason.

This is easy to accomplish on the server. JSON serialization libraries provide a way to adopt this convention without changing the field types of our Java classes. Our Payload class keeps using long for its field, but any time the server serializes that field into JSON, it surrounds the numeric literal with quotation marks.

How viable is this approach for the client? If the number is only being used as an identifier—passed between functions as-is, compared using the === operator, used as a key in maps—then treating it as a string makes a lot of sense. If we are lucky, the client-side code is identical between the string-using and number-using versions.

If the number is used in arithmetic or passed to libraries that expect numbers, then this solution becomes less practical.

“We use JSON as the serialization format. We should use some other serialization format.”

The JSON format is not to blame for our problems, but it allows us to be sloppy.

When we use JSON we lose information about our numbers. We do not lose the values of the numbers, but we do lose the types, which tell us the precision.

A different serialization format such as Protobuf might have forced us to clarify how precise our numbers are.

“There is no problem.”

We could declare that there is no problem. Our code breaks when provided with obscenely large numbers as input, but we simply do not use numbers that large and we never will. And even though our numbers are never this large, we still want to use long in the Java code because that is convenient for us. Other Java libraries produce or consume long numbers, and we want to use those libraries without casting.

I suspect this is the solution that most people choose (conscious of that choice or not), and it is often not a bad solution. We really do not encounter this problem most of the time. There are other problems we could spend our time solving.

Numbers smaller in magnitude than 253 do not trigger this problem. Where are our long numbers coming from, and how likely are they to fall outside that range?

Auto-incrementing primary keys in a SQL database
Will we insert more than 9,007,199,254,740,992 rows into one table? Knowing nothing at all about our theoretical application, I will venture a guess: "No."
Epoch millisecond timestamps
253 milliseconds has us covered for ±300,000 years, roughly. Are we dealing with dates outside of that range? If we are, perhaps epoch milliseconds are a poor choice for units and we should solve that problem with our units first.
Randomly-generated, unbounded long numbers
The majority of these do not fit in double. If we send these to JavaScript via JSON numbers, we will have a bad time. Are we actually doing that?
User-provided, unbounded long numbers
Most of these numbers should not trigger problems, but some will. The solution may be to add bounds checking on input, filtering out misbehaving numbers before they are used.

No matter what solution (or non-solution) we choose, we should make our choice deliberately. Being oblivious is not the answer.

March 26, 2013

Everything about Java 8

The following post is a comprehensive summary of the developer-facing changes coming in Java 8. As of March 18, 2014, Java 8 is now generally available.

I used preview builds of IntelliJ for my IDE. It had the best support for the Java 8 language features at the time I went looking. You can find those builds here: IntelliJIDEA EAP.

Interface improvements

Interfaces can now define static methods. For instance, a naturalOrder method was added to java.util.Comparator:

public static <T extends Comparable<? super T>>
Comparator<T> naturalOrder() {
    return (Comparator<T>)
        Comparators.NaturalOrderComparator.INSTANCE;
}

A common scenario in Java libraries is, for some interface Foo, there would be a companion utility class Foos with static methods for generating or working with Foo instances. Now that static methods can exist on interfaces, in many cases the Foos utility class can go away (or be made package-private), with its public methods going on the interface instead.

Additionally, more importantly, interfaces can now define default methods. For instance, a forEach method was added to java.lang.Iterable:

public default void forEach(Consumer<? super T> action) {
    Objects.requireNonNull(action);
    for (T t : this) {
        action.accept(t);
    }
}

In the past it was essentially impossible for Java libraries to add methods to interfaces. Adding a method to an interface would mean breaking all existing code that implements the interface. Now, as long as a sensible default implementation of a method can be provided, library maintainers can add methods to these interfaces.

In Java 8, a large number of default methods have been added to core JDK interfaces. I'll discuss many of them later.

Why can't default methods override equals, hashCode, and toString?

An interface cannot provide a default implementation for any of the methods of the Object class. In particular, this means one cannot provide a default implementation for equals, hashCode, or toString from within an interface.

This seems odd at first, given that some interfaces actually define their equals behavior in documentation. The List interface is an example. So, why not allow this?

Brian Goetz gave four reasons in a lengthy response on the Project Lambda mailing list. I'll only describe one here, because that one was enough to convince me:

It would become more difficult to reason about when a default method is invoked. Right now it's simple: if a class implements a method, that always wins over a default implementation. Since all instances of interfaces are Objects, all instances of interfaces have non-default implementations of equals/hashCode/toString already. Therefore, a default version of these on an interface is always useless, and it may as well not compile.

For further reading, see this explanation written by Brian Goetz: response to "Allow default methods to override Object's methods"

Functional interfaces

A core concept introduced in Java 8 is that of a "functional interface". An interface is a functional interface if it defines exactly one abstract method. For instance, java.lang.Runnable is a functional interface because it only defines one abstract method:

public abstract void run();

Note that the "abstract" modifier is implied because the method lacks a body. It is not necessary to specify the "abstract" modifier, as this code does, in order to qualify as a functional interface.

Default methods are not abstract, so a functional interface can define as many default methods as it likes.

A new annotation, @FunctionalInterface, has been introduced. It can be placed on an interface to declare the intention of it being a functional interface. It will cause the interface to refuse to compile unless you've managed to make it a functional interface. It's sort of like @Override in this way; it declares intention and doesn't allow you to use it incorrectly.

Lambdas

An extremely valuable property of functional interfaces is that they can be instantiated using lambdas. Here are a few examples of lambdas:

Comma-separated list of inputs with specified types on the left, a block with a return on the right:

(int x, int y) -> { return x + y; }

Comma-separated list of inputs with inferred types on the left, a return value on the right:

(x, y) -> x + y

Single parameter with inferred type on the left, a return value on the right:

x -> x * x

No inputs on left (official name: "burger arrow"), return value on the right:

() -> x

Single parameter with inferred type on the left, a block with no return (void return) on the right:

x -> { System.out.println(x); }

Static method reference:

String::valueOf

Non-static method reference:

Object::toString

Capturing method reference:

x::toString

Constructor reference:

ArrayList::new

You can think of method reference forms as shorthand for the other lambda forms.

Method reference   Equivalent lambda expression
String::valueOf x -> String.valueOf(x)
Object::toString x -> x.toString()
x::toString () -> x.toString()
ArrayList::new () -> new ArrayList<>()

Of course, methods in Java can be overloaded. Classes can have multiple methods with the same name but different parameters. The same goes for its constructors. ArrayList::new could refer to any of its three constructors. The method it resolves to depends on which functional interface it's being used for.

A lambda is compatible with a given functional interface when their "shapes" match. By "shapes", I'm referring to the types of the inputs, outputs, and declared checked exceptions.

To give a couple of concrete, valid examples:

Comparator<String> c = (a, b) -> Integer.compare(a.length(),
                                                 b.length());

A Comparator<String>'s compare method takes two strings as input, and returns an int. That's consistent with the lambda on the right, so this assignment is valid.

Runnable r = () -> { System.out.println("Running!"); }

A Runnable's run method takes no arguments and does not have a return value. That's consistent with the lambda on the right, so this assignment is valid.

The checked exceptions (if present) in the abstract method's signature matter too. The lambda can only throw a checked exception if the functional interface declares that exception in its signature.

Capturing versus non-capturing lambdas

Lambdas are said to be "capturing" if they access a non-static variable or object that was defined outside of the lambda body. For example, this lambda captures the variable x:

int x = 5;
return y -> x + y;

In order for this lambda declaration to be valid, the variables it captures must be "effectively final". So, either they must be marked with the final modifier, or they must not be modified after they're assigned.

Whether a lambda is capturing or not has implications for performance. A non-capturing lambda is generally going to be more efficient than a capturing one. Although this is not defined in any specifications (as far as I know), and you shouldn't count on it for a program's correctness, a non-capturing lambda only needs to be evaluated once. From then on, it will return an identical instance. Capturing lambdas need to be evaluated every time they're encountered, and currently that performs much like instantiating a new instance of an anonymous class.

What lambdas don't do

There are a few features that lambdas don't provide, which you should keep in mind. They were considered for Java 8 but were not included, for simplicity and due to time constraints.

Non-final variable capture - If a variable is assigned a new value, it can't be used within a lambda. The "final" keyword is not required, but the variable must be "effectively final" (discussed earlier). This code does not compile:

int count = 0;
List<String> strings = Arrays.asList("a", "b", "c");
strings.forEach(s -> {
    count++; // error: can't modify the value of count
});

Exception transparency - If a checked exception may be thrown from inside a lambda, the functional interface must also declare that checked exception can be thrown. The exception is not propogated to the containing method. This code does not compile:

void appendAll(Iterable<String> values, Appendable out)
        throws IOException { // doesn't help with the error
    values.forEach(s -> {
        out.append(s); // error: can't throw IOException here
                       // Consumer.accept(T) doesn't allow it
    });
}

There are ways to work around this, where you can define your own functional interface that extends Consumer and sneaks the IOException through as a RuntimeException. I tried this out in code and found it to be too confusing to be worthwhile.

Control flow (break, early return) - In the forEach examples above, a traditional continue is possible by placing a "return;" statement within the lambda. However, there is no way to break out of the loop or return a value as the result of the containing method from within the lambda. For example:

final String secret = "foo";
boolean containsSecret(Iterable<String> values) {
    values.forEach(s -> {
        if (secret.equals(s)) {
            ??? // want to end the loop and return true, but can't
        }
    });
}

For further reading about these issues, see this explanation written by Brian Goetz: response to "Checked exceptions within Block<T>

Why abstract classes can't be instantiated using a lambda

An abstract class, even if it declares only one abstract method, cannot be instantiated with a lambda.

Two examples of classes with one abstract method are Ordering and CacheLoader from the Guava library. Wouldn't it be nice to be able to declare instances of them using lambdas like this?

Ordering<String> order = (a, b) -> ...;
CacheLoader<String, String> loader = (key) -> ...;

The most common argument against this was that it would add to the difficulty of reading a lambda. Instantiating an abstract class in this way could lead to execution of hidden code: that in the constructor of the abstract class.

Another reason is that it throws out possible optimizations for lambdas. In the future, it may be the case that lambdas are not evaluated into object instances. Letting users declare abstract classes with lambdas would prevent optimizations like this.

Besides, there's an easy workaround. Actually, the two example classes from Guava already demonstrate this workaround. Add factory methods to convert from a lambda to an instance:

Ordering<String> order = Ordering.from((a, b) -> ...);
CacheLoader<String, String> loader =
    CacheLoader.from((key) -> ...);

For further reading, see this explanation written by Brian Goetz: response to "Allow lambdas to implement abstract classes"

java.util.function

Package summary: java.util.function

As demonstrated earlier with Comparator and Runnable, interfaces already defined in the JDK that happen to be functional interfaces are compatible with lambdas. The same goes for any functional interfaces defined in your own code or in third party libraries.

But there are certain forms of functional interfaces that are widely, commonly useful, which did not exist previously in the JDK. A large number of these interfaces have been added to the new java.util.function package. Here are a few:

  • Function<T, R> - take a T as input, return an R as ouput
  • Predicate<T> - take a T as input, return a boolean as output
  • Consumer<T> - take a T as input, perform some action and don't return anything
  • Supplier<T> - with nothing as input, return a T
  • BinaryOperator<T> - take two T's as input, return one T as output, useful for "reduce" operations

Primitive specializations for most of these exist as well. They're provided in int, long, and double forms. For instance:

  • IntConsumer - take an int as input, perform some action and don't return anything

These exist for performance reasons, to avoid boxing and unboxing when the inputs or outputs are primitives.

java.util.stream

Package summary: java.util.stream

The new java.util.stream package provides utilities "to support functional-style operations on streams of values" (quoting the javadoc). Probably the most common way to obtain a stream will be from a collection:

Stream<T> stream = collection.stream();

A stream is something like an iterator. The values "flow past" (analogy to a stream of water) and then they're gone. A stream can only be traversed once, then it's used up. Streams may also be infinite.

Streams can be sequential or parallel. They start off as one and may be switched to the other using stream.sequential() or stream.parallel(). The actions of a sequential stream occur in serial fashion on one thread. The actions of a parallel stream may be happening all at once on multiple threads.

So, what do you do with a stream? Here is the example given in the package javadocs:

int sumOfWeights = blocks.stream().filter(b -> b.getColor() == RED)
                                  .mapToInt(b -> b.getWeight())
                                  .sum();

Note: The above code makes use of a primitive stream, and a sum() method is only available on primitive streams. There will be more detail on primitive streams shortly.

A stream provides a fluent API for transforming values and performing some action on the results. Stream operations are either "intermediate" or "terminal".

  • Intermediate - An intermediate operation keeps the stream open and allows further operations to follow. The filter and map methods in the example above are intermediate operations. The return type of these methods is Stream; they return the current stream to allow chaining of more operations.
  • Terminal - A terminal operation must be the final operation invoked on a stream. Once a terminal operation is invoked, the stream is "consumed" and is no longer usable. The sum method in the example above is a terminal operation.

Usually, dealing with a stream will involve these steps:

  1. Obtain a stream from some source.
  2. Perform one or more intermediate operations.
  3. Perform one terminal operation.

It's likely that you'll want to perform all those steps within one method. That way, you know the properties of the source and the stream and can ensure that it's used properly. You probably don't want to accept arbitrary Stream<T> instances as input to your method because they may have properties you're ill-equipped to deal with, such as being parallel or infinite.

There are a couple more general properties of stream operations to consider:

  • Stateful - A stateful operation imposes some new property on the stream, such as uniqueness of elements, or a maximum number of elements, or ensuring that the elements are consumed in sorted fashion. These are typically more expensive than stateless intermediate operations.
  • Short-circuiting - A short-circuiting operation potentially allows processing of a stream to stop early without examining all the elements. This is an especially desirable property when dealing with infinite streams; if none of the operations being invoked on a stream are short-circuiting, then the code may never terminate.

Here are short, general descriptions for each Stream method. See the javadocs for more thorough explanations. Links are provided below for each overloaded form of the operation.

Intermediate operations:

  • filter 1 - Exclude all elements that don't match a Predicate.
  • map 1 2 3 4 - Perform a one-to-one transformation of elements using a Function.
  • flatMap 1 2 3 4 - Transform each element into zero or more elements by way of another Stream.
  • peek 1 - Perform some action on each element as it is encountered. Primarily useful for debugging.
  • distinct 1 - Exclude all duplicate elements according to their .equals behavior. This is a stateful operation.
  • sorted 1 2 - Ensure that stream elements in subsequent operations are encountered according to the order imposed by a Comparator. This is a stateful operation.
  • limit 1 - Ensure that subsequent operations only see up to a maximum number of elements. This is a stateful, short-circuiting operation.
  • skip 1 - Ensure that subsequent operations do not see the first n elements. This is a stateful operation.

Terminal operations:

  • forEach 1 - Perform some action for each element in the stream.
  • toArray 1 2 - Dump the elements in the stream to an array.
  • reduce 1 2 3 - Combine the stream elements into one using a BinaryOperator.
  • collect 1 2 - Dump the elements in the stream into some container, such as a Collection or Map.
  • min 1 - Find the minimum element of the stream according to a Comparator.
  • max 1 - Find the maximum element of the stream according to a Comparator.
  • count 1 - Find the number of elements in the stream.
  • anyMatch 1 - Find out whether at least one of the elements in the stream matches a Predicate. This is a short-circuiting operation.
  • allMatch 1 - Find out whether every element in the stream matches a Predicate. This is a short-circuiting operation.
  • noneMatch 1 - Find out whether zero elements in the stream match a Predicate. This is a short-circuiting operation.
  • findFirst 1 - Find the first element in the stream. This is a short-circuiting operation.
  • findAny 1 - Find any element in the stream, which may be cheaper than findFirst for some streams. This is a short-circuiting operation.

As noted in the javadocs, intermediate operations are lazy. Only a terminal operation will start the processing of stream elements. At that point, no matter how many intermediate operations were included, the elements are then consumed in (usually, but not quite always) a single pass. (Stateful operations such as sorted() and distinct() may require a second pass over the elements.)

Streams try their best to do as little work as possible. There are micro-optimizations such as eliding a sorted() operation when it can determine the elements are already in order. In operations that include limit(x) or substream(x,y), a stream can sometimes avoid performing intermediate map operations on the elements it knows aren't necessary to determine the result. I'm not going to be able to do the implementation justice here; it's clever in lots of small but significant ways, and it's still improving.

Returning to the concept of parallel streams, it's important to note that parallelism is not free. It's not free from a performance standpoint, and you can't simply swap out a sequential stream for a parallel one and expect the results to be identical without further thought. There are properties to consider about your stream, its operations, and the destination for its data before you can (or should) parallelize a stream. For instance: Does encounter order matter to me? Are my functions stateless? Is my stream large enough and are my operations complex enough to make parallelism worthwhile?

There are primitive-specialized versions of Stream for ints, longs, and doubles:

One can convert back and forth between an object stream and a primitive stream using the primitive-specialized map and flatMap functions, among others. To give a few contrived examples:

List<String> strings = Arrays.asList("a", "b", "c");
strings.stream()                    // Stream<String>
       .mapToInt(String::length)    // IntStream
       .longs()                     // LongStream
       .mapToDouble(x -> x / 10.0)  // DoubleStream
       .boxed()                     // Stream<Double>
       .mapToLong(x -> 1L)          // LongStream
       .mapToObj(x -> "")           // Stream<String>
       ...

The primitive streams also provide methods for obtaining basic numeric statistics about the stream as a data structure. You can find the count, sum, min, max, and mean of the elements all from one terminal operation.

There are not primitive versions for the rest of the primitive types because it would have required an unacceptable amount of bloat in the JDK. IntStream, LongStream, and DoubleStream were deemed useful enough to include, and streams of other numeric primitives can represented using these three via widening primitive conversion.

One of the most confusing, intricate, and useful terminal stream operations is collect. It introduces a new interface called Collector. This interface is somewhat difficult to understand, but fortunately there is a Collectors utility class for generating all sorts of useful Collectors. For example:

List<String> strings = values.stream()
                             .filter(...)
                             .map(...)
                             .collect(Collectors.toList());

If you want to put your stream elements into a Collection, Map, or String, then Collectors probably has what you need. It's definitely worthwhile to browse through the javadoc of that class.

Generic type inference improvements

Summary of proposal: JEP 101: Generalized Target-Type Inference

This was an effort to improve the ability of the compiler to determine generic types where it was previously unable to. There were many cases in previous versions of Java where the compiler could not figure out the generic types for a method in the context of nested or chained method invocations, even when it seemed "obvious" to the programmer. Those situations required the programmer to explicitly specify a "type witness". It's a feature of generics that surprisingly few Java programmers know about (I'm saying this based on personal interactions and reading StackOverflow questions). It looks like this:

// In Java 7:
foo(Utility.<Type>bar());
Utility.<Type>foo().bar();

Without the type witnesses, the compiler might fill in <Object> as the generic type, and the code would fail to compile if a more specific type was required instead.

Java 8 improves this situation tremendously. In many more cases, it can figure out a more specific generic type based on the context.

// In Java 8:
foo(Utility.bar());
Utility.foo().bar();

This one is still a work in progress, so I'm not sure how many of the examples listed in the proposal will actually be included for Java 8. Hopefully it's all of them.

java.time

Package summary: java.time

The new date/time API in Java 8 is contained in the java.time package. If you're familiar with Joda Time, it will be really easy to pick up. Actually, I think it's so well-designed that even people who have never heard of Joda Time should find it easy to pick up.

Almost everything in the API is immutable, including the value types and the formatters. No more worrying about exposing Date fields or dealing with thread-local date formatters.

The intermingling with the legacy date/time API is minimal. It was a clean break:

The new API prefers enums over integer constants for things like months and days of the week.

So, what's in it? The package-level javadocs do an excellent job of explaining the additional types. I'll give a brief rundown of some noteworthy parts.

Extremely useful value types:

Less useful value types:

Other useful types:

  • DateTimeFormatter - for converting datetime objects to strings
  • ChronoUnit - for figuring out the amount of time bewteen two points, e.g. ChronoUnit.DAYS.between(t1, t2)
  • TemporalAdjuster - e.g. date.with(TemporalAdjuster.firstDayOfMonth())

The new value types are, for the most part, supported by JDBC. There are minor exceptions, such as ZonedDateTime which has no counterpart in SQL.

Collections API additions

The fact that interfaces can define default methods allowed the JDK authors to make a large number of additions to the collection API interfaces. Default implementations for these are provided on all the core interfaces, and more efficient or well-behaved overridden implementations were added to all the concrete classes, where applicable.

Here's a list of the new methods:

Also, Iterator.remove() now has a default, throwing implementation, which makes it slightly easier to define unmodifiable iterators.

Collection.stream() and Collection.parallelStream() are the main gateways into the stream API. There are other ways to generate streams, but those are going to be the most common by far.

The addition of List.sort(Comparator) is fantastic. Previously, the way to sort an ArrayList was this:

Collections.sort(list, comparator);

That code, which was your only option in Java 7, was frustratingly inefficient. It would dump the list into an array, sort the array, then use a ListIterator to insert the array contents into the list in new positions.

The default implementation of List.sort(Comparator) still does this, but concrete implementing classes are free to optimize. For instance, ArrayList.sort invokes Arrays.sort on the ArrayList's internal array. CopyOnWriteArrayList does the same.

Performance isn't the only potential gain from these new methods. They can have more desirable semantics, too. For instance, sorting a Collections.synchronizedList() is an atomic operation using list.sort. You can iterate over all its elements as an atomic operation using list.forEach. Previously this was not possible.

Map.computeIfAbsent makes working with multimap-like structures easier:

// Index strings by length:
Map<Integer, List<String>> map = new HashMap<>();
for (String s : strings) {
    map.computeIfAbsent(s.length(),
                        key -> new ArrayList<String>())
       .add(s);
}

// Although in this case the stream API may be a better choice:
Map<Integer, List<String>> map = strings.stream()
    .collect(Collectors.groupingBy(String::length));

Concurrency API additions

ForkJoinPool.commonPool() is the structure that handles all parallel stream operations. It is intended as an easy, good way to obtain a ForkJoinPool/ExecutorService/Executor when you need one.

ConcurrentHashMap<K, V> was completely rewritten. Internally it looks nothing like the version that was in Java 7. Externally it's mostly the same, except it has a large number of bulk operation methods: many forms of reduce, search, and forEach.

ConcurrentHashMap.newKeySet() provides a concurrent java.util.Set implementation. It is essentially another way of writing Collections.newSetFromMap(new ConcurrentHashMap<T, Boolean>()).

StampedLock is a new lock implementation that can probably replace ReentrantReadWriteLock in most cases. It performs better than RRWL when used as a plain read-write lock. Is also provides an API for "optimistic reads", where you obtain a weak, cheap version of a read lock, do the read operation, then check afterwards if your lock was invalidated by a write. There's more detail about this class and its performance in a set of slides put together by Heinz Kabutz (starting about half-way through the set of slides): "Phaser and StampedLock Presentation"

CompletableFuture<T> is a nice implementation of the Future interface that provides a ton of methods for performing (and chaining together) asynchronous tasks. It relies on functional interfaces heavily; lambdas are a big reason this class was worth adding. If you are currently using Guava's Future utilities, such as Futures, ListenableFuture, and SettableFuture, you may want to check out CompletableFuture as a potential replacement.

IO/NIO API additions

Most of these additions give you ways to obtain java.util.stream.Stream from files and InputStreams. They're a bit different from the streams you obtain from regular collections though. For one, they may throw UncheckedIOException. Also, they are instances of streams where using the stream.close() method is necessary. Streams implement AutoCloseable and can therefore be used in try-with-resources statements. Streams also have an onClose(Runnable) intermediate operation that I didn't list in the earlier section about streams. It allows you to attach handlers to a stream that execute when it is closed. Here is an example:

// Print the lines in a file, then "done"
try (Stream lines = Files.lines(path, UTF_8)) {
    lines.onClose(() -> System.out.println("done"))
	     .forEach(System.out::println);
}

Reflection and annotation changes

Annotations are allowed in more places, e.g. List<@Nullable String>. The biggest impact of this is likely to be for static analysis tools such as Sonar and FindBugs.

This JSR 308 website does a better job of explaining the motivation for these changes than I could possibly do: "Type Annotations (JSR 308) and the Checker Framework"

Nashorn JavaScript Engine

Summary of proposal: JEP 174: Nashorn JavaScript Engine

I did not experiment with Nashorn so I know very little beyond what's described in the proposal above. Short version: It's the successor to Rhino. Rhino is old and a little bit slow, and the developers decided they'd be better off starting from scratch.

Other miscellaneous additions to java.lang, java.util, and elsewhere

There is too much there to talk about, but I'll pick out a few noteworthy items.

ThreadLocal.withInitial(Supplier<T>) makes declaring thread-local variables with initial values much nicer. Previously you would supply an initial value like this:

ThreadLocal<List<String>> strings =
    new ThreadLocal<List<String>>() {
        @Override
        protected List<String> initialValue() {
             return new ArrayList<>();
        }
    };

Now it's like this:

ThreadLocal<List<String>> strings =
    ThreadLocal.withInital(ArrayList::new);

Optional<T> appears in the stream API as the return value for methods like min/max, findFirst/Any, and some forms of reduce. It's used because there might not be any elements in the stream, and it provides a fluent API for handling the "some result" versus "no result" cases. You can provide a default value, throw an exception, or execute some action only if the result exists.

It's very, very similar to Guava's Optional class. It's nothing at all like Option in Scala, nor is it trying to be, and the name similarity there is purely coincidental.

Aside: it's interesting that Java 8's Optional and Guava's Optional ended up being so similar, despite the absurd amount of debate that occurred over its addition to both libraries.

"FYI.... Optional was the cause of possibly the single greatest conflagration on the internal Java libraries discussion lists ever."

Kevin Bourrillion in response to "Some new Guava classes targeted for release 10"

"On a purely practical note, the discussions surrounding Optional have exceeded its design budget by several orders of magnitude."

Brian Goetz in response to "Optional require(s) NonNull"

StringJoiner and String.join(...) are long, long overdue. They are so long overdue that the vast majority of Java developers likely have already written or have found utilities for joining strings, but it is nice for the JDK to finally provide this itself. Everyone has encountered situations where joining strings is required, and it is a Good Thing™ that we can now express that through a standard API that every Java developer (eventually) will know.

Comparator provides some very nice new methods for doing chained comparisons and field-based comparisons. For example:

people.sort(
    Comparator.comparing(Person::getLastName)
        .thenComparing(Person::getFirstName)
        .thenComparing(
            Person::getEmailAddress,
            Comparator.nullsLast(CASE_INSENSITIVE_ORDER)));

These additions provide good, readable shorthand for complex sorts. Many of the use cases served by Guava's ComparisonChain and Ordering utility classes are now served by these JDK additions. And for what it's worth, I think the JDK verions read better than the functionally-equivalent versions expressed in Guava-ese.

More?

There are lots of various small bug fixes and performance improvements that were not covered in this post. But they are appreciated too!

This post was intended to cover every single language-level and API-level change coming in Java 8. If any were missed, it was an error that should be corrected. Please let me know if you discover an omission. You can contact me via e-mail.