Blog

March 26, 2013

Everything about Java 8

The following post is a comprehensive summary of the developer-facing changes coming in Java 8. As of March 18, 2014, Java 8 is now generally available.

I used preview builds of IntelliJ IDEA for my IDE. It had the best support for the Java 8 language features at the time I went looking. You can find those builds here: IntelliJ IDEA EAP.

Interface improvements

Interfaces can now define static methods. For instance, a naturalOrder method was added to java.util.Comparator:

public static <T extends Comparable<? super T>>
Comparator<T> naturalOrder() {
    return (Comparator<T>)
        Comparators.NaturalOrderComparator.INSTANCE;
}

A common scenario in Java libraries is, for some interface Foo, there would be a companion utility class Foos with static methods for generating or working with Foo instances. Now that static methods can exist on interfaces, in many cases the Foos utility class can go away (or be made package-private), with its public methods going on the interface instead.
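
As a hypothetical sketch (Foo and its factory method are made-up names), a factory that previously would have lived on a Foos utility class can now sit directly on the interface:

public interface Foo {
    void doSomething();

    // This would previously have been Foos.createDefault():
    static Foo createDefault() {
        return () -> System.out.println("doing the default thing");
    }
}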

Additionally, and more importantly, interfaces can now define default methods. For instance, a forEach method was added to java.lang.Iterable:

public default void forEach(Consumer<? super T> action) {
    Objects.requireNonNull(action);
    for (T t : this) {
        action.accept(t);
    }
}

In the past it was essentially impossible for Java libraries to add methods to interfaces. Adding a method to an interface would mean breaking all existing code that implements the interface. Now, as long as a sensible default implementation of a method can be provided, library maintainers can add methods to these interfaces.

In Java 8, a large number of default methods have been added to core JDK interfaces. I'll discuss many of them later.

Why can't default methods override equals, hashCode, and toString?

An interface cannot provide a default implementation for any of the methods of the Object class. In particular, this means one cannot provide a default implementation for equals, hashCode, or toString from within an interface.

This seems odd at first, given that some interfaces actually define their equals behavior in documentation. The List interface is an example. So, why not allow this?

Brian Goetz gave four reasons in a lengthy response on the Project Lambda mailing list. I'll only describe one here, because that one was enough to convince me:

It would become more difficult to reason about when a default method is invoked. Right now it's simple: if a class implements a method, that always wins over a default implementation. Since all instances of interfaces are Objects, all instances of interfaces have non-default implementations of equals/hashCode/toString already. Therefore, a default version of these on an interface is always useless, and it may as well not compile.

For further reading, see this explanation written by Brian Goetz: response to "Allow default methods to override Object's methods"

Functional interfaces

A core concept introduced in Java 8 is that of a "functional interface". An interface is a functional interface if it defines exactly one abstract method. For instance, java.lang.Runnable is a functional interface because it only defines one abstract method:

public abstract void run();

Note that the "abstract" modifier is redundant here; an interface method without a body is implicitly abstract. Specifying the modifier, as this code does, is not necessary for the interface to qualify as a functional interface.

Default methods are not abstract, so a functional interface can define as many default methods as it likes.

A new annotation, @FunctionalInterface, has been introduced. It can be placed on an interface to declare the intention of it being a functional interface. It will cause the interface to refuse to compile unless you've managed to make it a functional interface. It's sort of like @Override in this way; it declares intention and doesn't allow you to use it incorrectly.
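
For example, a quick sketch with a made-up interface:

@FunctionalInterface
public interface StringTransformer {
    String transform(String input);   // exactly one abstract method

    // Default methods don't count against the "one abstract method" rule:
    default StringTransformer andThenTrim() {
        return input -> transform(input).trim();
    }
}

Adding a second abstract method to StringTransformer would now be a compile-time error.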

Lambdas

An extremely valuable property of functional interfaces is that they can be instantiated using lambdas. Here are a few examples of lambdas:

Comma-separated list of inputs with specified types on the left, a block with a return on the right:

(int x, int y) -> { return x + y; }

Comma-separated list of inputs with inferred types on the left, a return value on the right:

(x, y) -> x + y

Single parameter with inferred type on the left, a return value on the right:

x -> x * x

No inputs on left (official name: "burger arrow"), return value on the right:

() -> x

Single parameter with inferred type on the left, a block with no return (void return) on the right:

x -> { System.out.println(x); }

Static method reference:

String::valueOf

Non-static method reference:

Object::toString

Capturing method reference:

x::toString

Constructor reference:

ArrayList::new

You can think of method reference forms as shorthand for the other lambda forms.

Method reference     Equivalent lambda expression
String::valueOf      x -> String.valueOf(x)
Object::toString     x -> x.toString()
x::toString          () -> x.toString()
ArrayList::new       () -> new ArrayList<>()

Of course, methods in Java can be overloaded. Classes can have multiple methods with the same name but different parameters. The same goes for constructors. ArrayList::new could refer to any of ArrayList's three constructors. The method it resolves to depends on which functional interface it's being used for.

A lambda is compatible with a given functional interface when their "shapes" match. By "shapes", I'm referring to the types of the inputs, outputs, and declared checked exceptions.

To give a couple of concrete, valid examples:

Comparator<String> c = (a, b) -> Integer.compare(a.length(),
                                                 b.length());

A Comparator<String>'s compare method takes two strings as input, and returns an int. That's consistent with the lambda on the right, so this assignment is valid.

Runnable r = () -> { System.out.println("Running!"); };

A Runnable's run method takes no arguments and does not have a return value. That's consistent with the lambda on the right, so this assignment is valid.

The checked exceptions (if present) in the abstract method's signature matter too. The lambda can only throw a checked exception if the functional interface declares that exception in its signature.

Capturing versus non-capturing lambdas

Lambdas are said to be "capturing" if they access a non-static variable or object that was defined outside of the lambda body. For example, this lambda captures the variable x:

int x = 5;
return y -> x + y;

In order for this lambda declaration to be valid, the variables it captures must be "effectively final". So, either they must be marked with the final modifier, or they must not be modified after they're assigned.

Whether a lambda is capturing or not has implications for performance. A non-capturing lambda is generally going to be more efficient than a capturing one. Although this is not defined in any specifications (as far as I know), and you shouldn't count on it for a program's correctness, a non-capturing lambda only needs to be evaluated once. From then on, it will return an identical instance. Capturing lambdas need to be evaluated every time they're encountered, and currently that performs much like instantiating a new instance of an anonymous class.

What lambdas don't do

There are a few features that lambdas don't provide, which you should keep in mind. They were considered for Java 8 but were not included, for simplicity and due to time constraints.

Non-final variable capture - If a variable is assigned a new value, it can't be used within a lambda. The "final" keyword is not required, but the variable must be "effectively final" (discussed earlier). This code does not compile:

int count = 0;
List<String> strings = Arrays.asList("a", "b", "c");
strings.forEach(s -> {
    count++; // error: can't modify the value of count
});

Exception transparency - If a checked exception may be thrown from inside a lambda, the functional interface must also declare that the checked exception can be thrown. The exception is not propagated to the containing method. This code does not compile:

void appendAll(Iterable<String> values, Appendable out)
        throws IOException { // doesn't help with the error
    values.forEach(s -> {
        out.append(s); // error: can't throw IOException here
                       // Consumer.accept(T) doesn't allow it
    });
}

There are ways to work around this, where you can define your own functional interface that extends Consumer and sneaks the IOException through as a RuntimeException. I tried this out in code and found it to be too confusing to be worthwhile.
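
For the curious, here is a minimal sketch of the kind of wrapper involved (IOConsumer is a made-up name; the JDK does not provide this interface, and as noted above, the result may be too confusing to be worthwhile):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

interface IOConsumer<T> extends Consumer<T> {
    void acceptThrows(T t) throws IOException;

    @Override
    default void accept(T t) {
        try {
            acceptThrows(t);
        } catch (IOException e) {
            // Smuggle the checked exception out as an unchecked one.
            throw new UncheckedIOException(e);
        }
    }
}

// Usage in the appendAll example above:
// values.forEach((IOConsumer<String>) out::append);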

Control flow (break, early return) - In the forEach examples above, a traditional continue is possible by placing a "return;" statement within the lambda. However, there is no way to break out of the loop or return a value as the result of the containing method from within the lambda. For example:

final String secret = "foo";
boolean containsSecret(Iterable<String> values) {
    values.forEach(s -> {
        if (secret.equals(s)) {
            ??? // want to end the loop and return true, but can't
        }
    });
}
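
One practical alternative is the stream API's short-circuiting anyMatch operation (covered later in this post). A sketch, not code from the original example:

boolean containsSecret(Iterable<String> values) {
    // Adapt the Iterable to a sequential Stream, then short-circuit.
    return StreamSupport.stream(values.spliterator(), false)
                        .anyMatch(s -> secret.equals(s));
}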

For further reading about these issues, see this explanation written by Brian Goetz: response to "Checked exceptions within Block<T>"

Why abstract classes can't be instantiated using a lambda

An abstract class, even if it declares only one abstract method, cannot be instantiated with a lambda.

Two examples of classes with one abstract method are Ordering and CacheLoader from the Guava library. Wouldn't it be nice to be able to declare instances of them using lambdas like this?

Ordering<String> order = (a, b) -> ...;
CacheLoader<String, String> loader = (key) -> ...;

The most common argument against this was that it would add to the difficulty of reading a lambda. Instantiating an abstract class in this way could lead to execution of hidden code: that in the constructor of the abstract class.

Another reason is that it throws out possible optimizations for lambdas. In the future, it may be the case that lambdas are not evaluated into object instances. Letting users declare abstract classes with lambdas would prevent optimizations like this.

Besides, there's an easy workaround. Actually, the two example classes from Guava already demonstrate this workaround. Add factory methods to convert from a lambda to an instance:

Ordering<String> order = Ordering.from((a, b) -> ...);
CacheLoader<String, String> loader =
    CacheLoader.from((key) -> ...);

For further reading, see this explanation written by Brian Goetz: response to "Allow lambdas to implement abstract classes"

java.util.function

Package summary: java.util.function

As demonstrated earlier with Comparator and Runnable, interfaces already defined in the JDK that happen to be functional interfaces are compatible with lambdas. The same goes for any functional interfaces defined in your own code or in third party libraries.

But there are certain forms of functional interfaces that are widely, commonly useful, which did not exist previously in the JDK. A large number of these interfaces have been added to the new java.util.function package. Here are a few:

  • Function<T, R> - take a T as input, return an R as output
  • Predicate<T> - take a T as input, return a boolean as output
  • Consumer<T> - take a T as input, perform some action and don't return anything
  • Supplier<T> - with nothing as input, return a T
  • BinaryOperator<T> - take two T's as input, return one T as output, useful for "reduce" operations

Primitive specializations for most of these exist as well. They're provided in int, long, and double forms. For instance:

  • IntConsumer - take an int as input, perform some action and don't return anything

These exist for performance reasons, to avoid boxing and unboxing when the inputs or outputs are primitives.
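
To make these shapes concrete, here are a few assignments that compile (a quick sketch, assuming the usual java.util and java.util.function imports):

Function<String, Integer> length = String::length;   // T=String, R=Integer
Predicate<String> isEmpty = String::isEmpty;          // T=String
Consumer<String> printer = System.out::println;       // T=String
Supplier<List<String>> maker = ArrayList::new;        // T=List<String>
BinaryOperator<Integer> adder = (a, b) -> a + b;      // T=Integer
IntConsumer intPrinter = i -> System.out.println(i);  // primitive int, no boxing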

java.util.stream

Package summary: java.util.stream

The new java.util.stream package provides utilities "to support functional-style operations on streams of values" (quoting the javadoc). Probably the most common way to obtain a stream will be from a collection:

Stream<T> stream = collection.stream();

A stream is something like an iterator. The values "flow past" (analogy to a stream of water) and then they're gone. A stream can only be traversed once, then it's used up. Streams may also be infinite.

Streams can be sequential or parallel. They start off as one and may be switched to the other using stream.sequential() or stream.parallel(). The actions of a sequential stream occur in serial fashion on one thread. The actions of a parallel stream may be happening all at once on multiple threads.

So, what do you do with a stream? Here is the example given in the package javadocs:

int sumOfWeights = blocks.stream().filter(b -> b.getColor() == RED)
                                  .mapToInt(b -> b.getWeight())
                                  .sum();

Note: The above code makes use of a primitive stream, and a sum() method is only available on primitive streams. There will be more detail on primitive streams shortly.

A stream provides a fluent API for transforming values and performing some action on the results. Stream operations are either "intermediate" or "terminal".

  • Intermediate - An intermediate operation keeps the stream open and allows further operations to follow. The filter and map methods in the example above are intermediate operations. The return type of these methods is Stream; they return a stream to allow chaining of more operations.
  • Terminal - A terminal operation must be the final operation invoked on a stream. Once a terminal operation is invoked, the stream is "consumed" and is no longer usable. The sum method in the example above is a terminal operation.

Usually, dealing with a stream will involve these steps:

  1. Obtain a stream from some source.
  2. Perform one or more intermediate operations.
  3. Perform one terminal operation.

It's likely that you'll want to perform all those steps within one method. That way, you know the properties of the source and the stream and can ensure that it's used properly. You probably don't want to accept arbitrary Stream<T> instances as input to your method because they may have properties you're ill-equipped to deal with, such as being parallel or infinite.

There are a couple more general properties of stream operations to consider:

  • Stateful - A stateful operation imposes some new property on the stream, such as uniqueness of elements, or a maximum number of elements, or ensuring that the elements are consumed in sorted fashion. These are typically more expensive than stateless intermediate operations.
  • Short-circuiting - A short-circuiting operation potentially allows processing of a stream to stop early without examining all the elements. This is an especially desirable property when dealing with infinite streams; if none of the operations being invoked on a stream are short-circuiting, then the code may never terminate.

Here are short, general descriptions for each Stream method. See the javadocs for more thorough explanations.

Intermediate operations:

  • filter - Exclude all elements that don't match a Predicate.
  • map - Perform a one-to-one transformation of elements using a Function.
  • flatMap - Transform each element into zero or more elements by way of another Stream.
  • peek - Perform some action on each element as it is encountered. Primarily useful for debugging.
  • distinct - Exclude all duplicate elements according to their .equals behavior. This is a stateful operation.
  • sorted - Ensure that stream elements in subsequent operations are encountered according to the order imposed by a Comparator. This is a stateful operation.
  • limit - Ensure that subsequent operations only see up to a maximum number of elements. This is a stateful, short-circuiting operation.
  • skip - Ensure that subsequent operations do not see the first n elements. This is a stateful operation.

Terminal operations:

  • forEach - Perform some action for each element in the stream.
  • toArray - Dump the elements in the stream to an array.
  • reduce - Combine the stream elements into one using a BinaryOperator.
  • collect - Dump the elements in the stream into some container, such as a Collection or Map.
  • min - Find the minimum element of the stream according to a Comparator.
  • max - Find the maximum element of the stream according to a Comparator.
  • count - Find the number of elements in the stream.
  • anyMatch - Find out whether at least one of the elements in the stream matches a Predicate. This is a short-circuiting operation.
  • allMatch - Find out whether every element in the stream matches a Predicate. This is a short-circuiting operation.
  • noneMatch - Find out whether zero elements in the stream match a Predicate. This is a short-circuiting operation.
  • findFirst - Find the first element in the stream. This is a short-circuiting operation.
  • findAny - Find any element in the stream, which may be cheaper than findFirst for some streams. This is a short-circuiting operation.

As noted in the javadocs, intermediate operations are lazy. Only a terminal operation will start the processing of stream elements. At that point, no matter how many intermediate operations were included, the elements are then consumed in (usually, but not quite always) a single pass. (Stateful operations such as sorted() and distinct() may require a second pass over the elements.)

Streams try their best to do as little work as possible. There are micro-optimizations such as eliding a sorted() operation when it can determine the elements are already in order. In operations that include limit(x) or skip(n), a stream can sometimes avoid performing intermediate map operations on the elements it knows aren't necessary to determine the result. I'm not going to be able to do the implementation justice here; it's clever in lots of small but significant ways, and it's still improving.

Returning to the concept of parallel streams, it's important to note that parallelism is not free. It's not free from a performance standpoint, and you can't simply swap out a sequential stream for a parallel one and expect the results to be identical without further thought. There are properties to consider about your stream, its operations, and the destination for its data before you can (or should) parallelize a stream. For instance: Does encounter order matter to me? Are my functions stateless? Is my stream large enough and are my operations complex enough to make parallelism worthwhile?

There are primitive-specialized versions of Stream for ints, longs, and doubles:

  • IntStream
  • LongStream
  • DoubleStream

One can convert back and forth between an object stream and a primitive stream using the primitive-specialized map and flatMap functions, among others. To give a few contrived examples:

List<String> strings = Arrays.asList("a", "b", "c");
strings.stream()                    // Stream<String>
       .mapToInt(String::length)    // IntStream
       .asLongStream()              // LongStream
       .mapToDouble(x -> x / 10.0)  // DoubleStream
       .boxed()                     // Stream<Double>
       .mapToLong(x -> 1L)          // LongStream
       .mapToObj(x -> "")           // Stream<String>
       ...

The primitive streams also provide methods for obtaining basic numeric statistics about the stream as a data structure. You can find the count, sum, min, max, and mean of the elements all from one terminal operation.
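
For example, a sketch using IntStream's summaryStatistics() terminal operation:

IntSummaryStatistics stats = IntStream.rangeClosed(1, 100)
                                      .summaryStatistics();
long count = stats.getCount();    // 100
long sum = stats.getSum();        // 5050
int min = stats.getMin();         // 1
int max = stats.getMax();         // 100
double mean = stats.getAverage(); // 50.5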

There are no primitive versions for the rest of the primitive types because it would have required an unacceptable amount of bloat in the JDK. IntStream, LongStream, and DoubleStream were deemed useful enough to include, and streams of other numeric primitives can be represented using these three via widening primitive conversion.

One of the most confusing, intricate, and useful terminal stream operations is collect. It introduces a new interface called Collector. This interface is somewhat difficult to understand, but fortunately there is a Collectors utility class for generating all sorts of useful Collectors. For example:

List<String> strings = values.stream()
                             .filter(...)
                             .map(...)
                             .collect(Collectors.toList());

If you want to put your stream elements into a Collection, Map, or String, then Collectors probably has what you need. It's definitely worthwhile to browse through the javadoc of that class.
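
A couple more quick sketches of commonly useful Collectors:

// Join the elements into one String:
String joined = strings.stream()
                       .collect(Collectors.joining(", "));

// Index the elements by a key (throws if two elements share a key):
Map<Integer, String> byLength = strings.stream()
    .collect(Collectors.toMap(String::length, s -> s));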

Generic type inference improvements

Summary of proposal: JEP 101: Generalized Target-Type Inference

This was an effort to improve the ability of the compiler to determine generic types where it was previously unable to. There were many cases in previous versions of Java where the compiler could not figure out the generic types for a method in the context of nested or chained method invocations, even when it seemed "obvious" to the programmer. Those situations required the programmer to explicitly specify a "type witness". It's a feature of generics that surprisingly few Java programmers know about (I'm saying this based on personal interactions and reading StackOverflow questions). It looks like this:

// In Java 7:
foo(Utility.<Type>bar());
Utility.<Type>foo().bar();

Without the type witnesses, the compiler might fill in <Object> as the generic type, and the code would fail to compile if a more specific type was required instead.

Java 8 improves this situation tremendously. In many more cases, it can figure out a more specific generic type based on the context.

// In Java 8:
foo(Utility.bar());
Utility.foo().bar();

This one is still a work in progress, so I'm not sure how many of the examples listed in the proposal will actually be included for Java 8. Hopefully it's all of them.

java.time

Package summary: java.time

The new date/time API in Java 8 is contained in the java.time package. If you're familiar with Joda Time, it will be really easy to pick up. Actually, I think it's so well-designed that even people who have never heard of Joda Time should find it easy to pick up.

Almost everything in the API is immutable, including the value types and the formatters. No more worrying about exposing Date fields or dealing with thread-local date formatters.

The intermingling with the legacy date/time API is minimal. It was a clean break; the main points of contact are a handful of conversion methods, such as Date.toInstant(), Date.from(Instant), and Calendar.toInstant().

The new API prefers enums over integer constants for things like months and days of the week.

So, what's in it? The package-level javadocs do an excellent job of explaining the additional types. I'll give a brief rundown of some noteworthy parts.

Extremely useful value types:

  • LocalDate, LocalTime, LocalDateTime - dates and times without a time zone
  • ZonedDateTime - a date and time in a specific time zone
  • Instant - a machine-friendly point on the timeline

Less useful value types:

  • Duration and Period - amounts of time
  • Year, Month, YearMonth, MonthDay, and DayOfWeek - partial dates
  • OffsetDateTime and OffsetTime - dates and times with a UTC offset but no named time zone

Other useful types:

  • DateTimeFormatter - for converting datetime objects to strings
  • ChronoUnit - for figuring out the amount of time between two points, e.g. ChronoUnit.DAYS.between(t1, t2)
  • TemporalAdjuster - e.g. date.with(TemporalAdjusters.firstDayOfMonth())

The new value types are, for the most part, supported by JDBC. There are minor exceptions, such as ZonedDateTime which has no counterpart in SQL.
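
A quick sketch of a few of these types in action:

LocalDate today = LocalDate.now();
LocalDate first = today.with(TemporalAdjusters.firstDayOfMonth());

ZonedDateTime release = ZonedDateTime.of(
    LocalDateTime.of(2014, Month.MARCH, 18, 12, 0),
    ZoneId.of("America/Los_Angeles"));

long daysSince = ChronoUnit.DAYS.between(release.toLocalDate(), today);
String formatted = release.format(
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm z"));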

Collections API additions

The fact that interfaces can define default methods allowed the JDK authors to make a large number of additions to the collection API interfaces. Default implementations for these are provided on all the core interfaces, and more efficient or well-behaved overridden implementations were added to all the concrete classes, where applicable.

Here's a list of the new methods:

  • Iterable.forEach(Consumer)
  • Iterator.forEachRemaining(Consumer)
  • Collection.removeIf(Predicate)
  • Collection.spliterator()
  • Collection.stream()
  • Collection.parallelStream()
  • List.sort(Comparator)
  • List.replaceAll(UnaryOperator)
  • Map.forEach(BiConsumer)
  • Map.replaceAll(BiFunction)
  • Map.putIfAbsent(K, V)
  • Map.remove(Object, Object)
  • Map.replace(K, V, V)
  • Map.replace(K, V)
  • Map.computeIfAbsent(K, Function)
  • Map.computeIfPresent(K, BiFunction)
  • Map.compute(K, BiFunction)
  • Map.merge(K, V, BiFunction)
  • Map.getOrDefault(Object, V)

Also, Iterator.remove() now has a default, throwing implementation, which makes it slightly easier to define unmodifiable iterators.

Collection.stream() and Collection.parallelStream() are the main gateways into the stream API. There are other ways to generate streams, but those are going to be the most common by far.

The addition of List.sort(Comparator) is fantastic. Previously, the way to sort an ArrayList was this:

Collections.sort(list, comparator);

That code, which was your only option in Java 7, was frustratingly inefficient. It would dump the list into an array, sort the array, then use a ListIterator to write the sorted contents back into the list.

The default implementation of List.sort(Comparator) still does this, but concrete implementing classes are free to optimize. For instance, ArrayList.sort invokes Arrays.sort on the ArrayList's internal array. CopyOnWriteArrayList does the same.

Performance isn't the only potential gain from these new methods. They can have more desirable semantics, too. For instance, sorting a Collections.synchronizedList() is an atomic operation using list.sort. You can iterate over all its elements as an atomic operation using list.forEach. Previously this was not possible.

Map.computeIfAbsent makes working with multimap-like structures easier:

// Index strings by length:
Map<Integer, List<String>> map = new HashMap<>();
for (String s : strings) {
    map.computeIfAbsent(s.length(),
                        key -> new ArrayList<String>())
       .add(s);
}

// Although in this case the stream API may be a better choice:
Map<Integer, List<String>> map = strings.stream()
    .collect(Collectors.groupingBy(String::length));
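
Map.merge is similarly handy when accumulating a value per key, e.g. counting occurrences (a sketch):

// Count occurrences of each string:
Map<String, Integer> counts = new HashMap<>();
for (String s : strings) {
    counts.merge(s, 1, Integer::sum);
}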

Concurrency API additions

ForkJoinPool.commonPool() is the structure that handles all parallel stream operations. It is intended as an easy, good way to obtain a ForkJoinPool/ExecutorService/Executor when you need one.

ConcurrentHashMap<K, V> was completely rewritten. Internally it looks nothing like the version that was in Java 7. Externally it's mostly the same, except it has a large number of bulk operation methods: many forms of reduce, search, and forEach.

ConcurrentHashMap.newKeySet() provides a concurrent java.util.Set implementation. It is essentially another way of writing Collections.newSetFromMap(new ConcurrentHashMap<T, Boolean>()).

StampedLock is a new lock implementation that can probably replace ReentrantReadWriteLock in most cases. It performs better than RRWL when used as a plain read-write lock. It also provides an API for "optimistic reads", where you obtain a weak, cheap version of a read lock, do the read operation, then check afterwards if your lock was invalidated by a write. There's more detail about this class and its performance in a set of slides put together by Heinz Kabutz (starting about half-way through the set of slides): "Phaser and StampedLock Presentation"
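
The optimistic-read pattern looks roughly like this (a sketch along the lines of the example in the class's javadoc):

private final StampedLock lock = new StampedLock();
private double x, y;

public double distanceFromOrigin() {
    long stamp = lock.tryOptimisticRead();  // cheap, does not block writers
    double curX = x, curY = y;              // read the state optimistically
    if (!lock.validate(stamp)) {            // did a write happen in between?
        stamp = lock.readLock();            // fall back to a real read lock
        try {
            curX = x;
            curY = y;
        } finally {
            lock.unlockRead(stamp);
        }
    }
    return Math.sqrt(curX * curX + curY * curY);
}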

CompletableFuture<T> is a nice implementation of the Future interface that provides a ton of methods for performing (and chaining together) asynchronous tasks. It relies on functional interfaces heavily; lambdas are a big reason this class was worth adding. If you are currently using Guava's Future utilities, such as Futures, ListenableFuture, and SettableFuture, you may want to check out CompletableFuture as a potential replacement.
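
A sketch of what that chaining looks like (fetchUserData, render, and log are hypothetical methods):

CompletableFuture
    .supplyAsync(() -> fetchUserData(userId))      // run on the common pool
    .thenApply(data -> render(data))               // transform the result
    .thenAccept(html -> System.out.println(html))  // consume the result
    .exceptionally(e -> { log(e); return null; }); // recover from failures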

IO/NIO API additions

Most of these additions give you ways to obtain a java.util.stream.Stream from files and InputStreams. They're a bit different from the streams you obtain from regular collections, though. For one, they may throw UncheckedIOException. Also, they are streams for which calling the stream.close() method is necessary. Streams implement AutoCloseable and can therefore be used in try-with-resources statements. Streams also have an onClose(Runnable) intermediate operation that I didn't list in the earlier section about streams. It allows you to attach handlers to a stream that execute when it is closed. Here is an example:

// Print the lines in a file, then "done"
// (UTF_8 is a static import from java.nio.charset.StandardCharsets)
try (Stream<String> lines = Files.lines(path, UTF_8)) {
    lines.onClose(() -> System.out.println("done"))
         .forEach(System.out::println);
}

Reflection and annotation changes

Annotations are allowed in more places, e.g. List<@Nullable String>. The biggest impact of this is likely to be for static analysis tools such as Sonar and FindBugs.

This JSR 308 website does a better job of explaining the motivation for these changes than I could possibly do: "Type Annotations (JSR 308) and the Checker Framework"

Nashorn JavaScript Engine

Summary of proposal: JEP 174: Nashorn JavaScript Engine

I did not experiment with Nashorn so I know very little beyond what's described in the proposal above. Short version: It's the successor to Rhino. Rhino is old and a little bit slow, and the developers decided they'd be better off starting from scratch.

Other miscellaneous additions to java.lang, java.util, and elsewhere

There is too much there to talk about, but I'll pick out a few noteworthy items.

ThreadLocal.withInitial(Supplier<T>) makes declaring thread-local variables with initial values much nicer. Previously you would supply an initial value like this:

ThreadLocal<List<String>> strings =
    new ThreadLocal<List<String>>() {
        @Override
        protected List<String> initialValue() {
             return new ArrayList<>();
        }
    };

Now it's like this:

ThreadLocal<List<String>> strings =
    ThreadLocal.withInitial(ArrayList::new);

Optional<T> appears in the stream API as the return value for methods like min/max, findFirst/Any, and some forms of reduce. It's used because there might not be any elements in the stream, and it provides a fluent API for handling the "some result" versus "no result" cases. You can provide a default value, throw an exception, or execute some action only if the result exists.
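
For example (a sketch):

Optional<String> longest = strings.stream()
    .max(Comparator.comparingInt(String::length));

String result = longest.orElse("(none)");  // provide a default value
longest.ifPresent(System.out::println);    // act only if a result exists
String required = longest.orElseThrow(IllegalStateException::new);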

It's very, very similar to Guava's Optional class. It's nothing at all like Option in Scala, nor is it trying to be, and the name similarity there is purely coincidental.

Aside: it's interesting that Java 8's Optional and Guava's Optional ended up being so similar, despite the absurd amount of debate that occurred over its addition to both libraries.

"FYI.... Optional was the cause of possibly the single greatest conflagration on the internal Java libraries discussion lists ever."

Kevin Bourrillion in response to "Some new Guava classes targeted for release 10"

"On a purely practical note, the discussions surrounding Optional have exceeded its design budget by several orders of magnitude."

Brian Goetz in response to "Optional require(s) NonNull"

StringJoiner and String.join(...) are long, long overdue. They are so long overdue that the vast majority of Java developers likely have already written or have found utilities for joining strings, but it is nice for the JDK to finally provide this itself. Everyone has encountered situations where joining strings is required, and it is a Good Thing™ that we can now express that through a standard API that every Java developer (eventually) will know.
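
For example:

String path = String.join("/", "usr", "local", "bin");  // "usr/local/bin"
String csv = String.join(", ", Arrays.asList("a", "b", "c"));

StringJoiner joiner = new StringJoiner(", ", "[", "]");
joiner.add("alpha").add("beta").add("gamma");
String bracketed = joiner.toString();  // "[alpha, beta, gamma]"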

Comparator provides some very nice new methods for doing chained comparisons and field-based comparisons. For example:

people.sort(
    Comparator.comparing(Person::getLastName)
        .thenComparing(Person::getFirstName)
        .thenComparing(
            Person::getEmailAddress,
            Comparator.nullsLast(CASE_INSENSITIVE_ORDER)));

These additions provide good, readable shorthand for complex sorts. Many of the use cases served by Guava's ComparisonChain and Ordering utility classes are now served by these JDK additions. And for what it's worth, I think the JDK versions read better than the functionally-equivalent versions expressed in Guava-ese.

More?

There are lots of various small bug fixes and performance improvements that were not covered in this post. But they are appreciated too!

This post was intended to cover every single language-level and API-level change coming in Java 8. If any were missed, it was an error that should be corrected. Please let me know if you discover an omission. You can contact me via e-mail.

January 8, 2013

Storage worries

In the 1980s, high-tech companies stored information about their customers on their sophisticated and high-cost computer equipment. Back then, such practices were the exception outside of relatively large companies. Thirty years later, it's so commonplace that there are numerous services to store customer data for you "in the cloud."

A layperson would be excused for thinking that in 2012, questions about how to store data on computers had been worked out. The established options have indeed remained consistent for years, decades even. The commonplace language for structuring, storing, finding, and fetching data, SQL, has been with us a very long time. So long, in fact, that layers of conventional thinking have built up around it. More recently, a healthy desire to shake off that conventional thinking and re-imagine data storage has emerged.

Data storage is enjoying a deserved renaissance thanks to the research and hands-on work of a number of people who together are informally known as the NoSQL movement.

The good stuff

The objectives of each NoSQL implementation vary, but generally, the appeal to developers (like us) comes from these common advantages versus traditional SQL:

  • Thank goodness, there is no SQL language to deal with. The APIs are purpose-built for modern notions of structure, store, find, and retrieve. That usually means one fewer layer of translation between stored data and live in-memory data.
  • Clients and servers generally communicate over human-readable protocols like HTTP. This makes us happy because we know this protocol and we don't know old database protocols like TDS.
  • As our server(s) reach their request-processing limits, we can theoretically add more with relatively little pain.
  • Data is automatically duplicated according to tunable rules so that we don't panic (severely) when a server blows up.
  • Some implementations can distribute complex queries to multiple servers automatically.

These advantages need to be evaluated in context. For us, the context is building technology solutions that meet business objectives for our clients. From that perspective, they reduce to:

  • A fresh approach to integration with applications, which in some cases may decrease programmer effort. Decreased effort translates into lower implementation costs.
  • Less cumbersome resolution to future scale events.

Our clients rightfully don't care how hip using a NoSQL server makes us feel. They don't care precisely how NoSQL may reduce the pain of future scaling, but knowing that future scale is somewhat less painful does give them some comfort.

The not-so-good stuff

This context sheds some light on disadvantages that are often downplayed when building an application in-house. As mentioned earlier, internal teams are afforded the luxury (whether or not their bosses would necessarily agree) of adopting new technologies with less consideration of risk.

Higher than average risk-aversion and a horizon of hand-off to an internal team mean we consider the following:

  • There are few standards (yet) with NoSQL.
  • Community popularity is slowly coalescing but volatility remains.
  • Finding the right team in the future may be difficult for our client.
  • Reality is that most clients don't have a reasonable expectation of scale that suggests NoSQL.
  • Real-time performance characteristics may be an issue.

Let's start with scale. The lure of smooth, no-pain scaling will appeal to anyone who has been through efforts to scale traditional databases. Talking about scale is important even early on in a project's lifecycle. However, the usage level where scaling a traditional database server by "throwing better hardware at it" stops being practical is quite high. A single modern high-performance server can process a tremendous amount of user data, at least from the point of view of a start-up company.

Favoring ease of scalability in selection of a data storage platform makes sense if the scale plans are real. But if scale plans are imaginary, hopeful, ambitious, or wishful thinking, ease of scalability is not as important. Everyone wants to believe they will be the next Facebook. But more likely you'll be the next site with ten thousand users working really hard to get to your first one hundred thousand.

In that 10,000 users to 100,000 users bracket, many applications' entire data set fits in system memory. Performance is going to be reasonably quick (at least in terms of basic get and put operations) with any kind of database.

How about ease of development? Working with NoSQL databases can be more "fluent" because the native APIs of NoSQL discard database legacy and focus on basic verbs like put and get. As a result, developer efficiency may be slightly improved with a NoSQL platform.

Traditional relational databases are encumbered by an impedance mismatch between in-memory objects and relational data entities. This mismatch led to object-relational mapping tools (ORMs) and subsequent debates between fans and critics of ORMs. However, lightweight ORMs reduce the most commonplace database operations into interfaces as fluent as those offered by NoSQL.

Ultimately, it's essentially a wash in terms of comparing the level of effort. NoSQL may enjoy a small advantage in the form of developer happiness: most developers innately like working with new, cool things.

Where do we land?

Clearly, the decision is predicated on the specifics of the application. Systems with pre-existing large scale are generally well-fitted to NoSQL. Systems with anticipated but undefined analytical needs are generally better served by SQL. Often we select traditional SQL databases because we want clients to be well-positioned to deal with unknowns and traditional databases are not performance slouches.

In other words, with fairly vanilla application requirements and scale targets, a traditional SQL database avoids some degree of risk, and risk moderation is compelling even in small doses.

If and when the client wants to build an internal team, it will be easier to find developers with the necessary experience.

Traditional platforms are more stable. For example, although the risk of a popular NoSQL platform being abandoned is low, there is virtually no risk of MySQL, Postgres, or Microsoft SQL Server being outright abandoned in the next decade. That sort of huge horizon is unnecessary, but still comforting.

Finally, scaling a traditional database may be slightly more difficult than a NoSQL option, but not sufficiently to be a factor unless scale-growth concerns are well justified.

January 7, 2013

Pragmatism with Flavor

We build web and mobile applications, and we've been doing so for a little over fifteen years. To our delight, technology has evolved and improved in many ways over that time.

Technology evolution can be cyclical, with some branches looping back to the past with surprising vigor and without much self-awareness. Platforms eschew threading and then re-invent threading anew. Client-side code is passé and then the client becomes the preferred runtime environment.

As software developers, we're technophiles, so we enjoy these cycles and quite often find humor listening to the energy spent arguing on either side of issues. It's a fun and funny business to be in.

Through our fifteen years, we've been charged with helping clients get business done with technology. Focusing on the metrics that ultimately drive the business, often the bottom line, has provided a reality check, a counter-weight to the desire to consume all new technologies. Over time, we've imbibed that reality check serum, and now it's part of our instinctual technical point of view.

We love technology, but...

Sure we still get excited and animated about a new JavaScript library, a new CSS trick, a new device, or a new data store. We've learned to think about how this really fits with the business needs of our clients. Is putting this hot new technology to use on a client project for our technical interest or is it for their business interest? Does it help our clients sell more widgets or deliver with fewer headaches? Or do we want to play with this new toy simply because playing with new stuff is fun?

Not to downplay developer happiness. It's important to us. But a newly minted developer given access to Hacker News can develop ADD in a week's time, waste countless hours reading the opinions of professional and amateur technology opinion-makers, and nervously fidget about selecting the hippest framework and libraries to (briefly) avoid the scorn of hacker pop culture.

Anyone who has been around the web development block is going to tell you that they select technologies with some balance between coolness and pragmatism. The web development continuum has sound, safe, but stodgy on one side and shiny-obsession at the risk of unknown stability on the other. Few developers will find themselves at an extreme or precisely in the middle, and we're no exception.

Our business model nudges us slightly toward the sound, safe (stodgy?) side. We work with clients to get them off the ground with a platform to build their business. As the technical partner, we want to be real partners in the business, making sure the technology helps achieve the business objectives. Incurring additional risk by jumping on the latest and greatest technology is most often not the best route. Our clients want us to make it safe to innovate. They have enough worries.

Skewing us toward pragmatism is a combination of factors:

  • Our clients are not about technology; they are about running their business using technology. So even if that business doesn't exist as anything but a web site, we want them to avoid unnecessary technology risks.
  • Although we spend plenty of time researching and developing with new technologies based on desire and need, we don't want our clients to be guinea pigs (unless that's the nature of their business, of course).
  • We want our clients' business to be supported by technology that is fast, lightweight, and pain-free to administrate.

Easy stuff should be easy

We cut our teeth building eharmony.com back when a server with 128MB of memory was a monster. This gives us a point of view about servers and modern web application development that isn't terribly common these days: each server in your cluster should be able to process a significant amount of user load. On modern hardware, a trivial dynamic request should run as close to 0 milliseconds of server-side processing as possible; and it is unacceptable for a request to take 200+ milliseconds unless it's doing some heavy lifting.

After all, napkin math tells us that if requests take 200 milliseconds, only five requests per second would saturate a processor core. Those requests had better be worth it!

A site with a million users is much easier to manage today than it was in 2000. Today, a million users should not require dozens of servers unless the business is by its nature compute-heavy.

As with premature optimization, premature scaling is probably unnecessary and may be harmful. Most importantly, you don't necessarily know if you're scaling in the right direction. On the other hand it's silly to dogmatically avoid simple patterns that reap the easiest initial salve for scale: processing requests quickly.

Imagine your prototyping is over, and you're building a production system. You're in the thick of it and about to write a small chunk of server-side code. You estimate it will take 5 minutes to implement a clean but low-performance implementation; 7 minutes to implement clean and quickly-executing code; and 8 hours to do it optimally.

Which do you choose? We generally go for the 7-minute option. Perhaps it seems a no-brainer in this context, but you'd be surprised how many people choose the 5-minute option because it appears to save money. Yes, making the same decision several hundreds of times throughout a project adds up to measurable additional time. But our experience is that routinely selecting the "7-minute" option pays off in the long run.

Caveat: with web application development, your foundation technology choices determine if you're going with the "5-minute" approach to everything or the "7-minute" approach to everything. You can't mix and match.

With respect to software testing, most developers acknowledge that finding and resolving a defect before integration tests saves time and effort. It takes a bit more time for each developer, but it saves the effort of reporting, tracking, and dealing with bugs that leak beyond the developer's sieve. Curiously, the same sense doesn't necessarily apply to tuning and platform selection.

For production systems, we contend it's often better to select a platform that saves you money in the long run even if it means a few more dollars spent up-front.

Think of the money!

Money plays a big part in this. Allow me to be brutally clear: the more our clients' budgets go to Amazon or Rackspace, the less budget goes to continued development time. Development time is how we earn money. Besides, more development time means the client can be more agile with functionality and has a greater potential for success. So it's not an entirely selfish position.

Hosting is a necessary cost, but if you find yourself spinning up a second server when you have only a few hundred users, you might ask yourself why a CPU capable of processing billions of operations per second is brought to its knees by a few users submitting forms, placing orders, and communicating with one another.

Amazon is more than happy to accommodate you if you're not interested in asking this question. Jeff Bezos loves web application inefficiency.

Mixing technologies for a project can feel like concocting the right blend of chemicals. Some work well together and others clash. Project needs vary, of course, so technology chemistries vary in turn. Barring important indicators, our default assumptions steer us toward a mixture that aims to provide breathing room for scale together with a platform for reasonably quick evolution by the development team.

However, a severe budget limitation may tip toward a platform with ready-to-use open source components that can be leveraged as-is or with minimal tweaks. A need to on-board an in-house development team composed of recent college graduates (tempered with an understanding of performance and scale economics) pushes for Ruby/Rails, ensuring a talent pool from which to select.

Playing it safe

Building technology for non-technical clients pressures us to be mindful of that unknown future team. By comparison, an in-house development team starting a project may follow a more daring technology trajectory established by the technical director, especially if the director has a bunch of cohorts he or she can bring on board.

The clients we support often do not have an in-house team and must be prepared to eventually source that in-house team from the client's regional talent pool. This eventual hand-off to another team, in whole or in part, means that even if a client has a desire to dabble in new technologies, we still prefer to select those that have credible momentum and widespread usage. Vanilla node.js, MongoDB, Cassandra, and Apache Cordova for example. But we're not going to put a client on Meteor.js or MySQL's new NoSQL-like cluster. At least not yet. Similarly, we're not going to build key components on a platform for which it can be difficult or exceptionally expensive to source developers, such as Scala, unless the other project variables conspire accordingly.

Technology evolves and the number of people voicing their opinions keeps growing (welcome to this blog, by the way!) but getting work done requires shutting down the news feed and writing code. When working with other people's money to implement their ideas, we don't want to be technology early adopters. We'll satisfy those cravings after-hours.