Skip to main content


Showing posts from July, 2014

Testing code for excessively large inputs

When writing unit tests we mostly focus on business correctness. We do our best to exercise happy path and all edge cases. We sometimes microbenchmark and measure throughput. But one aspect that is often missed is how our code behaves when the input is excessively large? We test how we handle normal input files, malformed files, empty files, missing files... but what about insanely large input files?

Let's start from a real-life use case. You were given a task to implement GPX (GPS Exchange Format, basically XML) to JSON transformation. I chose GPX for no particular reason, it's just another XML format that you might have come across e.g. when recording your hike or bicycle ride with GPS receiver. Also I thought it will be nice to use some standard rather than yet another "people database" in XML. Inside GPX file there are hundreds of flat <wpt/> entries, each one representing one point in space-time:

<gpx> <wpt lat="42.438878" lon="-…

Building extremely large in-memory InputStream for testing purposes

For some reason I needed extremely large, possibly even infinite InputStream that would simply return the same byte[] over and over. This way I could produce insanely big stream of data by repeating small sample. Sort of similar functionality can be found in Guava: Iterable<T> Iterables.cycle(Iterable<T>) and Iterator<T> Iterators.cycle(Iterator<T>). For example if you need an infinite source of 0 and 1, simply say Iterables.cycle(0, 1) and get 0, 1, 0, 1, 0, 1... infinitely. Unfortunately I haven't found such utility for InputStream, so I jumped into writing my own. This article documents many mistakes I made during that process, mostly due to overcomplicating and overengineering straightforward solution.

We don't really need an infinite InputStream, being able to create very large one (say, 32 GiB) is enough. So we are after the following method:

public static InputStream repeat(byte[] sample, int times) It basically takes sample array of bytes and re…

Grouping, sampling and batching - custom collectors in Java 8

Continuing first article, this time we will write some more useful custom collectors: for grouping by given criteria, sampling input, batching and sliding over with fixed size window.

Grouping (counting occurrences, histogram) Imagine you have a collection of some items and you want to calculate how many times each item (with respect to equals()) appears in this collection. This can be achieved using CollectionUtils.getCardinalityMap() from Apache Commons Collections. This method takes an Iterable<T> and returns Map<T, Integer>, counting how many times each item appeared in the collection. However sometimes instead of using equals() we would like to group by an arbitrary attribute of input T. For example say we have a list of Person objects and we would like to compute the number of males vs. females (i.e. Map<Sex, Integer>) or maybe an age distribution. There is a built-in collector Collectors.groupingBy(Function<T, K> classifier) - however it returns a map fr…

Introduction to writing custom collectors in Java 8

Java 8 introduced the concept of collectors. Most of the time we barely use factory methods from Collectors class, e.g. collect(toList()), toSet() or maybe something more fancy like counting() or groupingBy(). Not many of us actually bother to look how collectors are defined and implemented. Let's start from analysing what Collector<T, A, R> really is and how it works.

Collector<T, A, R> works as a "sink" for streams - stream pushes items (one after another) into a collector, which should produce some "collected" value in the end. Most of the time it means building a collection (like toList()) by accumulating elements or reducing stream into something smaller (e.g. counting() collector that barely counts elements). Every collector accepts items of type T and produces aggregated (accumulated) value of type R (e.g. R = List<T>). Generic type A simply defines the type of intermediate mutable data structure that we are going to use to accumulate it…

Turning recursive file system traversal into Stream

When I was learning programming, back in the days of Turbo Pascal, I managed to list files in directory using FindFirst, FindNext and FindClose functions. First I came up with a procedure printing contents of a given directory. You can imagine how proud I was to discover I can actually call that procedure from itself to traverse file system recursively. Well, I didn't know the term recursion back then, but it worked. Similar code in Java would look something like this:

public void printFilesRecursively(final File folder) { for (final File entry : listFilesIn(folder)) { if (entry.isDirectory()) { printFilesRecursively(entry); } else { System.out.println(entry.getAbsolutePath()); } } } private File[] listFilesIn(File folder) { final File[] files = folder.listFiles(); return files != null ? files : new File[]{}; } Didn't know File.listFiles() can return null, did ya? That's how it signals I/O errors, like if IOE…