Java Forum / General / March 2007
Efficiently sequencing a stream of objects
Chris - 21 Mar 2007 21:03 GMT Does anyone know of an efficient algorithm for merging many streams of objects?
Suppose I have N streams of words. Each stream returns words in alphabetical sequence. For example:
stream 0: aardvark, bear, dog stream 1: cat, groundhog stream 3: eagle, fox
I want the output to be: aardvark, bear, cat, dog, eagle, fox, groundhog.
The number of words can be large (in the millions) and the number of streams can also be large (in the thousands).
The brute force approach is to create an array of streams, get the lowest one from each, and output the lowest word found. Repeat until all streams exhausted. The problem is that the running time is O(N * M), where N is the number of streams and M is the *total number of elements across all streams*. This is obviously not scalable.
Eric Sosman - 21 Mar 2007 21:52 GMT > Does anyone know of an efficient algorithm for merging many streams of > objects? [quoted text clipped - 17 lines] > where N is the number of streams and M is the *total number of elements > across all streams*. This is obviously not scalable. Merge pairs of streams, then merge pairs of merged streams, then merge pairs of twice-merged streams, ... Once the tree is set up and until it begins to collapse as the streams start to run out, you'll need about lg(M) comparisons per item emitted. Use K-way instead of two-way merges ("Pair up by threes" -- L.P. Berra) and you'll use about (K-1)*ln(M)/ln(K) comparisons per item, but lower overhead might compensate for the increased comparison count.
See also "replacement selection," which is not exactly what you're trying to do but is closely related.
 Signature Eric Sosman esosman@acm-dot-org.invalid
Patricia Shanahan - 21 Mar 2007 22:09 GMT >> Does anyone know of an efficient algorithm for merging many streams of >> objects? [quoted text clipped - 29 lines] > See also "replacement selection," which is not exactly what > you're trying to do but is closely related. Why this rather than java.util.PriorityQueue? The two approaches have he same computational complexity, but the PriorityQueue implementation should be much simpler.
Patricia
Eric Sosman - 21 Mar 2007 22:26 GMT >> Merge pairs of streams, then merge pairs of merged streams, >> then merge pairs of twice-merged streams, ... > > Why this rather than java.util.PriorityQueue? The two approaches have he > same computational complexity, but the PriorityQueue implementation > should be much simpler. PriorityQueue has several significant disadvantages, chief among them being that <font size=tiny style=shamefaced> I didn't think of it ... </font>
 Signature Eric Sosman esosman@acm-dot-org.invalid
Mark Thornton - 21 Mar 2007 22:53 GMT >>> Merge pairs of streams, then merge pairs of merged streams, >>> then merge pairs of twice-merged streams, ... [quoted text clipped - 6 lines] > among them being that <font size=tiny style=shamefaced> I didn't > think of it ... </font> I think the biggest failing of PriorityQueue was its absence prior to jdk 1.5 (aka 5).
Mark Thornton
Patricia Shanahan - 21 Mar 2007 22:05 GMT > Does anyone know of an efficient algorithm for merging many streams of > objects? [quoted text clipped - 17 lines] > where N is the number of streams and M is the *total number of elements > across all streams*. This is obviously not scalable. java.util.PriorityQueue
Patricia
Chris - 22 Mar 2007 02:28 GMT > java.util.PriorityQueue I kind of thought that might help, but couldn't think of a way to implement it efficiently. In a burst of caffeine, though, I produced the following code. Thanks for the inspiration. This is elegant. The only thing I don't like about it is all the casting required, but that's the cost of using Collections.
/** * Wraps multiple Iterators, and returns the Objects in the Iterators * in the correct, ascending sequence. Assumes that the Iterators * contain Objects that implement the Comparable interface. */ public class MergingIterator implements Iterator { private PriorityQueue queue = new PriorityQueue();
public void add(Iterator it) { if (!it.hasNext()) return; // do nothing, do not add
Carrier carrier = new Carrier(); carrier.it = it; carrier.next = (Comparable)it.next(); queue.offer(carrier); }
public boolean hasNext() { return !queue.isEmpty(); }
public Object next() { Carrier carrier = (Carrier)queue.poll(); Comparable next = carrier.next;
// if there's another object to be had from this stream, insert it. // if not, the queue just gets smaller if (carrier.it.hasNext()) { carrier.next = (Comparable) carrier.it.next(); queue.offer(carrier); }
return next; }
public void remove() { throw new UnsupportedOperationException(); }
private class Carrier implements Comparable { Iterator it; Comparable next;
public int compareTo(Object other) { return next.compareTo(((Carrier)other).next); } } }
Patricia Shanahan - 22 Mar 2007 02:49 GMT >> java.util.PriorityQueue > [quoted text clipped - 3 lines] > thing I don't like about it is all the casting required, but that's the > cost of using Collections. Nice. Take a look at generics. You might be able to do something to reduce the casting with the 1.5 collections.
Patricia
Chris - 22 Mar 2007 02:56 GMT >>> java.util.PriorityQueue >> [quoted text clipped - 8 lines] > > Patricia Generics, unfortunately, are worthless for performance improvements. They provide only compile-time type checking. At runtime, all the casts are still there.
Piotr Kobzda - 22 Mar 2007 03:22 GMT > Generics, unfortunately, are worthless for performance improvements. > They provide only compile-time type checking. At runtime, all the casts > are still there. Not all cast are there. Compare your version with "genericified" one (attached below).
piotr
 Signature public class MergingIterator<E extends Comparable<? super E>> implements Iterator<E> { private PriorityQueue<Carrier> queue = new PriorityQueue<Carrier>();
public void add(Iterator<? extends E> it) { if (!it.hasNext()) return; // do nothing, do not add
Carrier carrier = new Carrier(); carrier.it = it; carrier.next = it.next(); queue.offer(carrier); }
public boolean hasNext() { return !queue.isEmpty(); }
public E next() { Carrier carrier = queue.poll(); E next = carrier.next;
// if there's another object to be had from this stream, insert it. // if not, the queue just gets smaller if (carrier.it.hasNext()) { carrier.next = carrier.it.next(); queue.offer(carrier); }
return next; }
public void remove() { throw new UnsupportedOperationException(); }
private class Carrier implements Comparable<Carrier> { Iterator<? extends E> it; E next;
public int compareTo(Carrier other) { return next.compareTo(other.next); } } }
Piotr Kobzda - 22 Mar 2007 09:10 GMT >> Generics, unfortunately, are worthless for performance improvements. >> They provide only compile-time type checking. At runtime, all the >> casts are still there. > > Not all cast are there. Compare your version with "genericified" one > (attached below). Well, I compared them instruction by instruction, and Chris is right, all casts are still there. Not exactly in the same places, but effectively all casts from original version are performed in generic code also.
I assumed at first, that generic version of Carrier will take away one cast, unfortunately, that cast is moved into synthetic compareTo(Objcet) method generated to fulfill all needs of parameterized Comparator implementation.
Maybe planned the next generation of generics will improve that somehow: http://tech.puredanger.com/java7#reified
However, even without any performance improvements, there are still other benefits of generic version, like clearer implementation (IMHO), constrains previously expressed in comments only, are now checked by the compiler, etc.
piotr
Chris Uppal - 22 Mar 2007 13:02 GMT > http://tech.puredanger.com/java7#reified That page is an interesting -- if depressing -- roundup.
-- chris
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|