Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / March 2007

Tip: Looking for answers? Try searching our database.

Efficiently sequencing a stream of objects

Thread view: 
Chris - 21 Mar 2007 21:03 GMT
Does anyone know of an efficient algorithm for merging many streams of
objects?

Suppose I have N streams of words. Each stream returns words in
alphabetical sequence. For example:

stream 0: aardvark, bear, dog
stream 1: cat, groundhog
stream 3: eagle, fox

I want the output to be:
aardvark, bear, cat, dog, eagle, fox, groundhog.

The number of words can be large (in the millions) and the number of
streams can also be large (in the thousands).

The brute force approach is to create an array of streams, get the
lowest one from each, and output the lowest word found. Repeat until all
streams exhausted. The problem is that the running time is O(N * M),
where N is the number of streams and M is the *total number of elements
across all streams*. This is obviously not scalable.
Eric Sosman - 21 Mar 2007 21:52 GMT
> Does anyone know of an efficient algorithm for merging many streams of
> objects?
[quoted text clipped - 17 lines]
> where N is the number of streams and M is the *total number of elements
> across all streams*. This is obviously not scalable.

    Merge pairs of streams, then merge pairs of merged streams,
then merge pairs of twice-merged streams, ...  Once the tree is
set up and until it begins to collapse as the streams start to
run out, you'll need about lg(M) comparisons per item emitted.
Use K-way instead of two-way merges ("Pair up by threes" -- L.P.
Berra) and you'll use about (K-1)*ln(M)/ln(K) comparisons per
item, but lower overhead might compensate for the increased
comparison count.

    See also "replacement selection," which is not exactly what
you're trying to do but is closely related.

Signature

Eric Sosman
esosman@acm-dot-org.invalid

Patricia Shanahan - 21 Mar 2007 22:09 GMT
>> Does anyone know of an efficient algorithm for merging many streams of
>> objects?
[quoted text clipped - 29 lines]
>     See also "replacement selection," which is not exactly what
> you're trying to do but is closely related.

Why this rather than java.util.PriorityQueue? The two approaches have he
same computational complexity, but the PriorityQueue implementation
should be much simpler.

Patricia
Eric Sosman - 21 Mar 2007 22:26 GMT
>>     Merge pairs of streams, then merge pairs of merged streams,
>> then merge pairs of twice-merged streams, ...  
>
> Why this rather than java.util.PriorityQueue? The two approaches have he
> same computational complexity, but the PriorityQueue implementation
> should be much simpler.

    PriorityQueue has several significant disadvantages, chief
among them being that <font size=tiny style=shamefaced> I didn't
think of it ... </font>

Signature

Eric Sosman
esosman@acm-dot-org.invalid

Mark Thornton - 21 Mar 2007 22:53 GMT
>>>     Merge pairs of streams, then merge pairs of merged streams,
>>> then merge pairs of twice-merged streams, ...  
[quoted text clipped - 6 lines]
> among them being that <font size=tiny style=shamefaced> I didn't
> think of it ... </font>

I think the biggest failing of PriorityQueue was its absence prior to
jdk 1.5 (aka 5).

Mark Thornton
Patricia Shanahan - 21 Mar 2007 22:05 GMT
> Does anyone know of an efficient algorithm for merging many streams of
> objects?
[quoted text clipped - 17 lines]
> where N is the number of streams and M is the *total number of elements
> across all streams*. This is obviously not scalable.

java.util.PriorityQueue

Patricia
Chris - 22 Mar 2007 02:28 GMT
> java.util.PriorityQueue

I kind of thought that might help, but couldn't think of a way to
implement it efficiently. In a burst of caffeine, though, I produced the
following code. Thanks for the inspiration. This is elegant. The only
thing I don't like about it is all the casting required, but that's the
cost of using Collections.

/**
 * Wraps multiple Iterators, and returns the Objects in the Iterators
 * in the correct, ascending sequence. Assumes that the Iterators
 * contain Objects that implement the Comparable interface.
 */
public class MergingIterator implements Iterator {
  private PriorityQueue queue = new PriorityQueue();

  public void add(Iterator it) {
    if (!it.hasNext())
      return; // do nothing, do not add

    Carrier carrier = new Carrier();
    carrier.it = it;
    carrier.next = (Comparable)it.next();
    queue.offer(carrier);
  }

  public boolean hasNext() {
    return !queue.isEmpty();
  }

  public Object next() {
     Carrier carrier = (Carrier)queue.poll();
     Comparable next = carrier.next;

     // if there's another object to be had from this stream, insert it.
     // if not, the queue just gets smaller
     if (carrier.it.hasNext()) {
       carrier.next = (Comparable) carrier.it.next();
       queue.offer(carrier);
     }

    return next;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }

  private class Carrier implements Comparable {
    Iterator it;
    Comparable next;

    public int compareTo(Object other) {
      return next.compareTo(((Carrier)other).next);
    }
  }
}
Patricia Shanahan - 22 Mar 2007 02:49 GMT
>> java.util.PriorityQueue
>
[quoted text clipped - 3 lines]
> thing I don't like about it is all the casting required, but that's the
> cost of using Collections.

Nice. Take a look at generics. You might be able to do something to
reduce the casting with the 1.5 collections.

Patricia
Chris - 22 Mar 2007 02:56 GMT
>>> java.util.PriorityQueue
>>
[quoted text clipped - 8 lines]
>
> Patricia

Generics, unfortunately, are worthless for performance improvements.
They provide only compile-time type checking. At runtime, all the casts
are still there.
Piotr Kobzda - 22 Mar 2007 03:22 GMT
> Generics, unfortunately, are worthless for performance improvements.
> They provide only compile-time type checking. At runtime, all the casts
> are still there.

Not all cast are there.  Compare your version with "genericified" one
(attached below).

piotr

Signature

public class MergingIterator<E extends Comparable<? super E>> implements
Iterator<E> {
  private PriorityQueue<Carrier> queue = new PriorityQueue<Carrier>();

  public void add(Iterator<? extends E> it) {
    if (!it.hasNext())
      return; // do nothing, do not add

    Carrier carrier = new Carrier();
    carrier.it = it;
    carrier.next = it.next();
    queue.offer(carrier);
  }

  public boolean hasNext() {
    return !queue.isEmpty();
  }

  public E next() {
     Carrier carrier = queue.poll();
     E next = carrier.next;

     // if there's another object to be had from this stream, insert it.
     // if not, the queue just gets smaller
     if (carrier.it.hasNext()) {
       carrier.next = carrier.it.next();
       queue.offer(carrier);
     }

     return next;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }

  private class Carrier implements Comparable<Carrier> {
    Iterator<? extends E> it;
    E next;

    public int compareTo(Carrier other) {
      return next.compareTo(other.next);
    }
  }
}

Piotr Kobzda - 22 Mar 2007 09:10 GMT
>> Generics, unfortunately, are worthless for performance improvements.
>> They provide only compile-time type checking. At runtime, all the
>> casts are still there.
>
> Not all cast are there.  Compare your version with "genericified" one
> (attached below).

Well, I compared them instruction by instruction, and Chris is right,
all casts are still there.  Not exactly in the same places, but
effectively all casts from original version are performed in generic
code also.

I assumed at first, that generic version of Carrier will take away one
cast, unfortunately, that cast is moved into synthetic compareTo(Objcet)
method generated to fulfill all needs of parameterized Comparator
implementation.

Maybe planned the next generation of generics will improve that somehow:
http://tech.puredanger.com/java7#reified

However, even without any performance improvements, there are still
other benefits of generic version, like clearer implementation (IMHO),
constrains previously expressed in comments only, are now checked by the
compiler, etc.

piotr
Chris Uppal - 22 Mar 2007 13:02 GMT
> http://tech.puredanger.com/java7#reified

That page is an interesting -- if depressing -- roundup.

   -- chris


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.