Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / December 2007

Tip: Looking for answers? Try searching our database.

Symbolic benchmark

Thread view: 
Jon Harrop - 10 Dec 2007 09:38 GMT
Following recent discussions here about benchmarking and when Java's
performance is worst, I proposed the idea of symbolic computation because
this entails very high allocation rates for short-lived objects when
written in an idiomatic functional style and Java should be uniquely bad at
this. While you might be able to work around the functional style using
imperative constructs, it will be extremely tedious and error-prone for
such applications.

So, perhaps someone would like to translate this simple symbolic simplifier
into Java:

 http://www.lambdassociates.org/studies/study10.htm

Here is the OCaml implementation:

let rec ( +: ) f g = match f, g with
  | `Int n, `Int m -> `Int (n +/ m)
  | `Int (Int 0), e | e, `Int (Int 0) -> e
  | f, `Add(g, h) -> f +: g +: h
  | f, g -> `Add(f, g)

let rec ( *: ) f g = match f, g with
  | `Int n, `Int m -> `Int (n */ m)
  | `Int (Int 0), e | e, `Int (Int 0) -> `Int (Int 0)
  | `Int (Int 1), e | e, `Int (Int 1) -> e
  | f, `Mul(g, h) -> f *: g *: h
  | f, g -> `Mul(f, g)

let rec simplify = function
  | `Int _ | `Var _ as f -> f
  | `Add (f, g) -> simplify f +: simplify g
  | `Mul (f, g) -> simplify f *: simplify g

OCaml is not terribly efficient at this benchmark but I believe it will be
many times faster than any Java solution.

The program simply traverses a symbolic expression, reconstructing it whilst
applying some simple rewrite rules to simplify it. For example, "(a+0)*1*b"
becomes "a*b". The applications are obvious: parsers, compilers,
interpreters, scientific computing and computer algebra.

Note that the symbolic expression is immutable: you must not destroy your
input! Also, numbers are all arbitrary precision integers.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Lew - 10 Dec 2007 15:27 GMT
> Following recent discussions here about benchmarking and when Java's
> performance is worst, I proposed the idea of symbolic computation because
> this entails very high allocation rates for short-lived objects when
> written in an idiomatic functional style and Java should be uniquely bad at
> this.

Java is very good at high allocation rates for short-lived objects; it's
optimized for it.  Don't believe Jon's rants, people.

Signature

Lew

Jon Harrop - 10 Dec 2007 22:32 GMT
>> Following recent discussions here about benchmarking and when Java's
>> performance is worst, I proposed the idea of symbolic computation because
[quoted text clipped - 4 lines]
> Java is very good at high allocation rates for short-lived objects; it's
> optimized for it.  Don't believe Jon's rants, people.

Prove it by implementing that benchmark in Java.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Daniel Pitts - 10 Dec 2007 23:48 GMT
>>> Following recent discussions here about benchmarking and when Java's
>>> performance is worst, I proposed the idea of symbolic computation because
[quoted text clipped - 5 lines]
>
> Prove it by implementing that benchmark in Java.

Actually, why don't you prove it (or disprove it). I'm sure most of the
Java advocates here have better things to do than chase your wild goose.

Signature

Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>

Jon Harrop - 11 Dec 2007 00:43 GMT
>>>> Following recent discussions here about benchmarking and when Java's
>>>> performance is worst, I proposed the idea of symbolic computation
[quoted text clipped - 8 lines]
> Actually, why don't you prove it (or disprove it). I'm sure most of the
> Java advocates here have better things to do than chase your wild goose.

Because you would only blame Java's poor performance on me rather than Java.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Lew - 11 Dec 2007 02:28 GMT
>>>>> Following recent discussions here about benchmarking and when Java's
>>>>> performance is worst, I proposed the idea of symbolic computation
[quoted text clipped - 8 lines]
>
> Because you would only blame Java's poor performance on me rather than Java.

Oh, so when it's time to sh.t or get off the pot, you just attack the person
and (likely inaccurately) predict bad behavior on their part to excuse your
lack of intellectual honesty, Mr. Cambridge Don.

Signature

Lew

Jon Harrop - 11 Dec 2007 10:03 GMT
> Oh, so when it's time to sh.t or get off the pot, you just attack the
> person and (likely inaccurately) predict bad behavior on their part to
> excuse your lack of intellectual honesty, Mr. Cambridge Don.

Sounds like you can't write an efficient Java implementation of this
benchmark either.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Lew - 11 Dec 2007 14:32 GMT
>> Oh, so when it's time to sh.t or get off the pot, you just attack the
>> person and (likely inaccurately) predict bad behavior on their part to
>> excuse your lack of intellectual honesty, Mr. Cambridge Don.
>
> Sounds like you can't write an efficient Java implementation of this
> benchmark either.

Yada, yada, Mr. Cambridge Almighty Don.

Signature

Lew

Daniel Pitts - 11 Dec 2007 18:37 GMT
>> Oh, so when it's time to sh.t or get off the pot, you just attack the
>> person and (likely inaccurately) predict bad behavior on their part to
>> excuse your lack of intellectual honesty, Mr. Cambridge Don.
>
> Sounds like you can't write an efficient Java implementation of this
> benchmark either.

What's with the personal attacks here? No one accused you of being unable.

I personally said I was unwilling to at this time.

Signature

Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>

Steve Wampler - 11 Dec 2007 00:18 GMT
>>> Following recent discussions here about benchmarking and when Java's
>>> performance is worst, I proposed the idea of symbolic computation because
[quoted text clipped - 5 lines]
>
> Prove it by implementing that benchmark in Java.

Since we've been talking about C/C++ and Java, got a version in C or C++? :)

Signature

Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.

Jon Harrop - 11 Dec 2007 00:46 GMT
>> Prove it by implementing that benchmark in Java.
>
> Since we've been talking about C/C++ and Java, got a version in C or C++?
> :)

I posted something similar here (for derivatives rather than
simplification):

 http://www.codecodex.com/wiki/index.php?title=Derivative

but this problem really requires a GC to do it properly because you must be
able to handle shared subexpressions.

#include <iostream>

using namespace std;

class Var;

class Base {
public:
 virtual ~Base() {};
 virtual const Base *clone() = 0;
 virtual const Base *d(const string &v) const = 0;
 virtual ostream &print(ostream &o) const = 0;
};

ostream &operator<<(ostream &o, const Base &e) { e.print(o); return o; }

class Int : public Base {
 const int n;
public:
 Int(int m) : n(m) {}
 ~Int() {}
 const Base *clone() { return new Int(n); }
 const Base *d(const string &v) const { return new Int(0); }
 ostream &print(ostream &o) const { return o << n; }
};

class Var : public Base {
 const string var;
public:
 Var(string v) : var(v) {}
 ~Var() {}
 const Base *clone() { return new Var(var); }
 const Base *d(const string &v) const { return new Int(var == v ? 1 : 0); }
 ostream &print(ostream &o) const { return o << var; }
};

class Plus : public Base {
 const Base *e1, *e2;
public:
 Plus(const Base *a, const Base *b) : e1(a), e2(b) {}
 ~Plus() { delete e1; delete e2; }
 const Base *clone() { return new Plus(e1, e2); }
 const Base *d(const string &v) const { return new Plus(e1->d(v),
e2->d(v)); }
 ostream &print(ostream &o) const
 { return o << "(" << *e1 << " + " << *e2 << ")"; }
};

class Times : public Base {
 const Base *e1, *e2;
public:
 Times(const Base *a, const Base *b) : e1(a), e2(b) {}
 ~Times() { delete e1; delete e2; }
 const Base *clone() { return new Times(e1, e2); }
 const Base *d(const string &v) const
 { return new Plus(new Times(e1, e2->d(v)), new Times(e1->d(v), e2)); }
 ostream &print(ostream &o) const { return o << "(" << *e1 << " * " << *e2
<< ")"; }
};

class Expr {
public:
 Base *e;
 Expr(Base *a) : e(a) {}
};

const Expr operator+(const Expr e1, const Expr e2)
{ return Expr(new Plus(e1.e->clone(), e2.e->clone())); }
const Expr operator*(const Expr e1, const Expr e2)
{ return Expr(new Times(e1.e->clone(), e2.e->clone())); }

ostream &operator<<(ostream &o, const Expr e) { return o << e.e; }

int main() {
 Var vx("x"), va("a"), vb("b"), vc("c");
 Expr x(&vx), a(&va), b(&vb), c(&vc);
 Expr e = a*x*x + b*x + c;
 cout << "d(" << *(e.e) << ", x) = " << *(e.e->d("x")) << endl;
 return 0;
}

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Roger Lindsjö - 11 Dec 2007 12:36 GMT
> Following recent discussions here about benchmarking and when Java's
> performance is worst, I proposed the idea of symbolic computation because
[quoted text clipped - 6 lines]
> So, perhaps someone would like to translate this simple symbolic simplifier
> into Java:

I'll post the code tonight. I have not tried to reduce the line count.

My program first prints the expression, then run 10 loops with 10000000
iterations and prints the duration for each loop. At the end it prints
the final reduction. The machine is a 2.4GHz Core 2 Quad.

[* x [+ [+ [* 12 0] [+ 23 8]] y]]
1355
1281
1275
1273
1277
1284
1279
1279
1272
1274
[* x [+ 31 y]]

I'm used to Ocaml, but it looked like Haskell, so I think my simplify is
correct.

> The program simply traverses a symbolic expression, reconstructing it whilst
> applying some simple rewrite rules to simplify it. For example, "(a+0)*1*b"
[quoted text clipped - 3 lines]
> Note that the symbolic expression is immutable: you must not destroy your
> input! Also, numbers are all arbitrary precision integers.

The last rule, that it should handle arbitrary precision integers made
me use BigIntegers, with int the program ran about twice as fast.

//Roger Lindsjö
Roger Lindsjö - 11 Dec 2007 19:52 GMT
> I'm used to Ocaml, but it looked like Haskell, so I think my simplify is
> correct.

Of course I meant to say "I'm not used to Ocaml..."

//Roger Lindsjö
Jon Harrop - 11 Dec 2007 20:18 GMT
>> I'm used to Ocaml, but it looked like Haskell, so I think my simplify is
>> correct.
>
> Of course I meant to say "I'm not used to Ocaml..."

Can we have the Java code? :-)

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Roger Lindsjö - 11 Dec 2007 23:08 GMT
Before being too enthusiastic, I did not have the associative rules
implemented, now my times are about 1.6 seconds for 10000000 iterations.

Also, while testing I notice that with the rules (at least the way I
implemented them) [+ [+ x 3] [+ 1 y]] would get simplified to
[+ [+ [+ x 3] 1] y]. Jon, it that true for your implementation also? (No
need posting code if it does not fully work).

Jon, feel free to email me, just use my first name instead of news.nospam.

//Roger Lindsjö
Jon Harrop - 12 Dec 2007 13:16 GMT
> Before being too enthusiastic, I did not have the associative rules
> implemented, now my times are about 1.6 seconds for 10000000 iterations.
[quoted text clipped - 3 lines]
> [+ [+ [+ x 3] 1] y]. Jon, it that true for your implementation also? (No
> need posting code if it does not fully work).

That is the same, yes.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Roger Lindsjö - 12 Dec 2007 20:49 GMT
I have been talking to Don off line and I used Chris Dollin's idea of
using cached values for 0 and 1. (Chris and my code were actually quite
similar from start, and on par for speed, but after using Cris ideo of
cached 0 and 1 my end up 10% faster or so). I have changed some names to
look more like Chris code to make them easier to compare. Initially I
did not use instanceof but instead had a type for each element which I
could use to check the type. This was actually slower than instanceof
(surprised me).

Using BigInteger or BigDecimal made no difference in my test, so I used
BigDecimal.

The machine I used was a 2.4GHz Core 2 Quad with 2GB memory, and while
running any of the programs they seemed to use 100% of one core.

java Simplify 10000000 3
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 0
Took 1.16 seconds
Bench 1
Took 1.11 seconds
Bench 2
Took 1.104 seconds
[* x [+ 31 y]]

However, after installing Ocaml I compiled Dan's program and ran. The
output was (I modified it to run several times):
time ./simplify
Took 1.514769s
Took 1.515770s
Took 1.518769s

So now I get confused, according to Don this code should be about twice
as fast as the Java code, but in my tests Java is slightly faster. Why
is that? Perhaps I have the wrong compiler, mine says 3.09.3. There
seemed to be a 3.10 out, but is wasn't part of Fedora 7 regular
repository. Could upgrading to 3.10 actually make the program at least
twice as fast?

Can anyone explain? Could it be that on my architecture the HotSpot is
better at optimizing than the Ocaml compiler?

I then modified my program to use long instead of BigDecimal and got the
following:
java SimplifyLong 10000000 3
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 0
Took 0.61 seconds
Bench 1
Took 0.549 seconds
Bench 2
Took 0.558 seconds
[* x [+ 31 y]]

Simplify.java
import java.math.BigDecimal;

public class Simplify {
  public static void main(String[] args) {
    int lim = Integer.parseInt(args[0]);
    int nbench = Integer.parseInt(args[1]);

    Expr expr =
     mul(var("x"),
        add(
         add(mul(rat(12),rat(0)), add(rat(23),rat(8))),
         var("y")
         )
        );

    System.out.println(expr);
    Expr reduced = null;
    for (int batch = 0; batch < nbench; batch++) {
      long start = System.currentTimeMillis();
      System.err.println("Bench " + batch);
      for (int i = 0; i < lim; i++) {
        reduced = expr.simplify();
      }
      System.err.println("Took " +
        (System.currentTimeMillis() - start) / 1000.0 + " seconds");
    }
    System.out.println(reduced);
  }

  private static Expr mul(Expr a, Expr b) {
    return new Mul(a,b);
  }
  private static Expr add(Expr a, Expr b) {
    return new Add(a,b);
  }
  private static Expr rat(long a) {
      return Val.create(BigDecimal.valueOf(a));
  }
  private static Expr var(String s) {
    return new Sym(s);
  }

  public static abstract class Expr {
      public abstract Expr simplify();
  }

  public static final class Add extends Expr {
    private final Expr l, r;

    public Add(Expr l, Expr r) {
      this.l = l;
      this.r = r;
    }

    public Expr simplify() {
      return add(l.simplify(), r.simplify());
    }

    private Expr add(Expr l, Expr r) {
      if (Val.ZERO == l) {
    return r;
      } else if (Val.ZERO == r) {
    return l;
      } else if (l instanceof Val && r instanceof Val) {
    return add((Val) l, (Val) r) ;
      } else if (r instanceof Add) {
    return add(l, (Add) r);
      } else {
    return new Add(l, r);
      }
    }

    private Expr add(Val l, Val r) {
      return Val.create(l.getValue().add(r.getValue()));
    }

    private Expr add(Expr l, Add r) {
      return new Add(new Add(l, r.l), r.r).simplify();
    }

    public String toString() {
      return "[+ " + l + " " + r +"]";
    }
  }

  public static class Mul extends Expr {
    private final Expr l, r;
    public Mul(Expr l, Expr r) {
      this.l = l;
      this.r = r;
    }

    public Expr simplify() {
      return mul(l.simplify(), r.simplify());
    }

    private Expr mul(Expr l, Expr r) {
      if (Val.ZERO == l || Val.ZERO == r) {
    return Val.ZERO;
      } else if (Val.ONE == l) {
    return r;
      } else if (Val.ONE == r) {
    return l;
      } else if (l instanceof Val && r instanceof Val) {
    return mul((Val) l, (Val) r) ;
      } else if (r instanceof Mul) {
    return mul(l, (Mul) r);
      } else {
    return new Mul(l, r);
      }
    }

    private Expr mul(Val l, Val r) {
      return Val.create(l.getValue().multiply(r.getValue()));
    }

    private Expr mul(Expr l, Mul r) {
      return new Mul(new Mul(l, r.l), r.r).simplify();
    }

    public String toString() {
      return "[* " + l + " " + r +"]";
    }
  }

  public static class Sym extends Expr {
    private final String symbol;

    public Sym(String symbol) {
      this.symbol = symbol;
    }

    public Expr simplify() {
      return this;
    }

    public String toString() {
      return symbol;
    }
  }

  public static class Val extends Expr {
    public static final Val ZERO = new Val(BigDecimal.ZERO);
    public static final Val ONE = new Val(BigDecimal.ONE);

    private final BigDecimal value;

    private Val(BigDecimal value) {
      this.value = value;
    }

    public Expr simplify() {
      return this;
    }

    public static Val create(BigDecimal value) {
      if (BigDecimal.ZERO == value || BigDecimal.ZERO.equals(value)) {
        return ZERO;
      } else if (BigDecimal.ONE == value && BigDecimal.ONE.equals(value)) {
        return ONE;
      } else {
        return new Val(value);
      }
    }

    public BigDecimal getValue() {
      return value;
    }

    public String toString() {
      return String.valueOf(value);
    }
  }
}

//Roger Lindsjö
Roger Lindsjö - 13 Dec 2007 15:25 GMT
I have done some more test, and used the same Ocaml compiler (3.09.3)
and the same Java (1.6 update 3). The times posted are averages over a
few iterations. The processes were locked to one cpu to not allow the gc
running on a separate cpu, but this didn't make any noticeable
difference. Both machines runs late (but not the same) GNU/Linux kernels
(Fedora Core and Fedora systems).

Machine 1 is a 3.2GHz Pentium with 2 cores (not hyperthreading). A 64
bit machine (were they called Pentium D?). 2GB memory.
Java: 5.123 s
Ocaml: 1.142 s

Machine 2 is a 2.4GHz Core 2 Quad also with 2GB memory.
Java: 1.074 s
Ocaml: 1.515 s

The compilers must be very different in how they utilize the cpu
architecture. Sitting on machine 1 (not my primary computer) I'd say
Ocaml is faster, but sitting on machine 2 (my primary machine ;-) ) I'd
say Java is faster.

Perhaps Java is better at utilizing the large cache on machine 2? I
don't know.

//Roger Lindsjö
Steve Wampler - 13 Dec 2007 15:55 GMT
> The compilers must be very different in how they utilize the cpu
> architecture. Sitting on machine 1 (not my primary computer) I'd say
> Ocaml is faster, but sitting on machine 2 (my primary machine ;-) ) I'd
> say Java is faster.

Thanks for testing this - I've been trying to figure out why Jon and I
have been getting such different results.  This may well explain it.

Signature

Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.

Lew - 13 Dec 2007 17:03 GMT
>> The compilers must be very different in how they utilize the cpu
>> architecture. Sitting on machine 1 (not my primary computer) I'd say
[quoted text clipped - 3 lines]
> Thanks for testing this - I've been trying to figure out why Jon and I
> have been getting such different results.  This may well explain it.

This supports the points of those who say, "Watch out for benchmarks," and,
"Java can be as fast as C++", and. "Java is slower that C++", and, "It depends".

Signature

Lew

bugbear - 14 Dec 2007 13:58 GMT
>>> The compilers must be very different in how they utilize the cpu
>>> architecture. Sitting on machine 1 (not my primary computer) I'd say
[quoted text clipped - 7 lines]
> and, "Java can be as fast as C++", and. "Java is slower that C++", and,
> "It depends".

May I point out the existence of comp.benchmarks  ?

  BugBear
Roedy Green - 13 Dec 2007 21:09 GMT
On Thu, 13 Dec 2007 16:25:29 +0100, Roger Lindsjö
<news.nospam@tilialacus.net> wrote, quoted or indirectly quoted
someone who said :

>Machine 2 is a 2.4GHz Core 2 Quad also with 2GB memory.
>Java: 1.074 s
>Ocaml: 1.515 s

If you pass me the Java code, I can run it both with java.exe and with
Jet.  That ratio would give you another datapoint, compiled Java.

I have a dual core Athlon.  I think there are no threads in your
implementation (are there in the Ocaml?), you would expect dual core
to buy you nothing.

Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Steve Wampler - 13 Dec 2007 21:46 GMT
> I have a dual core Athlon.  I think there are no threads in your
> implementation (are there in the Ocaml?), you would expect dual core
> to buy you nothing.

I don't know how Jet handles the execution environment, but Sun's
JDK runs lots of administrative threads while a program is running.
I would expect some benefit from dual cores with that, but apparently
(from Roger's note) it's not much.

Signature

Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.

Roger Lindsjö - 13 Dec 2007 22:16 GMT
> On Thu, 13 Dec 2007 16:25:29 +0100, Roger Lindsjö
> <news.nospam@tilialacus.net> wrote, quoted or indirectly quoted
[quoted text clipped - 10 lines]
> implementation (are there in the Ocaml?), you would expect dual core
> to buy you nothing.

No threading other than what the JVM does internally for housekeeping
such as GC. I even ran restricted to one core (using taskset) since I
was told that Ocaml was single threaded.

The code is 3 posts up, the one I was relying to.
http://groups.google.com/group/comp.lang.java.programmer/msg/b5185f597ba62103

//Roger Lindsjö
Lew - 14 Dec 2007 01:32 GMT
> No threading other than what the JVM does internally for housekeeping
> such as GC. I even ran restricted to one core (using taskset) since I
> was told that Ocaml was single threaded.

If it's true that Ocaml is single-threaded, then that kills it for a wide
range of applications for which Java is extremely suitable.  Performance would
be moot, even if the numbers weren't so close.

Signature

Lew

Daniel Dyer - 14 Dec 2007 02:10 GMT
>> No threading other than what the JVM does internally for housekeeping  
>> such as GC. I even ran restricted to one core (using taskset) since I  
[quoted text clipped - 3 lines]
> wide range of applications for which Java is extremely suitable.  
> Performance would be moot, even if the numbers weren't so close.

http://caml.inria.fr/pub/ml-archives/caml-list/2002/11/64c14acb90cb14bedb2cacb73
338fb15.en.html


I briefly looked at OCaml because I'd heard a lot of good things about.  
The threading thing did put me off.  It seems there are issues with its  
garbage collector implementation in this respect.

Dan.

Signature

Daniel Dyer
http://www.uncommons.org

Jon Harrop - 14 Dec 2007 09:51 GMT
>> No threading other than what the JVM does internally for housekeeping
>> such as GC. I even ran restricted to one core (using taskset) since I
[quoted text clipped - 3 lines]
> range of applications for which Java is extremely suitable.  Performance
> would be moot, even if the numbers weren't so close.

OCaml simply uses processes for concurrency rather than threads. In terms of
capabilities, there is no difference.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Lew - 14 Dec 2007 13:46 GMT
>>> No threading other than what the JVM does internally for housekeeping
>>> such as GC. I even ran restricted to one core (using taskset) since I
[quoted text clipped - 5 lines]
> OCaml simply uses processes for concurrency rather than threads. In terms of
> capabilities, there is no difference.

I'm not familiar with OCaml.  Does it have a library of concurrency
constructs, such as blocking / non-blocking queues, monitors or other locks,
memory-model coordinating constructs cognate to Java's "volatile" and the like
to coordinate shared data?

Signature

Lew

Jon Harrop - 14 Dec 2007 14:50 GMT
> I'm not familiar with OCaml.  Does it have a library of concurrency
> constructs, such as blocking / non-blocking queues, monitors or other
> locks, memory-model coordinating constructs cognate to Java's "volatile"
> and the like to coordinate shared data?

No, OCaml has data parallel and message passing libraries instead. These are
higher level constructs for parallelism and concurrency that abstract away
the need to use the low-level constructs like monitors and locks directly.
This is easier to use and arguably safer (although I have no evidence to
back up the latter in general, outside of Erlang's overwhelming victory for
massively-concurrent applications).

There are complicated trade-offs involved here but, essentially, OCaml
sacrifices high-performance inter-thread communication (that you can
achieve with) for better scaling because its GC requires no garbage
collection.

There is also a new research language called JoCaml that integrates support
for concurrency directly into the OCaml language using the join calculus.

You may also be interested in related efforts such as the recent support for
hinted parallelism in Haskell and F#'s recently introduction asynchronous
workflows.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Chris Dollin - 14 Dec 2007 15:33 GMT
> There are complicated trade-offs involved here but, essentially, OCaml
> sacrifices high-performance inter-thread communication (that you can
> achieve with) for better scaling because its GC requires no garbage
> collection.

Did something go wrong with that paragraph, possibly just over-ambitions
packing? I can't make consistent sense of it.

Signature

Chris "perhaps it's the Friday effect" Dollin

Hewlett-Packard Limited                                          registered no:
registered office: Cain Road, Bracknell, Berks RG12 1HN          690597 England

Jon Harrop - 14 Dec 2007 18:57 GMT
>> There are complicated trade-offs involved here but, essentially, OCaml
>> sacrifices high-performance inter-thread communication (that you can
[quoted text clipped - 3 lines]
> Did something go wrong with that paragraph, possibly just over-ambitions
> packing? I can't make consistent sense of it.

That was indeed a buggy paragraph. :-)

Multithreaded GC (like the JVM and .NET) provides cheap interthread
communication. OCaml's approach (independent GCs per process) can make
interthread communication much more expensive but it scales better.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Mark Thornton - 14 Dec 2007 20:02 GMT
>>> There are complicated trade-offs involved here but, essentially, OCaml
>>> sacrifices high-performance inter-thread communication (that you can
[quoted text clipped - 8 lines]
> communication. OCaml's approach (independent GCs per process) can make
> interthread communication much more expensive but it scales better.

With multicore (shared memory) processors now the norm on the desk top,
the advantage of the Java approach is significant for some desktop
applications. The work I'm doing at the moment has threads sharing data
structures which run to hundreds of megabytes. Combined with the Doug
Lea's Fork/Join library (jsr166y) this is giving me good speed up over
sequential computation.
Separate processes not only impose higher communication overhead, but on
some architectures, the overhead of process creation is substantial.
Much as I would like to wean customers off Windows, my code has to work
efficiently on that platform.

Mark Thornton
Jon Harrop - 14 Dec 2007 21:53 GMT
> With multicore (shared memory) processors now the norm on the desk top,
> the advantage of the Java approach is significant for some desktop
> applications.

I'm interested in seeing evidence either way. From the benchmarks in this
thread, for example, I attribute the (up to) 5x slower allocation rates in
Java to the fact that its allocator and GC must cope with interthread
issues.

> The work I'm doing at the moment has threads sharing data
> structures which run to hundreds of megabytes. Combined with the Doug
> Lea's Fork/Join library (jsr166y) this is giving me good speed up over
> sequential computation.

Sure. The work I'm doing at the moment has many processes with large shared
data structures.

> Separate processes not only impose higher communication overhead, but on
> some architectures, the overhead of process creation is substantial.
> Much as I would like to wean customers off Windows, my code has to work
> efficiently on that platform.

Yes. Process creation in OCaml takes about 10ms here whereas thread creation
in F# takes 1ms.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Mark Thornton - 14 Dec 2007 23:29 GMT
>> With multicore (shared memory) processors now the norm on the desk top,
>> the advantage of the Java approach is significant for some desktop
[quoted text clipped - 4 lines]
> Java to the fact that its allocator and GC must cope with interthread
> issues.

The allocator is very trivial, almost all the time is in collection. The
comment seen elsewhere that dead objects are free to collect is not
quite true. If you have very high rates of garbage creation, the nursery
area will fill frequently and cause garbage collection to occur often.
There seems to be at least some fixed cost in gc, so if the live object
count/size is very small, the time for each gc cycle does not tend to
zero. Many cases of high rates of allocating short lived objects have a
recognisable block structure. The current JVM can detect this (escape
analysis) but so far only uses this to eliminate some types of locks.
Perhaps unfortunately a lot of production Java code has been hand
'optimised' to reduce such allocations, which means that when the
compiler people test escape analysis, the returns aren't as good as
hoped (programmers have already done it by hand). This technique is
apparently used in the (not free) JET jvm, and we hope it will finally
appear in Java 7.

As for multithread issues, the default garbage collector (note Sun's jvm
comes with a number of collectors which you can select using runtime
options), is a stop the world variety. As all threads are stopped (apart
from the collector itself) interthread issues are non existent. In
threaded applications this means that neither allocation or collection
have threading overheads (no synchronization in either case). OK, it
must take some time to suspend/resume the working threads. Compared to
the costs involved with thread safe heaps in C/C++ this can be a
significant gain. Note that the JVM can use simplified code when running
on a uniprocessor, i.e. multiprocessor costs only occur if you actually
run the application on such a machine.

> Sure. The work I'm doing at the moment has many processes with large
> shared data structures.

It is awkward to include pointers in the shared structures unless you
can ensure that the structure is mapped to the same address in each
process. Easy to arrange in a unix fork environment, more difficult to
ensure on Windows, especially if the processes are not effectively clones.

Incidentally your ray trace code makes unnecessary use of inner classes
as opposed to nested classes. Sprinkling a few 'static' keywords around
chops about 10% off the time of the simple version. Another 10% can be
saved by specifying a larger heap than default (on my machine the
default initial heap is about 5MB). These savings apply to both client
and server JVM's.

Mark Thornton
Jon Harrop - 14 Dec 2007 23:49 GMT
> As for multithread issues, the default garbage collector (note Sun's jvm
> comes with a number of collectors which you can select using runtime
> options), is a stop the world variety. As all threads are stopped (apart
> from the collector itself) interthread issues are non existent...

Stop-the-world is exactly the kind of interthread issue that I was referring
to.

If you take this symbolic simplifier, for example, and make it run two
simplifications at a time on my dual core then its performance will not
improve by a factor of two if the GC stops all threads while it runs.

In contrast, the OCaml is not only more than twice as fast to start with but
it will scale linearly with the number of cores in this case because the
GCs do not interfere.

>  > Sure. The work I'm doing at the moment has many processes with large
>  > shared data structures.
>
> It is awkward to include pointers in the shared structures unless you
> can ensure that the structure is mapped to the same address in each
> process.

There are no pointers in these languages though.

> Incidentally your ray trace code makes unnecessary use of inner classes
> as opposed to nested classes. Sprinkling a few 'static' keywords around
> chops about 10% off the time of the simple version. Another 10% can be
> saved by specifying a larger heap than default (on my machine the
> default initial heap is about 5MB). These savings apply to both client
> and server JVM's.

Can you send me your code and I'll update it.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Mark Thornton - 15 Dec 2007 09:07 GMT
>> As for multithread issues, the default garbage collector (note Sun's jvm
>> comes with a number of collectors which you can select using runtime
[quoted text clipped - 11 lines]
> it will scale linearly with the number of cores in this case because the
> GCs do not interfere.

Note that Sun's Java 6 also comes with a parallel garbage collector. I
didn't test that because my home machine is an old single core
processor. I notice from other comments that the performance of OCaml
relative to Java seems to depend quite a bit on the specific
environment. Some people are reporting Java as faster.

> Can you send me your code and I'll update it.
Done.

Mark Thornton
Lew - 15 Dec 2007 01:09 GMT
>> I'm not familiar with OCaml.  Does it have a library of concurrency
>> constructs, such as blocking / non-blocking queues, monitors or other
[quoted text clipped - 7 lines]
> back up the latter in general, outside of Erlang's overwhelming victory for
> massively-concurrent applications).

Message passing as a primitive coordination and concurrency mechanism is
extremely powerful, especially in an O.S. like QNX that reduces context-switch
times to ridiculously small values.  I first encountered it on a minicomputer
in 1981.  It makes for very straightforward coding, too.

Signature

Lew

Jon Harrop - 15 Dec 2007 18:45 GMT
>> No, OCaml has data parallel and message passing libraries instead. These
>> are higher level constructs for parallelism and concurrency that abstract
[quoted text clipped - 9 lines]
> minicomputer
> in 1981.  It makes for very straightforward coding, too.

Exactly, yes. However, OCaml does not do such optimizations (AFAIK). To be
fair, I haven't tried it though.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

ldv@mail.com - 14 Dec 2007 10:27 GMT
> > On Thu, 13 Dec 2007 16:25:29 +0100, Roger Lindsjö
> > <news.nos...@tilialacus.net> wrote, quoted or indirectly quoted
[quoted text clipped - 18 lines]
>
> //Roger Lindsjö

On my 2GHz Celeron the executable produced by the latest Excelsior JET
6.0 is much faster than HotSpot Client VM 1.6.0_3, but slower than the
Server VM, which is almost twice as fast as the Client VM. Here go the
best results selected from 10 consecutive runs:

HSS : 2.594s
JET : 3.25s
HSC : 5.172s

On a colleague's Core 2 Duo, the situation is more or less the same:

HSS : 1.206s
JET : 1.687s
HSC : 2.118s

Are you guys all using the Server VM in your comparisons?

LDV
Roger Lindsjö - 14 Dec 2007 10:46 GMT
>>> On Thu, 13 Dec 2007 16:25:29 +0100, Roger Lindsjö
>>> <news.nos...@tilialacus.net> wrote, quoted or indirectly quoted
[quoted text clipped - 31 lines]
>
> Are you guys all using the Server VM in your comparisons?

I'm using Server VM by default (or actually, the JVM does), the output
from just java contains:
The default VM is server,
because you are running on a server-class machine.

//Roger Lindsjö
Jon Harrop - 14 Dec 2007 10:46 GMT
> I'm using Server VM by default (or actually, the JVM does), the output
> from just java contains:
> The default VM is server,
> because you are running on a server-class machine.

Me too.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Roedy Green - 14 Dec 2007 22:11 GMT
On Thu, 13 Dec 2007 23:16:57 +0100, Roger Lindsjö
<news.nospam@tilialacus.net> wrote, quoted or indirectly quoted
someone who said :

>>> Machine 2 is a 2.4GHz Core 2 Quad also with 2GB memory.
>>> Java: 1.074 s
>>> Ocaml: 1.515 s

To my surprise Java -server skunked Jet which just edges out java
-client.  I gather the run-time knowledge -server can gather is paying
off big time here.

Athlon 64 X2 3800+ dual core, 2 gig ram.

java -client Simplify 10000000 3
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 0
Took 4.422 seconds
Bench 1
Took 4.728 seconds
Bench 2
Took 4.408 seconds
[* x [+ 31 y]]

using jet
Simplify.exe 10000000 3
[E:\exper]Simplify 10000000 3
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 0
Took 4.178 seconds
Bench 1
Took 4.193 seconds
Bench 2
Took 4.303 seconds
[* x [+ 31 y]]

java -server Simplify 10000000 3
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 0
Took 2.91 seconds
Bench 1
Took 2.84 seconds
Bench 2
Took 2.84 seconds
[* x [+ 31 y]]

Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Jon Harrop - 15 Dec 2007 00:02 GMT
> On Thu, 13 Dec 2007 23:16:57 +0100, Roger Lindsjö
> <news.nospam@tilialacus.net> wrote, quoted or indirectly quoted
[quoted text clipped - 19 lines]
> Took 2.84 seconds
> [* x [+ 31 y]]

Athlon 64 X2 4400+ dual core, 2Gb RAM

OCaml: 0.695s

Even accounting for CPU speed differences, OCaml is still running 3.5x
faster, primarily because it can allocate much more quickly than Java.

This is also neglecting the facts that Java is using 100x as much RAM and
doubles the load average (i.e. maxes out both of my cores whereas the OCaml
only uses one).

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Chris Dollin - 12 Dec 2007 13:58 GMT
> Following recent discussions here about benchmarking and when Java's
> performance is worst, I proposed the idea of symbolic computation because
[quoted text clipped - 8 lines]
>
>   http://www.lambdassociates.org/studies/study10.htm

[[CAVEAT: hacked together over a lunchtime or so, in no way any
 official statement of anything from anyone, etc.]]

OK, my broken-handed implementation running on a 3.06GHz Xeon with 2GB
memory (not that that matters much, since the JVM is only using the
default which is more like 64M) using FC5 with

 Java(TM) SE Runtime Environment (build 1.6.0-b105)
 Java HotSpot(TM) Client VM (build 1.6.0-b105, mixed mode, sharing)

and running (of course) with -server takes 2.498s, 2.94s, 3.002s,
for the 10000000-times loop (ie neglecting start-up times, which
according to `time` are about 0.2s). I don't know why there's so
much variability in the reported times.

Drat. OK, those times are using BigDecimal, not BigInteger, which
gets me 3.258s, 3.281s, 3.207s as the loop times. (Using plain
`int` gets those times down to 1.986s, 1.62s, 1.987s, as a measure
of the work done by the big-integer arithmetics.)

-server about halves the runtime. My initial -server-free code took
about 9s [with BigDecimal], but some straightforward optimisation work
(none of which comes close to the obvious caching cheat; just reordering
the tests and ensuring that the important values 0 and 1 were testable
with ==) reduced that.

I tried the obvious tactic of going for double-dispatches to avoid
multiple consecutive instanceof tests, but that made it /slower/,
perhaps because hotspot couldn't collapse code across method
boundaries? I don't really mind because in this case the multiple
dispatches made the code rather more long-winded than it already
was.

Count the line-size yourselves and subtract lines you deem not
relevant to the benchmark.

;;; -- code starteth here ----------------------------------

package temp;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertSame;

import java.math.BigDecimal;
import java.util.StringTokenizer;

import org.junit.Test;

/*
let rec ( +: ) f g = match f, g with
  | `Int n, `Int m -> `Int (n +/ m)
  | `Int (Int 0), e | e, `Int (Int 0) -> e
  | f, `Add(g, h) -> f +: g +: h
  | f, g -> `Add(f, g)

let rec ( *: ) f g = match f, g with
  | `Int n, `Int m -> `Int (n * / m)
  | `Int (Int 0), e | e, `Int (Int 0) -> `Int (Int 0)
  | `Int (Int 1), e | e, `Int (Int 1) -> e
  | f, `Mul(g, h) -> f *: g *: h
  | f, g -> `Mul(f, g)

let rec simplify = function
  | `Int _ | `Var _ as f -> f
  | `Add (f, g) -> simplify f +: simplify g
  | `Mul (f, g) -> simplify f * : simplify g

*/
public class SimplifyViaInstanceOf
   {    @Test public void ensureParseCompleteConstants()
       {
       assertSame( zero, parse( "0" ) );
       assertSame( one, parse( "1" ) );
       assertEquals( new Val( new BigDecimal( "17" ) ), parse( "17" ) );
       assertEquals( new Val( new BigDecimal( "1234567890" ) ), parse( "1234567890" ) );
       }
   
   @Test public void ensureParseCompleteNames()    
       {
       assertEquals( new Sym( "x" ), parse( "x" ) );
       assertEquals( new Sym( "y" ), parse( "y" ) );
       assertEquals( new Sym( "calico" ), parse( "calico" ) );
       assertEquals( new Sym( "whoot" ), parse( "whoot" ) );
       }
   
   @Test public void ensureParseSimpleExpression()
       {
       assertEquals( new Add( new Sym( "x" ), new Sym( "y" ) ), parse( "(x+y)" ) );
       assertEquals( new Add( one, new Sym( "y" ) ), parse( "(1+y)" ) );
       assertEquals( new Add( zero, new Sym( "y" ) ), parse( "(0+y)" ) );
       assertEquals( new Add( new Sym( "x" ), zero ), parse( "(x+0)" ) );
       assertEquals( new Mul( new Sym( "x" ), new Sym( "y" ) ), parse( "(x*y)" ) );
       assertEquals( new Mul( new Sym( "z" ), zero ), parse( "(z*0)" ) );
       assertEquals( new Mul( zero, new Sym( "x" ) ), parse( "(0*x)" ) );
       }
   
   @Test public void ensureParseCompoundExpression()
       {
       assertEquals( new Mul( parse( "(x+y)" ), one  ), parse( "((x+y)*1)" ) );
       assertEquals( new Mul( parse( "(x+y)" ), zero  ), parse( "((x+y)*0)" ) );
       assertEquals( new Mul( parse( "(x+y)" ), new Sym( "x" )  ), parse( "((x+y)*x)" ) );
       assertEquals( new Mul( parse( "(x+y)"), parse( "(0*1)" ) ), parse( "((x+y)*(0*1))" ) );
       }
   
   private Exp parse( String string )
       {
       StringTokenizer st = new StringTokenizer( string, "+*()", true );
       return parse( st, st.nextToken() );
       }

   private Exp parse( StringTokenizer st, String current )
       {
       char ch = current.charAt(0);
       if (ch == '(')
           {
           Exp L = parse( st, st.nextToken() );
           ch = st.nextToken().charAt( 0 );
           Exp R = parse( st, st.nextToken() );
           String ket = st.nextToken();
           if (!ket.equals( ")" )) throw new RuntimeException( "ket needed: " + ket );
           return ch == '*' ? new Mul( L, R ) : new Add( L, R );
           }
       else if (Character.isDigit( ch ))
           return Val.create( new BigDecimal( current ) );
       else if (current.equals( "*" ) || current.equals( "+" ) || current.equals( " )" ))
           throw new RuntimeException( "misplaced punctuation: " + current );
       else
           return new Sym( current );
       }

   @Test public void ensureSimplifiesAddZero()
       {
       assertEquals( parse( "x" ), parse( "(x+0)" ).simplify() );
       assertEquals( parse( "x" ), parse( "(0+x)" ).simplify() );
       }
   
   @Test public void ensureAddsLiterals()
       {
       assertEquals( parse( "3" ), parse( "(1+2)" ).simplify() );
       assertSame( zero, parse( "0+0" ).simplify() );
       }
   
   @Test public void ensureMultipliesLiterals()
       {
       assertEquals( parse( "15" ), parse( "(5*3)" ).simplify() );
       assertSame( zero, parse( "(0*0)" ).simplify() );
       assertSame( zero, parse( "(x*0)" ).simplify() );
       assertSame( zero, parse( "(0*y)" ).simplify() );
       }
   
   @Test public void ensureSimplifiesMulZero()
       {
       assertSame( zero, parse( "(x*0)" ).simplify() );
       assertSame( zero, parse( "(0*x)" ).simplify() );
       }
   
   @Test public void ensureSimplifiesMulOne()
       {
       assertEquals( parse( "x" ), parse( "(x*1)" ).simplify() );
       assertEquals( parse( "x" ), parse( "(1*x)" ).simplify() );
       }
   
   @Test public void ensureLeftAssociatesAdds()
       {
       assertEquals( parse( "((x+y)+z)" ), parse( "(x+(y+z))" ).simplify() );
       }
   
   @Test public void ensureLeftAssociatesMuls()
       {
       assertEquals( parse( "((x*y)*z)" ), parse( "(x*(y*z))" ).simplify() );
       }
   
   static abstract class Exp
       {
       abstract Exp simplify();
       }
   
   static final Val zero = new Val( BigDecimal.ZERO );
   static final Val one = new Val( BigDecimal.ONE );
   
   static class Val extends Exp
       {
       final BigDecimal val;
       
       Val( BigDecimal val ) { this.val = val; }
       
       Exp simplify() { return this; }
       
       public String toString() { return val.toString(); }
       
       public boolean equals( Object other )
           { return other instanceof Val && val.equals( ((Val) other).val ); }
       
       public static Exp create( BigDecimal d )
           { return d.equals( BigDecimal.ZERO ) ? zero : d.equals(  BigDecimal.ONE ) ? one : new Val( d ); }
       }
   
   static class Sym extends Exp
       {
       String name;
       Sym( String name ) { this.name = name; }
       Exp simplify() { return this; }
       public boolean equals( Object other )
           { return other instanceof Sym && name.equals( ((Sym) other ).name ); }
       public String toString() { return name; }
       }
   
   static abstract class Inf extends Exp
       {
       final Exp L, R;
       Inf( Exp L, Exp R ) { this.L = L; this.R = R; }
       }
   
   static class Add extends Inf
       {
       Add( Exp L, Exp R ) { super( L, R ); }
       Exp simplify()
           {
           return sum( L.simplify(), R.simplify() );
           }
       private Exp sum( Exp sL, Exp R )
           {
           if (sL ==zero) return R;
           if (R == zero) return sL;
           if (sL instanceof Val && R instanceof Val) return Val.create( ((Val) sL).val.add( ((Val) R).val ) );
           if (R instanceof Add) return sum( sum( sL, ((Add) R).L ), ((Add) R).R );
           return new Add( sL, R );
           }
       public String toString() { return "(" + L + " + " + R + ")"; }
       
       public boolean equals( Object other )
           { return other instanceof Add && same( (Add) other ); }
       
       private boolean same( Add other )
           { return L.equals( other.L ) && R.equals( other.R ); }
       }
   
   static class Mul extends Inf
       {
       Mul( Exp L, Exp R ) { super( L, R ); }        
       Exp simplify(){ return prod( L.simplify(), R.simplify() ); }
       private Exp prod( Exp L, Exp R )
           {
           if (L == zero || R ==zero) return zero;
           if (L == one) return R;
           if (R == one) return L;
           if (L instanceof Val && R instanceof Val) return Val.create( ((Val) L).val.multiply( ((Val) R).val ) );
           if (R instanceof Mul) return prod( prod( L, ((Mul) R).L ), ((Mul) R).R );
           return new Mul( L, R );
           }
       public String toString() { return "(" + L + " * " + R + ")"; }
       
       public boolean equals( Object other )
           { return other instanceof Mul && same( (Mul) other ); }
       
       private boolean same( Mul other )
           { return L.equals( other.L ) && R.equals( other.R ); }
       }
   
   public static void main( String [] args )
       {
       // [* x [+ [+ [* 12 0] [+ 23 8]] y]]
       Exp a28_8 = new Add( Val.create( new BigDecimal( 23 ) ), Val.create( new BigDecimal( 8 ) ) );
       Exp m12_0 = new Mul( Val.create( new BigDecimal( 12 ) ), Val.create( new BigDecimal( 0 ) ) );
       Exp aAB = new Add( m12_0, a28_8 );
       Exp aABY = new Add( aAB, new Sym( "y" ) );
       Exp it = new Mul( new Sym( "x" ), aABY );
       Exp itx = new SimplifyViaInstanceOf().parse( "(x*(((12*0)+(23+8))+y))" );
       System.err.println( ">> original expression: " + it );
       System.err.println( ">> original expression: " + itx );
       System.err.println( ">> simplified expression: " + itx.simplify() );
       long base = System.currentTimeMillis();
       for (int i = 0; i < 10000000; i += 1) it.simplify();
       long after = System.currentTimeMillis();
       System.err.println( ">> took " + (after - base)/1000.0 + "s" );
       }
   }

Signature

Hewlett-Packard Limited registered office:                Cain Road, Bracknell,
registered no: 690597 England                                    Berks RG12 1HN

Jon Harrop - 12 Dec 2007 15:44 GMT
> OK, my broken-handed implementation running on a 3.06GHz Xeon with 2GB
> memory (not that that matters much, since the JVM is only using the
[quoted text clipped - 7 lines]
> according to `time` are about 0.2s). I don't know why there's so
> much variability in the reported times.

Great.

> Drat. OK, those times are using BigDecimal, not BigInteger, which
> gets me 3.258s, 3.281s, 3.207s as the loop times. (Using plain
> `int` gets those times down to 1.986s, 1.62s, 1.987s, as a measure
> of the work done by the big-integer arithmetics.)

Let's forget about the arbitrary-precision stuff and focus on the rest for
now. On my machine, your original code gives 2.853. Changing to
machine-precision ints that drops to 1.508s. The equivalent OCaml takes
only 0.711s.

> -server about halves the runtime. My initial -server-free code took
> about 9s [with BigDecimal], but some straightforward optimisation work
> (none of which comes close to the obvious caching cheat; just reordering
> the tests and ensuring that the important values 0 and 1 were testable
> with ==) reduced that.

Yes, the OCaml compiler will automatically factor common subexpressions like
the constant integer expressions.

> I tried the obvious tactic of going for double-dispatches to avoid
> multiple consecutive instanceof tests, but that made it /slower/,
> perhaps because hotspot couldn't collapse code across method
> boundaries? I don't really mind because in this case the multiple
> dispatches made the code rather more long-winded than it already
> was.

Interesting.

> Count the line-size yourselves and subtract lines you deem not
> relevant to the benchmark.

I get 175 LOC for Roger's Java code, 117 LOC for your Java and 40 LOC for my
OCaml.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Steve Wampler - 12 Dec 2007 22:49 GMT
> Following recent discussions here about benchmarking and when Java's
> performance is worst, I proposed the idea of symbolic computation because
[quoted text clipped - 8 lines]
>
>   http://www.lambdassociates.org/studies/study10.htm

Here's another Java one.  I didn't bother with parsing an
input string because it looks like people were timing
just the time to walk the parse tree and simplify it.

I've been hesitant to post it because it feels like cheating
(and the code isn't very OO - I actually tend to think in
a different language for these puzzles and then cast the
result into Java).

Here are times on my dual-Opteron@2GHz, using BigInteger,
each run does 10,000,000 simplifications and then outputs
the original string and the simplified version (taken
from the last run):

------------------------------------------------------
time java -server Simplify 10000000 10
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0090
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0010
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
java -server Simplify 10000000 10  0.11s user 0.03s system 103% cpu 0.135 total
------------------------------------------------------

Why so fast?  Because the simplified string is built up as the parse
is constructed.  So the simplification loop just outputs the result!
I'll grant that this isn't in the spirit of things, but I couldn't see
where it was prohibited :)  After all, if you're allowed to simplify
while building the parse tree, this isn't much more of a stretch...

(Naturally, this would work when parsing of input strings is done instead
of hand-building the parse tree, as well.)

-----------------------------------------------------
import java.math.BigInteger;
import java.util.Date;

public class Simplify {

    private static BigInteger zero = new BigInteger("0");
    private static BigInteger one  = new BigInteger("1");

    private class Expr {
        public String type;
        public Expr left;
        public Expr right;
        public BigInteger iVal;
        public String sVal;

        public  Expr(int nVal) {
            type = "int";
            iVal = new BigInteger(""+nVal);
            sVal = iVal.toString();
            }

        public Expr(String var) {
            type = "var";
            sVal = var;
            }

        public Expr(String nType, Expr nLeft, Expr nRight) {
            if (nType == "+") {
                if (nLeft.eq(zero)) { mkCopy(nRight); }
                else if (nRight.eq(zero)) { mkCopy(nLeft); }
                else if ((nLeft.type == "int") && (nRight.type == "int")) {
                    type = "int";
                    iVal = nLeft.iVal.add(nRight.iVal);
                    sVal = iVal.toString();
                    }
                else lassoc(nType, nLeft, nRight);
                }
            if (nType == "*") {
                if (nLeft.eq(one)) { mkCopy(nRight); }
                else if (nRight.eq(one)) { mkCopy(nLeft); }
                else if ((nLeft.type == "int") && (nRight.type =="int")) {
                    type = "int";
                    iVal = nLeft.iVal.multiply(nRight.iVal);
                    sVal = iVal.toString();
                    }
                else lassoc(nType, nLeft, nRight);
                }
            }

        private void lassoc(String nType, Expr nLeft, Expr nRight) {
            if (nRight.type == nType) {
                type = nType;
                left = new Expr(nType, nLeft, nRight.left);
                right = nRight.right;
                sVal = "["+nType+" "+left+" "+right+"]";
                }
            else mkCopy(nType, nLeft, nRight, zero,
                    "["+nType+" "+nLeft+" "+nRight+"]");
            }

        private void mkCopy(Expr e) {
            mkCopy(e.type, e.left, e.right, e.iVal, e.sVal);
            }

        private void mkCopy(String nType, Expr nLeft, Expr nRight,
                            BigInteger nVal, String nsVal) {
            type = nType;
            left = nLeft;
            right = nRight;
            iVal = nVal;
            sVal = nsVal;
            }

        public boolean eq(BigInteger i) {
            return (type == "int") && (iVal.equals(i));
            }

        public String toString() { return sVal; }

        public String simplify() { return sVal; }
        }

    public static String simplify(Expr expr) {
        return expr.simplify();
        }

    public void runTest(int limit) {
        String s = "[* x [+ [+ [* 12 0][+ 23 8]] y ]]";
        System.out.print(""+limit+" '"+s+"' -> ");
        Expr e = new Expr("*",
                   new Expr("x"),
                   new Expr("+",
                     new Expr("+",
                       new Expr("*", new Expr(12), new Expr(0) ),
                       new Expr("+", new Expr(23), new Expr(8) ) ) ,
                     new Expr("y") ) );
        Date start = new Date();
        String newS = null;
        for (int i = 0; i < limit; ++i) {
            newS = simplify(e);
            }
        Date stop = new Date();
        System.out.println("'"+newS+"':: "+
                  (stop.getTime() - start.getTime())/1000.0);
        }

    public static void main(String[] args) {
        int limit = 10000000;
        if (args.length > 0) limit = Integer.parseInt(args[0]);
        int nRuns = 10;
        if (args.length > 1) nRuns = Integer.parseInt(args[1]);

        Simplify simplify = new Simplify();
        for (int i = 0; i < nRuns; ++i) {
            simplify.runTest(limit);
            }
        System.exit(0);
        }
    }
-----------------------------------------------------
Signature

Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.

Chris Dollin - 13 Dec 2007 08:36 GMT
one.  I didn't bother with parsing an
> input string because it looks like people were timing
> just the time to walk the parse tree and simplify it.
[quoted text clipped - 3 lines]
> a different language for these puzzles and then cast the
> result into Java).

> Why so fast?  Because the simplified string is built up as the parse
> is constructed.  So the simplification loop just outputs the result!
> I'll grant that this isn't in the spirit of things, but I couldn't see
> where it was prohibited :)  After all, if you're allowed to simplify
> while building the parse tree, this isn't much more of a stretch...

I thought of that and decided it wasn't in the spirit of the test; it
does show that without the context the problem originally came from,
one doesn't know if it's a realistic optimisation or not, and it shows
how algorithmic changes can beat mere [micro] optimisation. The other
trick I thought of, almost but not quite a similar complete cheat, is
to give each node a slot to hold its simplified form and store it the
first time it gets computed. In this case, that would mean that the
first pass through the loop would do all the computation and the remaining
passes would be trivial ...

[I know Jon wanted immutable nodes, but these /would/ be immutable nodes;
there's no external visible change to the behaviour except performance.
Also, given that Java doesn't even pretend to be functional, disallowing
assignments could be seen as bias ... I do think it's sensible to make
the expression nodes /functionally/ immutable.]

Signature

Chris "puns is us" Dollin

Hewlett-Packard Limited     Cain Road, Bracknell,                registered no:
registered office:          Berks RG12 1HN                       690597 England

Steve Wampler - 13 Dec 2007 12:23 GMT
> The other
> trick I thought of, almost but not quite a similar complete cheat, is
> to give each node a slot to hold its simplified form and store it the
> first time it gets computed. In this case, that would mean that the
> first pass through the loop would do all the computation and the remaining
> passes would be trivial ...

Yes, that's essentially what my original (unpublished) solution (which includes
the parsing of the string inside every iteration of the inner timing loop) does.  I
wrote it with a string transformation approach instead of thinking of it
as a parse tree approach and used caching to avoid reparsing any
previously parsed substring.  Since every string (including the
input!) is a substring of itself, this means that it is slow on the first
iteration and reduces to a cache lookup on iterations 2-10,000,000
for this 'benchmark'.

So, the times, where the input string is passed to the simplifier on
each iteration are roughly 0.12 ms per 10,000,000 iterations.  The
parsing code is really a quick and (very) dirty hack, so I decided
not to put it out.

I'm still debating with myself about whether or not saving the
simplified string is a cheat or not - it's just a representation of
the parse tree *at that point*, so if you're allowed to simplify
while building the parse tree, why not save its representation
as well?  After all, without it, the challenge really reduces to
how fast you can recurse, concatenate strings, and convert big
integers to strings - not really do symbolic math manipulations.

Incidently, I noticed I had written

   public Expr(int i) {...

where it should be:

   public Expr(BigInteger i)

instead.  (And the hand-built parse tree would have to use "new BigInteger(...)" instead
of int literals).  Doesn't affect the timings enough to notice.

Also, the code could be cleaned up...and made more OO.  Then it would be pretty much the
same as your solution, of course.

Signature

Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.

Chris Dollin - 13 Dec 2007 13:34 GMT
(stuff about caching, and at the end of Steve's response he wrote:)

> Also, the code could be cleaned up...and made more OO.  Then it would be
> pretty much the same as your solution, of course.

Could easily be better. I don't like using `instanceof` -- it doesn't
seem to me to be in the spirit of an OO language. Type decisions should
be made by polymorphism. But, as I said, my attempt at using polymorphism
via multiple-dispatch was slower than my instanceof code. Perhaps I gave
up too early ...

Signature

Chris "giving in until he gets his own way" Dollin

Hewlett-Packard Limited registered office:                Cain Road, Bracknell,
registered no: 690597 England                                    Berks RG12 1HN

Jon Harrop - 13 Dec 2007 20:42 GMT
>> Also, the code could be cleaned up...and made more OO.  Then it would be
>> pretty much the same as your solution, of course.
[quoted text clipped - 4 lines]
> via multiple-dispatch was slower than my instanceof code. Perhaps I gave
> up too early ...

Yes, this is exactly where OOP ceases to be useful. I believe the idiomatic
OOP solution is to implement lots of different member functions reflecting
pieces of pattern match. For example, you would have an add_int member
function to add an int and specialize the add_int member of the Int class
to create a new Int.

I might have a go but I've no idea what would be efficient in Java.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Jon Harrop - 13 Dec 2007 20:36 GMT
> [I know Jon wanted immutable nodes...

I don't mind mutable nodes but I don't want simplification to destroy its
input expression.

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Jon Harrop - 13 Dec 2007 20:39 GMT
> Why so fast?  Because the simplified string is built up as the parse
> is constructed.  So the simplification loop just outputs the result!
> I'll grant that this isn't in the spirit of things, but I couldn't see
> where it was prohibited :)  After all, if you're allowed to simplify
> while building the parse tree, this isn't much more of a stretch...

To be honest, I'm not really sure how this relates to the original challenge
or to my OCaml. Specifically, I haven't used any strings or done any
parsing.

The previous Java implementations weren't benchmarking parsing, were they?

Signature

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Steve Wampler - 13 Dec 2007 21:43 GMT
>> Why so fast?  Because the simplified string is built up as the parse
>> is constructed.  So the simplification loop just outputs the result!
[quoted text clipped - 7 lines]
>
> The previous Java implementations weren't benchmarking parsing, were they?

That's correct.   I had originally assumed that the program should work
for different expressions and not have expression trees hard-coded.  So
I took that to mean that some form of string input was needed so that
expressions could be parsed into the expression tree.  That led me to
a string transformation solution as a first try.  When I saw the C++
solution to the derivative solution you offered as an example of something
similar, I realized that there was no need to read a string and parse it.
But my earlier misinterpretation keeps me coming back to referring to the
hard-coded internal representation as a parse tree (that, and my previous
life writing compilers).

Since the simplification can be done when that hard-coded parse tree
is built, the timing loop to simplify the constructed tree actually
doesn't have to do anything to get the simplified expression at all -
it's already there.  (My statement about the loop just outputting the
result isn't true, that's not needed either, after all!)  Colin's solution,
for example, just walks the tree and builds a copy of it.  But I'm
not convinced that's necessary - you could just return the tree without
a copy.  (I went ahead and returned that tree in a string representation,
because that made it easier for me to test things, and I could do that
step quickly.)

[The version I posted had a error in the string representation of the
result of converting right associativity to left (teaches me to code
while battling a mild case of insomnia...).  I found and fixed that
while cleaning up the code.]

It would probably be more interesting to include the code that
constructs the expression tree in the timing loop.  You may very
well have had that in mind, but that's not what I'd seen tested.

Signature

Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.

Steve Wampler - 13 Dec 2007 22:10 GMT
> Since the simplification can be done when that hard-coded parse tree
> is built, the timing loop to simplify the constructed tree actually
> doesn't have to do anything to get the simplified expression at all -
> it's already there.

Incidently, I think Java figured that out also.  Here are the times
I get for 10 runs of 2,000,000,000:

->time java -server Simplify 2000000000 10
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.01
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0010
java -server Simplify 2000000000 10  0.11s user 0.02s system 102% cpu 0.127 total
->

I suspect that means it's figured out I have a loop invariant.  Apparently
it has more trouble figuring that out when you actually do something.
Probably because new objects are built while 'cloning' the tree.

Signature

Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.

Roedy Green - 13 Dec 2007 21:04 GMT
> high allocation rates for short-lived objects when
>written in an idiomatic functional style and Java should be uniquely bad at
>this.

Not at all.  Java can allocate objects very quickly;, it just peels
them off a giant contiguous block.  What takes the time is when you
have a large number of long lived objects since they have to be
enumerated each GC pass.  Short lived objects are recovered in zero
time.
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Jon Harrop - 13 Dec 2007 23:27 GMT
>> high allocation rates for short-lived objects when
>> written in an idiomatic functional style and Java should be uniquely bad
[quoted text clipped - 5 lines]
> enumerated each GC pass.  Short lived objects are recovered in zero
> time.

Then you'll be able to improve the Java implementations of these benchmarks
so that they