Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / February 2006

Tip: Looking for answers? Try searching our database.

Multithreading / Scalability

Thread view: 
Philipp Kayser - 05 Feb 2006 18:28 GMT
Hi,

I wrote a small program to test scalability in a multiprocessor
environment (in my case an Athlon 64 X2). I included the source below.

To my surprise the calculation does not run faster if I use 2 threads
(which should be the case if I have 2 processors) but it runs 5 times
slower (e.g. 15s instead of 3s)! Everything else seems to be okay: If i
use 1 thread, I have a total CPU usage of 50%, if use two threads I get
100%.
The best thing is: if I limit the JVM to one processor by setting the
CPU affinity for the process, but again take 2 threads, the calculation
runs only 2 times slower (6s instead of 3s for one thread).

My current current diagnosis is that it may have something to do with
the CPU cache. I searched the Internet for similar problems and found
the terms "CPU Cache trashing" and "Ping-Pong-Effect": the processors
always switch between the two threads and by doing so their CPU cache
gets flushed every time.

Does anyone have an idea?

Best regards,
Philipp Kayser.

public class Test
{
private final int number_of_threads = 2;
private static int number_of_finished_threads;
private Thread threads[] = new Thread[number_of_threads];
double result[] = new double[number_of_threads];
private Thread main_thread;

private class CalculationThread extends Thread
{
 int thread_number;

 CalculationThread(int n)
 {
  super();
  thread_number = n;
 }

 public void run()
 {
  try
  {
   synchronized (this)
   {
    while (true)
    {
     wait();

     int n = 0;

     for (n = 0; n < 600000000; n++)
      if (n % number_of_threads == thread_number)
       result[thread_number] += Math.sqrt(n);

     synchronized (main_thread)
     {
      number_of_finished_threads++;
      main_thread.notify();
     }
    }
   }
  }
  catch (InterruptedException e)
  {
  }
 }
}

private void multithreaded_calculation()
{
 synchronized (main_thread)
 {
  number_of_finished_threads = 0;

  int i;
  for (i = 0; i < number_of_threads; i++)
  {
   synchronized (threads[i])
   {
    threads[i].notify();
   }
  }

  do
  {
   try
   {
    main_thread.wait();
   }
   catch (InterruptedException e)
   {
   }
  }
  while (number_of_finished_threads < number_of_threads);

  double total_result = 0;
  for (i = 0; i < number_of_threads; i++)
   total_result += result[i];

  System.out.println(total_result);
 }
}

private  void test()
{
 main_thread = Thread.currentThread();

 int i;
 for (i = 0; i < number_of_threads; i++)
 {
  threads[i] = new CalculationThread(i);
  threads[i].setPriority(Thread.NORM_PRIORITY);
  threads[i].setDaemon(true);
  threads[i].start();
 }

 try
 {
  Thread.sleep(1000);
 }
 catch (InterruptedException e)
 {
 }

 long t0 = new Date().getTime();

 multithreaded_calculation();

 long t1 = new Date().getTime();

 System.out.println(((double)t1 - t0)/1000);
}

public static void main(String[] args) {
 new Test().test();
}
}
Thomas Hawtin - 05 Feb 2006 20:20 GMT
> I wrote a small program to test scalability in a multiprocessor
> environment (in my case an Athlon 64 X2). I included the source below.
[quoted text clipped - 7 lines]
> CPU affinity for the process, but again take 2 threads, the calculation
> runs only 2 times slower (6s instead of 3s for one thread).

> My current current diagnosis is that it may have something to do with
> the CPU cache. I searched the Internet for similar problems and found
> the terms "CPU Cache trashing" and "Ping-Pong-Effect": the processors
> always switch between the two threads and by doing so their CPU cache
> gets flushed every time.

Highly unlikely. Your program actively uses very little memory, so it
isn't going to be a problem with cache capacity.

>  private class CalculationThread extends Thread

It is rarely a good idea to override Thread. There is no need for
inheritance.

>   public void run()
>   {
>    try
>    {
>     synchronized (this)
>     {

Using a complex object like Thread as a lock can be a bad idea. Perhaps
the class itself uses it as a lock. In the case of Thread, Sun JRE does.

>      while (true)
>      {
>       wait();

When they say always put wait in a while loop, don't do it like this.
You are not guaranteed that the wait wont wake up early. Check a
condition in an enclosing while loop.

>       int n = 0;
>
>       for (n = 0; n < 600000000; n++)
>        if (n % number_of_threads == thread_number)
>         result[thread_number] += Math.sqrt(n);

Although you would expect the square root to dominate, this loop is
simpler if number_of_threads is one. Exactly how it will get optimised,
I have no idea.

Some multicore chips share FPUs. Niagra does, but I don't think Athlons
do. Such code is obviously not going to work so well on shared FPU
processors.

>    threads[i].setDaemon(true);

Making the threads daemon only masks your bug above, where threads do
not exit.

>    threads[i].start();
>   }
[quoted text clipped - 6 lines]
>   {
>   }

Looks odd. I guess you don't want to include the thread start up time.
The other issue is that you have a potential deadlock.

>   long t0 = new Date().getTime();

System.currentTimeMillis() is more conventional. Or System.nanoTime()
from 1.5.

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

Philipp Kayser - 05 Feb 2006 21:46 GMT
Hi Thomas,

thanks for your response.

>>  private class CalculationThread extends Thread
> It is rarely a good idea to override Thread. There is no need for
> inheritance.
> ...
> Using a complex object like Thread as a lock can be a bad idea. Perhaps
> the class itself uses it as a lock. In the case of Thread, Sun JRE does.
I changed class CalculationThread to be of type Runnable and also
changed main_thread to be of type Object. So i removed all locks on
Thread objects. Same result as before.

>>       for (n = 0; n < 600000000; n++)
>>        if (n % number_of_threads == thread_number)
>>         result[thread_number] += Math.sqrt(n);
> Although you would expect the square root to dominate, this loop is
> simpler if number_of_threads is one. Exactly how it will get optimised,
> I have no idea.
I also considered this. I changed the loop to:

for (n = 0; n < 30000; n++)
if (n % number_of_threads == thread_number)
 {
  double a = 2;
  for (int p = 0; p < n; p++)
  {
   a += Math.sqrt(a);
   result[thread_number] += a;
  }
 }

I dont think the inner loop can be optimized. The result changed a bit:

2 CPUs / 1 Threads : 8.75s
2 CPUs / 2 Threads : 9.35s

Although the result is better than before, 2 CPUs are again slower than
1 CPU. I would expect 2 CPUs too be nearly 2x faster than 1 CPU in such
a simple algorithm.

Best regards,
Philipp Kayser.
Daniel Dyer - 05 Feb 2006 22:12 GMT
> I dont think the inner loop can be optimized. The result changed a bit:
>
[quoted text clipped - 7 lines]
> Best regards,
> Philipp Kayser.

Are you running your app with the -server option?  Is there any  
significant difference between performance with the server and client VMs?

Dan.

Signature

Daniel Dyer
http://www.dandyer.co.uk

Philipp Kayser - 05 Feb 2006 22:56 GMT
Hi,

> Are you running your app with the -server option?  Is there any
> significant difference between performance with the server and client VMs?
I think there is no server JVM for Windows. The "-server" option gives
me an error.

Best regards,
Philipp.
Daniel Dyer - 05 Feb 2006 23:51 GMT
> Hi,
>
[quoted text clipped - 3 lines]
> I think there is no server JVM for Windows. The "-server" option gives
> me an error.

For some reason the server VM is only included with the JDK, not the JRE.  
So if you use the JRE java.exe it will give you an error with "-server".

Dan.

Signature

Daniel Dyer
http://www.dandyer.co.uk

Philipp Kayser - 06 Feb 2006 20:28 GMT
Hi,

> For some reason the server VM is only included with the JDK, not the
> JRE.   So if you use the JRE java.exe it will give you an error with
> "-server".
Ok, this works. But I noticed no better performance for the
"false-sharing" problem.
However, the server JVM seems to be significant faster (about 25%) than
the client JVM. Thanks for this hint.

Best regards,
Philipp.
Roedy Green - 06 Feb 2006 00:51 GMT
>I think there is no server JVM for Windows. The "-server" option gives
>me an error.

You have to use the java.exe in the JDK not the JRE. It understands
-server.  I have been using it for the great signum bakeoff.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Philipp Kayser - 05 Feb 2006 22:14 GMT
Hi again,

I think I found the problem. I changed the loop again to:

for (n = 0; n < 10000; n++)
if (n % number_of_threads == thread_number)
{
 double a = 2;
 for (int p = 0; p < n; p++)
 {
   a += Math.sin(a);
 }
 result[thread_number] += a;
}

Now I get satisfying resuls:

2 CPUs / 1 Thread  : 10.7s
2 CPUs / 2 Threads : 5.469s

I think the problem is the read/write-access to the result-array. If the
result-array is changed on one CPU, the cache on the second CPU gets
invalid. By pulling out the result-assignment of the inner loop, the
amount of cache invalidations are reduced greatly.

Best regards,
Philipp Kayser.
blmblm@myrealbox.com - 05 Feb 2006 22:24 GMT
>Hi again,
>
[quoted text clipped - 20 lines]
>invalid. By pulling out the result-assignment of the inner loop, the
>amount of cache invalidations are reduced greatly.

"False sharing"!  wish I'd noticed this post before composing my reply
of a few minutes ago.

But the comment (in the other post) about improving the loop might
still be worthwhile.

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
blmblm@myrealbox.com - 06 Feb 2006 09:25 GMT
>>Hi again,
>>
[quoted text clipped - 23 lines]
>"False sharing"!  wish I'd noticed this post before composing my reply
>of a few minutes ago.

Following up a little, after doing some experiments on a four-processor
machine at work ....:

My idea for getting rid of the "false sharing" cache problem was
to do the original calculation summing into a local variable,
and then add into result[thread_number] at the end.  That should
produce similar results to what you're doing above ....

And it does help -- performance changes from "the more threads, the
slower the program" to reasonable speedups with 2 and 4 threads,
compared to 1.

But ....

>But the comment (in the other post) about improving the loop might
>still be worthwhile.

By accident I discovered that making this change to the original
calculation, *instead of* making the switch to summing into a local
variable *actually produces a faster program*, with good speedups
for 2 and 4 threads.

I don't understand this at all.  Maybe I've made some stupid
blunder.  Otherwise -- hm, I don't know!

Below is my modified version of your code.  I did make some other
changes -- simplified to have the main thread use "join" to wait
for the calculation threads (did you know you could do that?) and
to move all thread activity into the timed part of the code so
I don't have to do the slightly complicated stuff to wait until all
threads are started before starting the timed part ...

public class Test
{
 private static int number_of_threads;
 private Thread threads[] = new Thread[number_of_threads];
 double result[] = new double[number_of_threads];

 private class CalculationThread implements Runnable
 {
   int thread_number;

   CalculationThread(int n)
   {
     thread_number = n;
   }

   public void run()
   {
     /*
     double local_result = 0.0;
     for (int n = 0; n < 600000000; n++)
       if (n % number_of_threads == thread_number)
         local_result += Math.sqrt(n);
     result[thread_number] += local_result;
     */
     // this actually is faster! ????
     for (int n = thread_number; n < 600000000; n+=number_of_threads)
       result[thread_number] += Math.sqrt(n);
   }
 }

 private void multithreaded_calculation()
 {
   for (int i = 0; i < number_of_threads; i++)
   {
     threads[i] = new Thread(new CalculationThread(i));
     //threads[i].setPriority(Thread.NORM_PRIORITY);
     //threads[i].setDaemon(true);
     threads[i].start();
   }

   try
   {
     for (int i = 0; i < number_of_threads; i++)
       threads[i].join();
   }
   catch (InterruptedException e)
   {
   }

   double total_result = 0;
   for (int i = 0; i < number_of_threads; i++)
     total_result += result[i];

   System.out.println(total_result);
 }

 private  void test()
 {
   long t0 = System.currentTimeMillis();

   multithreaded_calculation();

   long t1 = System.currentTimeMillis();

   System.out.println(((double)t1 - t0)/1000);
 }

 public static void main(String[] args) {
   number_of_threads = Integer.parseInt(args[0]);
   new Test().test();
 }
}

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
Philipp Kayser - 06 Feb 2006 10:13 GMT
Hi,

> My idea for getting rid of the "false sharing" cache problem was
> to do the original calculation summing into a local variable,
[quoted text clipped - 4 lines]
> slower the program" to reasonable speedups with 2 and 4 threads,
> compared to 1.
Yes, I can duplicate this results. The two Thread stacks (where local
variables are being held) seem to be far enough away from each other to
avoid the "false sharing"-problem.

> But ....
> By accident I discovered that making this change to the original
[quoted text clipped - 3 lines]
> I don't understand this at all.  Maybe I've made some stupid
> blunder.  Otherwise -- hm, I don't know!
I can also duplicate this result. And I also don't have an explanation
for it. Maybe it has to do with code optimization because of the now
simpler code: the JVM holds "result[thread_number]" in a machine
register and later writes the value back to memory when the calculation
has finished.

> Below is my modified version of your code.  I did make some other
> changes -- simplified to have the main thread use "join" to wait
> for the calculation threads (did you know you could do that?)...
My intention was to avoid unnecessary thread creations. In this case it
is no problem because the threads are only started once. But I am
currently working on parallelising a bigger program, where this
"multithreaded_calculation" is started over and over again.
But maybe thread creation is not a big topic for an OS so this
optimization is not really needed.
To wait 1s for the threads to reach the wait() is  very hacky I know,
but I found no better solution (except of the join-solution).

Best regards,
Philipp.
blmblm@myrealbox.com - 06 Feb 2006 11:11 GMT
[ snip ]

>> But ....
>> By accident I discovered that making this change to the original
[quoted text clipped - 8 lines]
>register and later writes the value back to memory when the calculation
>has finished.

That's a very plausible explanation -- it's maybe a little puzzling
that this optimization wouldn't have also been done with the original
version, but .... <shrug>.

>> Below is my modified version of your code.  I did make some other
>> changes -- simplified to have the main thread use "join" to wait
[quoted text clipped - 7 lines]
>To wait 1s for the threads to reach the wait() is  very hacky I know,
>but I found no better solution (except of the join-solution).

There's some nice stuff in the java.util.concurrent package, added
in Java 1.5 (5.0), for creating "thread pools", which sounds like
what you want.

Doing some quick Googling, here's a tutorial that seems reasonable,
with some examples:

 http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/concurr
encyTools.html


| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
blmblm@myrealbox.com - 05 Feb 2006 22:22 GMT
>> I wrote a small program to test scalability in a multiprocessor
>> environment (in my case an Athlon 64 X2). I included the source below.
[quoted text clipped - 16 lines]
>Highly unlikely. Your program actively uses very little memory, so it
>isn't going to be a problem with cache capacity.

But there might be a problem with something called "false sharing",
which happens when different threads access variables that are close
together in memory -- as happens here with the "results" array.

I'm not sure I want to try to explain it (don't have time and might get
it wrong anyway), but a Google search for "cache" and "false sharing"
brings up some useful links.

[ snip ]

>>       for (n = 0; n < 600000000; n++)
>>        if (n % number_of_threads == thread_number)

You could simplify this to

      for (n = number_of_threads; n < 600000000; n+=number_of_threads)

which I'd think would help a little.

[ snip ]

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
Roedy Green - 06 Feb 2006 00:49 GMT
On Sun, 05 Feb 2006 20:33:26 +0000, Thomas Hawtin
<usenet@tackline.plus.com> wrote, quoted or indirectly quoted someone
who said :

>Some multicore chips share FPUs. Niagra does, but I don't think Athlons
>do. Such code is obviously not going to work so well on shared FPU
>processors.

Is your Athlon a hyperthread or a true dual cpu or something between?

For cpu bound stuff, hyperthreading won't help since you have no true
extra computing horsepower.

We are going to need some new sort of number to measure CPU speed now
that raw clock speed or even aggregate MIPs has become  meaningless.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Knute Johnson - 06 Feb 2006 06:06 GMT
> Hi,
>
[quoted text clipped - 138 lines]
>  }
> }

Saw your post and thought it was interesting.  I too duplicated your
results but I don't have a clue as to what the problem is.  I wrote a
couple of tests of my own to try and got more predictable results.
Although not quite.

The first program below times the calculations as yours did but I think
it is still not optimum for this test.

The second program calculates the number of calculation cycles per ms
which I think is more to the point.  I got very similar results to the
first test though and that is that one thread is definitely slower than
two threads but more than that don't really improve performance.  On my
single processor machine, every increase in threads reduced the number
of calculations that could be performed although again not as
dramatically as I expected with the increase in number of threads.

I tested these on my 1.6Ghz P4 and on a dual Xeon 2.8Ghz.  The Xeons are
dual core and as I added threads I could see the additional cores being
used.  Another interesting thing, if I used one thread the processor
usage showed 25%, two threads 50% and so on.  The P4 is running Windows
XP Home and the Xeon is running 32bit Windows XP Pro.

I do think that the comment about the floating point processor could be
significant too.

Anyway a very interesting post.

public class test2 implements Runnable {
    int numberOfThreads;
    int calculations = 100000000;
    Thread[] thread;

    public test2(String[] args) {
        numberOfThreads = Integer.parseInt(args[0]);
        thread = new Thread[numberOfThreads];

        long then = System.currentTimeMillis();

        for (int i=0; i<numberOfThreads; i++) {
            thread[i] = new Thread(this);
            thread[i].start();
        }

        try {
            for (int i=0; i<numberOfThreads; i++)
                thread[i].join();
        } catch (InterruptedException ie) {
            ie.printStackTrace();
        }

        System.out.println(System.currentTimeMillis() - then);
    }

    public void run() {
        double d;
        int n = calculations / numberOfThreads;
        for (int i=1; i<=n; i++) {
            d = Math.sqrt(i*1.234);
            double t = d / i;
        }
    }

    public static void main(String[] args) {
        new test2(args);
    }
}

import java.util.concurrent.*;

public class test3 implements Runnable {
    volatile long calculations;
    volatile boolean runFlag = true;
    Object o = new Object();
    Semaphore sem;

    public test3(String[] args) {
        int numberOfThreads = Integer.parseInt(args[0]);
        Thread[] thread = new Thread[numberOfThreads];
        sem = new Semaphore(numberOfThreads);
        try {
            sem.acquire(numberOfThreads);

            for (int i=0; i<numberOfThreads; i++) {
                thread[i] = new Thread(this);
                thread[i].start();
            }
            Thread.sleep(2000);  // wait till all threads are created

            long then = System.currentTimeMillis();
            sem.release(numberOfThreads);
            Thread.sleep(20000);
            runFlag = false;
            long now = System.currentTimeMillis();

            System.out.println(calculations/(now-then));
        } catch (InterruptedException ie) {
            ie.printStackTrace();
        }
    }

    public void run() {
        try {
            sem.acquire();
            while (runFlag) {
                double d = Math.sqrt(1234.56789);
                double t = Math.tan(d);
                ++calculations;
            }
        } catch (InterruptedException ie) {
            ie.printStackTrace();
        }
    }

    public static void main(String[] args) {
        new test3(args);
    }
}

Signature

Knute Johnson
email s/nospam/knute/

blmblm@myrealbox.com - 06 Feb 2006 09:43 GMT
[ snip ]

>I wrote a
>couple of tests of my own to try and got more predictable results.
[quoted text clipped - 10 lines]
>of calculations that could be performed although again not as
>dramatically as I expected with the increase in number of threads.

[ snip ]

Interesting second test program (test3 below).  Some questions, though:

Why do you have this?

     Thread.sleep(2000);  // wait till all threads are created

This seems like an ugly hack to avoid writing proper code to wait
until the threads are created?  ugly hacks are not terrible in
quick-and-dirty code, but still.

Also, "calculations" is shared among threads, but you're not ensuring
one-at-a-time access with "synchronized" or some other mechanism.
Why do you think this will work?  You do declare it "volatile",
but as I understand it, this only ensures atomic loads and stores,
while you also have a "++calculations".  I would have said this
was not guaranteed to be atomic on all processors.  No?

>import java.util.concurrent.*;
>
[quoted text clipped - 46 lines]
>     }
>}

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
Knute Johnson - 06 Feb 2006 16:46 GMT
> Interesting second test program (test3 below).  Some questions, though:
>
[quoted text clipped - 5 lines]
> until the threads are created?  ugly hacks are not terrible in
> quick-and-dirty code, but still.

Instead of the answer I want to give (which is bite me) I'll tell you
why that is in there.  I wanted to make sure that I wasn't using up a
bunch of time in the thread creation.  Turns out it doesn't really make
any significant difference.  And I was in a hurry and I know it was an
ugly hack and it was simple(ton) :-).

> Also, "calculations" is shared among threads, but you're not ensuring
> one-at-a-time access with "synchronized" or some other mechanism.
> Why do you think this will work?  You do declare it "volatile",
> but as I understand it, this only ensures atomic loads and stores,
> while you also have a "++calculations".  I would have said this
> was not guaranteed to be atomic on all processors.  No?

The Java Language Specification, Third Edition
17.4.4 Synchronization Order
Every execution has a synchronization order. A synchronization
order is a total order over all of the synchronization actions
of an execution. For each thread t, the synchronization order of
the synchronization actions (§17.4.2) in t is consistent with
the program order (§17.4.3) of t.

Synchronization actions induce the synchronized-with relation on
actions, defined as follows:

...
A write to a volatile variable (§8.3.1.4) v synchronizes-with all
subsequent reads of v by any thread (where subsequent is defined
according to the synchronization order).

Looks synchronized to me.

Signature

Knute Johnson
email s/nospam/knute/

Thomas Hawtin - 06 Feb 2006 17:39 GMT
> A write to a volatile variable (§8.3.1.4) v synchronizes-with all
> subsequent reads of v by any thread (where subsequent is defined
> according to the synchronization order).
>
> Looks synchronized to me.

But increment has a read and a separate write. Read the value.
Increment. Write the new value.

Best to use java.util.concurrent.AtomicLong (from 1.5) for these sort of
things.

http://download.java.net/jdk6/docs/api/java/util/concurrent/atomic/AtomicLong.ht
ml#incrementAndGet
()

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

Knute Johnson - 06 Feb 2006 21:29 GMT
>> A write to a volatile variable (§8.3.1.4) v synchronizes-with all
>> subsequent reads of v by any thread (where subsequent is defined
[quoted text clipped - 11 lines]
>
> Tom Hawtin

Tom:

Yes, I see what you are saying.  I'll rerun those tests with it
synchronized and see what happens.  If they were in fact passing between
the read and write then the results would show lower performance for
more threads.

Signature

Knute Johnson
email s/nospam/knute/

Knute Johnson - 07 Feb 2006 00:35 GMT
I tried synchronizing the access to the calculations variable but that
made it run very poorly, it was interrupting the calculations.  So I
rewrote it in a different way and discovered some more interesting
things.  First this program is faster even on a single processor machine
when more threads are used.  I don't think that my XP Home can make a
100 threads but if I watch on the dual Xeon machine, it shows a 100
threads on the Task Manager.

I would be curious to know if anybody else duplicates my results.  And
if they have an idea why it would do more calculations on a single
processor machine with more threads.

1.6Ghz XP Home SP2

Threads        Calculations per ms
   1                  1155
   2                  1326
   3                  1367
   4                  1401
  10                  1444

2.8Ghz dual Xeon XP Pro SP2

Threads         Calculations per ms
   1                  2402
   2                  4650
   3                  6141
   4                  7187
  10                  7608
 100                  8224
 200                  8281

import java.util.concurrent.*;

public class test3 implements Runnable {
    public static volatile boolean runFlag = true;
    public long calculations;

    public test3() {
        new Thread(this).start();
    }

    public void run() {
        while (runFlag) {
            double d = Math.sqrt(1234.56789);
            double t = Math.tan(d);
            ++calculations;
        }
    }

    public static void main(String[] args) {
        long totalCalcs = 0;
        int numberOfThreads = Integer.parseInt(args[0]);
        test3[] tests = new test3[numberOfThreads];

        long then = System.currentTimeMillis();
        for (int i=0; i<numberOfThreads; i++) {
            tests[i] = new test3();
        }
        try {
            Thread.sleep(10000);
        } catch (InterruptedException ie) {
            System.out.println(ie);
        }
        tests[0].runFlag = false;
        long now = System.currentTimeMillis();

        for (int i=0; i<numberOfThreads; i++)
            totalCalcs += tests[i].calculations;

        System.out.println(totalCalcs/(now-then));
    }
}

Signature

Knute Johnson
email s/nospam/knute/

Alex Buell - 07 Feb 2006 01:00 GMT
On Mon, 06 Feb 2006 16:35:31 -0800 Knute Johnson
<nospam@ljr-2.frazmtn.com> waved a wand and this message magically
appeared:

> 1.6Ghz XP Home SP2
>
[quoted text clipped - 4 lines]
>     4                  1401
>    10                  1444

1.13Ghz Pentium III, Linux 2.6.15

1    2442
2    2433
3    2430
4    2409
10    2288

Is that useful for you?

Signature

http://www.munted.org.uk

"Honestly, what can I possibly say to get you into my bed?" - Anon.

Knute Johnson - 07 Feb 2006 22:52 GMT
> On Mon, 06 Feb 2006 16:35:31 -0800 Knute Johnson
> <nospam@ljr-2.frazmtn.com> waved a wand and this message magically
[quoted text clipped - 18 lines]
>
> Is that useful for you?

Thanks.  That's more like what I expected.

Signature

Knute Johnson
email s/nospam/knute/

Thomas Hawtin - 07 Feb 2006 02:25 GMT
> I would be curious to know if anybody else duplicates my results.  And
> if they have an idea why it would do more calculations on a single
> processor machine with more threads.
>
> 1.6Ghz XP Home SP2

I can't reproduce that result. On both a 2.26 GHz Celeron D and 266 MHz
PII, running Fedora Core 4. Odd. Perhaps there is another program eating
cycles and Windows allocates more to the program with most threads...

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

Philipp Kayser - 07 Feb 2006 09:10 GMT
Hi,

> I would be curious to know if anybody else duplicates my results.  And
> if they have an idea why it would do more calculations on a single
> processor machine with more threads.

I can reproduce the results:

AMD 64 X2 3800+ XP Pro SP2

Threads        Calculations per ms
  1                  3650
  2                  6784
  3                  6780
  4                  6800
 10                  7010

It is interesting that I do not get 2x3650=7300c/ms for 2 Threads. Maybe
there are still some conflicts here due to "false-sharing" between some
Thread stacks and/or Thread objects. And when there are more threads,
the probability for cache invalidation is reduced -> more calculations
per second.

Best regards,
Philipp.
Philipp Kayser - 07 Feb 2006 09:25 GMT
Addition to my previous post:

AMD 64 X2 3800+ XP Pro SP2, Affinity to 1. CPU

Threads        Calculations per ms
  1                  3660
  2                  3708
  3                  3717
  4                  3771
 10                  3775

There is also an increase here, but not as big as in your test. Maybe
because of the massive Thread creation: the OS cannot hold the
Sleep(10000), but instead holds 11s or so.

Best regards,
Philipp.
Knute Johnson - 07 Feb 2006 23:02 GMT
> Addition to my previous post:
>
[quoted text clipped - 13 lines]
> Best regards,
> Philipp.

Those are both interesting, thanks.  That AMD looks like it zips right
along too.

Signature

Knute Johnson
email s/nospam/knute/

Luc The Perverse - 07 Feb 2006 11:26 GMT
> Hi,
>
[quoted text clipped - 18 lines]
> the probability for cache invalidation is reduced -> more calculations
> per second.

I will tell you this.   If ANYTHING is being swapped to memory then your two
cores are going to have to share a memory controller/and sharing is always
slower than running alone.

I take it you don't expect this to be happening, but if you are running XP
PRO then windows will force your process into the background to do OS
related stuff, even if just briefly. This could easily account for the
slowing.

--
LTP

:)
Roedy Green - 07 Feb 2006 12:46 GMT
On Tue, 7 Feb 2006 04:26:15 -0700, "Luc The Perverse"
<sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly
quoted someone who said :

>I will tell you this.   If ANYTHING is being swapped to memory then your two
>cores are going to have to share a memory controller/and sharing is always
>slower than running alone.

Do this multicore chips, do they  typically share the same SRAM cache
or are they busily worrying about stale copies in each other's caches?

Do they share a port to the SRAM or is the SRAM multiported?
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Luc The Perverse - 07 Feb 2006 23:46 GMT
> On Tue, 7 Feb 2006 04:26:15 -0700, "Luc The Perverse"
> <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly
[quoted text clipped - 9 lines]
>
> Do they share a port to the SRAM or is the SRAM multiported?

According to a randomly selected product information page they appear to
each have completely seperate L1 and L2 caches.   I think it is truly two
chips on one die, with an integrated memory controller independant of the
two.

Although I will admit that I don't really understand your last sentence.

Signature

LTP

:)
Roedy Green - 08 Feb 2006 07:55 GMT
On Tue, 7 Feb 2006 16:46:25 -0700, "Luc The Perverse"
<sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly
quoted someone who said :

>Although I will admit that I don't really understand your last sentence.

In the olden days Univac and Burroughs mainframes had multiple CPUs, a
novelty at the time. You could pay extra for multiported "core" (which
might have been iron cores, plated wire or RAM) so that each CPU could
access core without waiting for the others.  The feature was called
"multiported memory".  I wondered if the equivalent existed today in
the access to SRAM and RAM.

I believe IBM mainframes also had a form of multiported memory so that
I/O channels had a back door into memory that did not block the CPU.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Chris Uppal - 07 Feb 2006 09:36 GMT
> I would be curious to know if anybody else duplicates my results.  And
> if they have an idea why it would do more calculations on a single
[quoted text clipped - 8 lines]
>     4                  1401
>    10                  1444

Odd.  I find the same thing (I changed your code to create the threads outside
the timed section, but to start() them inside it):

WinXP Pro, sp1. 1.5 GHz uniprocessor.  JDK 1.5.0-b64:
   1    2794
   2    2816
   3    2841
   4    2834
   10  2845

Win2K, sp4.  2.5 GHz uniprocessor.  JDK 1..5.0_05-b05:
   1    2078
   2    2107
   3    2104
   4    2115
   10  2123

On same Win2K box as above but using VMWare to run SuSE 9 + JDK 1.4.2-b28:
   1    1715
   2    1705
   3    1783
   4    1815
   10   2029

Note that the last set of figures show that it cannot be an artefact of Windows
scheduling since Linux processes running inside VMWare are not visible to the
Windows scheduler, nor is memory use, etc.

I don't have an explanation (yet?)  I rather suspect that at least part of it
is to do with when/how the JIT gets around to doing optimisation.  I'll
probably investigate that later.

   -- chris
Chris Uppal - 07 Feb 2006 10:15 GMT
I wrote:

> I don't have an explanation (yet?)  I rather suspect that at least part
> of it is to do with when/how the JIT gets around to doing optimisation.
> I'll probably investigate that later.

Me, back again sooner than I'd planned.

First off, I realised that my modified code accidentally left the old
constructor in, and so started a new thread in each constructor /as well/ as
the ones I was using for testing -- stupid.  Still, it doesn't seem to have
affected the results significantly.

However, putting a loop around the test loop, to allow the JITer more time to
do its thing, produces the following (all on the same 1.5 GHz WinXP box as
before):

Threads  Run1    Run2    ....
1            2757    2906    2910    2919    2907
2            2806    2907    2892    2901    2904
3            2815    2907    2909    2909    2905
4            2826    2902    2908    2903    2913
10          2848    2910    2911    2909    2915

As you'll see, by the third time around the loop, the apparent corellation
between the calculations/sec and the number of threads had vanished into the
noise.  So I think the effect must be purely a question of how early the JITer
kicks in.

   -- chris
Roedy Green - 07 Feb 2006 12:38 GMT
On Tue, 7 Feb 2006 10:15:50 -0000, "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>As you'll see, by the third time around the loop, the apparent corellation
>between the calculations/sec and the number of threads had vanished into the
>noise.  So I think the effect must be purely a question of how early the JITer
>kicks in.

See http://mindprod.com/jgloss/benchmark.html

I did a warm up "conformance test" partly whose job was to ensure the
jit optimisations had been done before I starting timing anything.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Gordon Beaton - 07 Feb 2006 10:37 GMT
> I would be curious to know if anybody else duplicates my results.
> And if they have an idea why it would do more calculations on a
> single processor machine with more threads.

I don't see that here, these curves are *flat*:

   n       A       B      C
   1     2983    2313    361
   2     2983    2307    356
   3     2966    2338    348
   4     3008    2332    355
   5     2992    2337    351
   6     2989    2327    339
   7     3002    2329    356
   8     2986    2327    358
   9     2983    2336    356
  10     2976    2334    350
 100     2983    2330    358
 
A: 3.2 GHz Pentium, jdk 1.5.0, Fedora Linux 2
B: 1.8 GHz Opteron, jdk 1.4.2, Fedora Linux 3
C: 140 MHz UltraSparc, Blackdown Java 1.4.1 beta, Aurora Linux 1.0

Interesting to note is that the Pentium manages only 8 times the work,
despite a nearly 23x "speed" advantage over the Ultra 1!

/gordon

Signature

[  do not email me copies of your followups  ]
g o r d o n + n e w s @  b a l d e r 1 3 . s e

Chris Smith - 07 Feb 2006 16:28 GMT
> A: 3.2 GHz Pentium, jdk 1.5.0, Fedora Linux 2
> B: 1.8 GHz Opteron, jdk 1.4.2, Fedora Linux 3
> C: 140 MHz UltraSparc, Blackdown Java 1.4.1 beta, Aurora Linux 1.0
>
> Interesting to note is that the Pentium manages only 8 times the work,
> despite a nearly 23x "speed" advantage over the Ultra 1!

Not too surprising, though.  Clock speed wars in modern day are
basically marketing nonsense.  Performance improvements come mainly from
cache architectures, pipelining strategies, etc.  Since average computer
users (and even average high-tech gamers) don't know enough to compare
those things, manufacturers keep up the charade by bumping up the clock
speeds, as well.  In truth, the CPU spends most of those blazingly fast
clock cycles waiting on RAM.

Signature

www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

Knute Johnson - 07 Feb 2006 22:57 GMT
>> I would be curious to know if anybody else duplicates my results.
>> And if they have an idea why it would do more calculations on a
[quoted text clipped - 23 lines]
>
> /gordon

I was surprised that my 2.8Ghz Xeon running the test with only one
thread wasn't that much faster than the 1.6Ghz P4.

Signature

Knute Johnson
email s/nospam/knute/

blmblm@myrealbox.com - 07 Feb 2006 10:28 GMT
>> A write to a volatile variable (§8.3.1.4) v synchronizes-with all
>> subsequent reads of v by any thread (where subsequent is defined
[quoted text clipped - 4 lines]
>But increment has a read and a separate write. Read the value.
>Increment. Write the new value.

Exactly the point I was trying to make earlier.  Thanks.

>Best to use java.util.concurrent.AtomicLong (from 1.5) for these sort of
>things.
>
>http://download.java.net/jdk6/docs/api/java/util/concurrent/atomic/AtomicLong.ht
ml#incrementAndGet
()

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
blmblm@myrealbox.com - 07 Feb 2006 10:23 GMT
>> Interesting second test program (test3 below).  Some questions, though:
>>
[quoted text clipped - 8 lines]
>Instead of the answer I want to give (which is bite me) I'll tell you
>why that is in there.  

Well, I was in too much of a hurry to send that post out, and didn't
phrase the question as diplomatically as I might have.  Sorry about
that.  The question I wanted to ask was whether there was some subtle
reason for the apparently ugly hack (one that I wasn't getting),
or whether it was just a quick-and-dirty thing.  NTTAWWT, in some
contexts, and this is one of them.  

>I wanted to make sure that I wasn't using up a
>bunch of time in the thread creation.  

Meaning that you didn't want to start timing until all the threads
were created, right?  Yes, that's what I figured.

>Turns out it doesn't really make
>any significant difference.  And I was in a hurry and I know it was an
>ugly hack and it was simple(ton) :-).

[ snip ]

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
Nigel Wade - 06 Feb 2006 11:25 GMT
> Hi,
>
[quoted text clipped - 138 lines]
>  }
> }

I would say it's almost certainly a dirty cache problem due to SMP access to
shared memory.

Just look at what's happening in the tightest part of the loop. When one thread
writes a value in the array the entire cache line containing that particular
element of the array is marked dirty (don't forget that in a SMP system this
affects the cache line in other processors). This means that before any other
thread can either read or write the same cache line that the cache line must be
re-read from main memory. Since each thread is writing to consecutive elements
of the array, they are almost guaranteed to all be writing to the same cache
line.

This is a standard problem in multi-threaded SMP applications. Get each thread
to keep the updating value in thread local storage, and only fill in the array
when the computation is complete. Compare run times.

Signature

Nigel Wade, System Administrator, Space Plasma Physics Group,
           University of Leicester, Leicester, LE1 7RH, UK
E-mail :    nmw@ion.le.ac.uk
Phone :     +44 (0)116 2523548, Fax : +44 (0)116 2523555

slippymississippi@yahoo.com - 08 Feb 2006 01:01 GMT
I think you've got the wrong idea of the benefits of threads.

Most markedly, threads will reduce your memory usage.  To put it very
simply, a thread is like a lightweight process, so you have reduced
memory overhead when compared to processes.  In other words, it's a lot
less memory intensive to have one application supporting 20 threads
reading input over sockets than configuring 20 applications to read
that data.

As a result of these memory benefits, you get performance benefits as a
byproduct, because the OS has to page swap a lot less often.

I'm coming from a C++ background where I simply created my own threads,
established a mutex for my critical data and/or resources (queue), and
let fly.  So I'm not real fond of Java's screwy implementation of
threads, where every object on the face of the planet has its own
mutex, and Java gives you a lot of rope to hang  yourself waiting for
the mutex availability.  But one thing I can see looking at your code
is that you're using a lot of thread setup to calculate a very simple
result.  The correct analogue is to perform the same task with
processes (does Java even support fork?).

A thread is a very useful tool.  A hammer is a very useful tool, too,
but you wouldn't want to cut a tree down with it.  In the same manner,
you want to use threads for what they're designed: server-type
processes that read data from multiple inputs or perform other
asynchronous tasks with less overhead than processes.
slippymississippi@yahoo.com - 08 Feb 2006 04:07 GMT
Another thing I see in  your code that sends up red flags is
"synchronized(this)" on your thread class ... the correct usage for
mutex protection is something like a library card system.  You have a
central resource (book) that multiple threads (customers) want to use,
so you have a system of tracking when the resource is checked out
(library card).  One person takes the book, the librarian takes the
card from the book.  The next customer in line has to wait for the
librarian to return the card to the book and call "next!"
(notify/notifyAll) before the customer can begin reading.  Because you
want to limit the number of people waiting for books, you assign the
library card to the smallest atomic entity possible (book) rather than
larger objects (shelf, room, floor, library).

So, given my brief experience with Java, I believe you should be using
synchronize on a very small static singleton bean that encapsulates
your data and does very little else ... or simply enforce
synchronization (because you never know how someone else is going to
abuse your classes) by setting the data put/get methods to synchronized
on your bean.  And yes, unless the data is remarkably atomic, you have
to protect your getters as well as your setters, because there's
nothing stopping the OS from swapping out a thread in the middle of a
data update.

In general, the more atomic the data being protected by the mutex, and
the smaller the amount of code updating this data, the more efficient
will be your code.
blmblm@myrealbox.com - 09 Feb 2006 10:22 GMT
>Another thing I see in  your code that sends up red flags is
>"synchronized(this)" on your thread class ...

It's possible that this was being used not so much to guarantee
mutual exclusion as to allow use of wait/notify, which can only
be done from within a synchronized block.  (The OP's use of wait
and notify seemed a little confused to me, but I haven't thought
through the problem in enough detail to be sure.)

>the correct usage for
>mutex protection is something like a library card system.  You have a
[quoted text clipped - 21 lines]
>the smaller the amount of code updating this data, the more efficient
>will be your code.

Some good points here, but it might be as well to keep in mind
in critiquing the OP's code that it appears to have been a toy
program intended to find out how much performance improvement can
be achieved with threads and multiple processors, so some advice
that would be relevant in other contexts might not apply.  

(In fact, I think the OP's program could be rewritten to not use
any explicit synchronization except join() -- create the threads,
start timing, start the threads, wait for them to finish with join(),
combine results, stop timing.  If each thread writes results into
a separate element of an array, no data synchronization is needed.
Some caution is needed, though, to avoid "false sharing" of cache.)

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
blmblm@myrealbox.com - 09 Feb 2006 10:09 GMT
>I think you've got the wrong idea of the benefits of threads.
>
[quoted text clipped - 14 lines]
>mutex, and Java gives you a lot of rope to hang  yourself waiting for
>the mutex availability.  

Maybe it would make more sense if you regarded it not as a "screwy
implementation of threads" but as an implementation of (some of) the
"monitors" synchronization mechanism first proposed (AFAICT) by Per
Brinch Hansen and C.A.R. Hoare in the 1970s.

One thing Java gives you that thread libraries may not (I'm basing
this on imperfect recollections of using the Pthreads library) is a
reasonable mechanism for doing kinds of synchronization other than
simple mutual exclusion -- e.g., what you would need for a simple
producer/consumer setup (in which one thread produces data for
another to consume, and you would like the consumer thread to wait
until there is data to consume, preferably without busy-waiting).

>But one thing I can see looking at your code
>is that you're using a lot of thread setup to calculate a very simple
>result.  The correct analogue is to perform the same task with
>processes (does Java even support fork?).

Are you saying that, if the goal is improve performance by splitting
computation among processors, this is better done with processes
than threads?  Why?  In some ways processes communicating by
message-passing is a less error-prone way to "parallelize" code
(exploit multiple processors) than threads communicating via shared
memory, and of course if you don't *have* shared memory it's the only
choice.  But if you can do either one and want maximum performance,
it seems to me that threads are the way to go.  Do you disagree?

>A thread is a very useful tool.  A hammer is a very useful tool, too,
>but you wouldn't want to cut a tree down with it.  In the same manner,
>you want to use threads for what they're designed: server-type
>processes that read data from multiple inputs or perform other
>asynchronous tasks with less overhead than processes.

Or to improve performance if you have more than one processor.  (Or
even if you don't, and your code can be split up into threads in a way
that allows one thread to make progress while another thread waits
for something to happen, I/O for example -- i.e., if threads can be
used to "mask latency".)

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
Dimitri Maziuk - 09 Feb 2006 17:56 GMT
blmblm@myrealbox.com sez:

> Are you saying that, if the goal is improve performance by splitting
> computation among processors, this is better done with processes
> than threads?  Why?

Threads (as in "lightweight processes") were originally introduced
to avoid the overhead of process creation. Threads are not the
only solution: Plan 9 folks came up with low-overhead process
creation mechanism, Linux picked it up quickly. In a modern OS
with low-cost process creation mechanism the benefits of threads
(vs. processes) are small, and the overhead is an extra layer of
complexity.

This is a moot point in Java, of course, since Java only has
threads. There Is No Fork().

...  But if you can do either one and want maximum performance,
> it seems to me that threads are the way to go.  Do you disagree?

Two main sources of overhead are initial creation and
synchronization. If they are fully independent (no
synchronization) and long-running (creation cost amortized
to close to 0 over runtime), there should be no difference
between threads and processes performance-wise.

>>you want to use threads for what they're designed: server-type
>>processes that read data from multiple inputs or perform other
[quoted text clipped - 5 lines]
> for something to happen, I/O for example -- i.e., if threads can be
> used to "mask latency".)

Only if you have several execution streams that can run
asynchronously.

Anyway, the real deal is that threads don't economically scale
past 4 CPUs (4-way being the cusp of bang-per-buck curve) and
thus 4-16 threads: e.g. Sun will happily sell you a 24-CPU
cupboard for a cool million bucks, or 24 1-CPU pizza boxen for
about $18,000. Which means that if you're doing it for scalability,
you're way better off going fully distributed (hence, independent
processes) or with some specialized grid/cluster setup (probably
different programming techniques altogether).

Dima
Signature

We're sysadmins. Sanity happens to other people.                  -- Chris King

blmblm@myrealbox.com - 11 Feb 2006 18:51 GMT
>blmblm@myrealbox.com sez:
>>
[quoted text clipped - 9 lines]
>(vs. processes) are small, and the overhead is an extra layer of
>complexity.

Threads introduce an extra layer of complexity?  Hm.  I'm not sure
what you mean there ....:

I can believe that in current operating systems it's almost as fast
to create processes as to create threads.  Context switch times
might also need to be taken into account, but maybe there too the
difference is not really significant.

But my understanding is that a key difference between threads and
processes is that the former share an address space and the latter
do not, and this leads to rather different programming models --
communication via message-passing versus communication via shared
variables.  Shared variables in some ways seem easier and more
natural, but the need for synchronization means that there are lots
of potential pitfalls.  Message passing is less easy/natural but
avoid some of the potential pitfalls.

All of this is of course moot if the threads/processes are totally
independent.  I'm coming to this discussion from a background in
parallel computing, though, where the threads/processes are rarely
*completely* independent.

>This is a moot point in Java, of course, since Java only has
>threads. There Is No Fork().

Um, no, but doesn't the "exec" method in Runtime do something along
those lines?  I have zero experience working with this method; I'm
going by the API and by vague recollections of discussions in this
group.

>...  But if you can do either one and want maximum performance,
>> it seems to me that threads are the way to go.  Do you disagree?
[quoted text clipped - 4 lines]
>to close to 0 over runtime), there should be no difference
>between threads and processes performance-wise.

Probably so.  I wasn't really thinking of totally independent
threads/processes.

Also, if you're using the additional threads/processes to mask
latency, then context switch times need to be taken into account.
Maybe there too the difference is less than I think -- threads
would seem to win out, but perhaps not by much.

>>>you want to use threads for what they're designed: server-type
>>>processes that read data from multiple inputs or perform other
[quoted text clipped - 8 lines]
>Only if you have several execution streams that can run
>asynchronously.

Sure.  I do enough of this multithreading-for-performance (as a
research interest more than for practical reasons) that I'm apt to
forget that not everyone thinks in those terms, maybe.

>Anyway, the real deal is that threads don't economically scale
>past 4 CPUs (4-way being the cusp of bang-per-buck curve) and
[quoted text clipped - 4 lines]
>processes) or with some specialized grid/cluster setup (probably
>different programming techniques altogether).

Traditionally this has been true, yes.  I wonder if it will continue
to be true in the next few years -- dual-core chips are not the
rare beasts they were a few years ago, and one hears so much about
multicore chips in general that I wonder whether in a few years
there won't be a lot more 4-way and 8-way systems ....  But I'm
not a hardware designer, and it may be that there's still no way
to make shared-memory systems scalable.

In any case, it seems to me that arguments about scalability, while
interesting and important, miss the point that if what you have is
a 2-processor or 4-processor system, a program that exploits the
extra processors is a good thing to have, even if it doesn't scale
to more than 4 processors.

Just some thoughts / my two cents' worth.

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
Roedy Green - 12 Feb 2006 05:33 GMT
>I can believe that in current operating systems it's almost as fast
>to create processes as to create threads.

nowhere near as fast, especially if you start several threads using
the same classes. The classes are already loaded so there is little to
do other than allocate some ram for a stack.

With an external process you must load a partly compressed exe file,
and relocate all the addresses in it, and hook up all the DLLs, then
you can think about loading the DLLs.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

blmblm@myrealbox.com - 12 Feb 2006 14:45 GMT
>>I can believe that in current operating systems it's almost as fast
>>to create processes as to create threads.
>
>nowhere near as fast, especially if you start several threads using
>the same classes. The classes are already loaded so there is little to
>do other than allocate some ram for a stack.

Well, what you're saying here is that *for Java programs* it's
a lot faster to start a new thread than a new process, right?
which is more relevant in this newsgroup than a general discussion
of processes versus threads -- but in the post you're replying to,
the discussion had (I thought) become more general.

>With an external process you must load a partly compressed exe file,
>and relocate all the addresses in it, and hook up all the DLLs, then
>you can think about loading the DLLs.

The terminology here is somewhat Windows-specific, but ....  Okay,
okay, similar ideas (exe files and DLLs) exist in other operating
systems.  I'm a little skeptical that everything you describe has
to be done, however:

>load a partly compressed exe file,

I was under the impression that "smart" operating systems can allow
multiple processes running the same code to share a copy in memory.
But maybe that only works if the processes do something special?

>and relocate all the addresses in it,

Relocate addresses?  I thought this was something that had been
necessary in the not-so-good old days, before the advent of dynamic
address translation, but didn't need to be done now?  I may be
misunderstanding what you're saying though, or more confused than
I think about how things work under the hood.

>and hook up all the DLLs, then
>you can think about loading the DLLs.

Which ....  If you're starting a new process to execute the same
code as an existing process, that shouldn't be necessary, right?
Shouldn't the relevant DLLs already be loaded?  Do I not understand
how DLLs work?  (I would have thought that two processes using the
same one could share an in-memory copy.)  And if it's executing new
code, then you have the same issues with starting a new thread, no?

I'm not really sure about all of this, but notice that the person
to whom I replied said this:

>>Threads (as in "lightweight processes") were originally introduced
>>to avoid the overhead of process creation. Threads are not the
[quoted text clipped - 3 lines]
>>(vs. processes) are small, and the overhead is an extra layer of
>>complexity.

So it may depend on the operating system?

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
Roedy Green - 12 Feb 2006 16:27 GMT
>The terminology here is somewhat Windows-specific, but ....  Okay,
>okay, similar ideas (exe files and DLLs) exist in other operating
>systems.  I'm a little skeptical that everything you describe has
>to be done, however:

IIRC one of the old DEC oses RSTS-11?  allowed you to pre-relocate
programs for fast load. It then only had to load them without
processing.

IIRC the ancient Multics OS just memory mapped your program into the
address space to start it.  Instant on.

Univac-1100 has hardware relocation so load was trivial too.

IBM 360 and Windows made program load into a great production.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green - 12 Feb 2006 16:29 GMT
>I was under the impression that "smart" operating systems can allow
>multiple processes running the same code to share a copy in memory.
>But maybe that only works if the processes do something special?

With windows  DLLs  get shared. I don't know about ordinary code. It
would have to load at the same spot each person's address space to be
shared.  Seems to me with NT they made a partial attempt to ensure
that happens.


Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green - 12 Feb 2006 16:31 GMT
>Relocate addresses?  I thought this was something that had been
>necessary in the not-so-good old days, before the advent of dynamic
>address translation, but didn't need to be done now?  I may be
>misunderstanding what you're saying though, or more confused than
>I think about how things work under the hood.

loot at the exe header. All the relocation information is still there.
I think what they do is set them up so that IF the program loads at a
certain location, there is less fixup to do.

Even then there are a zillion hooks to dlls etc to arrange.

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

blmblm@myrealbox.com - 13 Feb 2006 10:30 GMT
>>Relocate addresses?  I thought this was something that had been
>>necessary in the not-so-good old days, before the advent of dynamic
[quoted text clipped - 5 lines]
>I think what they do is set them up so that IF the program loads at a
>certain location, there is less fixup to do.

Seems like with every process having its own address space it usually
*would* be the case that the program loads at the same location (in
the address space) every time?

(Good suggestion about looking at the format of executable files.
Something else I really must do sometime.)

>Even then there are a zillion hooks to dlls etc to arrange.

Yeah ....

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
Roedy Green - 12 Feb 2006 16:32 GMT
>Which ....  If you're starting a new process to execute the same
>code as an existing process, that shouldn't be necessary, right?
>Shouldn't the relevant DLLs already be loaded?

the DLLS may be loaded, but the executable contains only  symbolic
links into the DLL, those all have to be resolved and patched.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

blmblm@myrealbox.com - 13 Feb 2006 10:34 GMT
>>Which ....  If you're starting a new process to execute the same
>>code as an existing process, that shouldn't be necessary, right?
>>Shouldn't the relevant DLLs already be loaded?
>
>the DLLS may be loaded, but the executable contains only  symbolic
>links into the DLL, those all have to be resolved and patched.

Well, you did say, in the post to which I was responding:

>>>then you can think about loading the DLLs.

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
Roedy Green - 12 Feb 2006 16:33 GMT
> Do I not understand
>how DLLs work?  (I would have thought that two processes using the
>same one could share an in-memory copy.)  And if it's executing new
>code, then you have the same issues with starting a new thread, no?

DLLs used to share code and memory. Now they just share code.
You will notice how java.exe is more spritely if you have recently
used it. That is because its DLLs are still hanging around in RAM and
don't have to be reloaded.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

blmblm@myrealbox.com - 13 Feb 2006 10:40 GMT
>> Do I not understand
>>how DLLs work?  (I would have thought that two processes using the
[quoted text clipped - 4 lines]
>You will notice how java.exe is more spritely if you have recently
>used it.

The name of the executable I use is java, not java.exe.  (Said with
sort of a :-), because from another of your posts it's apparent
that you do know, at least as well as I do, that there are many
operating systems.)

>That is because its DLLs are still hanging around in RAM and
>don't have to be reloaded.

But, but, didn't you just say DLLs just share code, not memory ....

| B. L. Massingill
| ObDisclaimer:  I don't speak for my employers; they return the favor.
Roedy Green - 13 Feb 2006 14:37 GMT
>But, but, didn't you just say DLLs just share code, not memory ....

In the old days there was no easy mechanism for each user of a dll to
have his own instance of the data. The DLL had to manage its pool of
users allocating each some workspace from its own pool.  Today
separate instances in the default  I believe each client of the DLL
addresses its data at the same virtual address. Presumably there are
features to share read-only or read-write areas of RAM too. Those are
common OS features.

I don't know Just how clever has Sun been about avoiding literally
loading all the standard class files, jitting them and optimising them
on every java.exe launch.   I would hope they have some scheme to
prebuild the standard class set so that the digested classes looks to
the OS like read-only code, or a hunk of memory mapped data, or a hunk
or read-only memory mapped data or ...

Back in 1985 I invented a scheme called Gespenstering for capturing
ram images of a program in flight and turning them into a relocatable
executable snapshot.  That saved a huge amount of complicated
initialisation, similar to loading class files,  in my Forth/Abundance
language.   I did this for DOS. Presumably you could do something
similar for Windows.  I discuss this at
http://mindprod.com/projects/gespenster.html

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Gordon Beaton - 12 Feb 2006 15:35 GMT
> nowhere near as fast, especially if you start several threads using
> the same classes. The classes are already loaded so there is little
> to do other than allocate some ram for a stack.

On Linux, the kernel makes no distinction between threads and
processes, so they are identically fast to create and schedule. This
has nothing to do with classes.

> With an external process you must load a partly compressed exe file,
> and relocate all the addresses in it, and hook up all the DLLs, then
> you can think about loading the DLLs.

On Unix (and Linux), common shared libraries (most notably libc) will
already be loaded before the process starts, so they don't need to be
loaded again for every process. Also the kernel normally caches
recently used code, so it is faster to load the same program a second
time.

/gordon

Signature

[  do not email me copies of your followups  ]
g o r d o n + n e w s @  b a l d e r 1 3 . s e

Roedy Green - 12 Feb 2006 16:34 GMT
>On Linux, the kernel makes no distinction between threads and
>processes, so they are identically fast to create and schedule. This
>has nothing to do with classes.

Perhaps we are using a term differently.  by Thread I presume shared
address space. By process I presume a new address space, e.g. via
exec.

What do you mean by the terms?
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Gordon Beaton - 12 Feb 2006 17:06 GMT
> Perhaps we are using a term differently. by Thread I presume shared
> address space. By process I presume a new address space, e.g. via
> exec.
>
> What do you mean by the terms?

On Linux, the kernel only knows about "tasks". They can share none,
part, or all of their address space with one another. When they don't
share any address space, they look like traditional processes. When
they share all of their address space, they look like threads (even
though they each task has its own pid). But the kernel does not treat
them any differently, so it cannot be claimed that one is faster or
cheaper to create than the other.

It is this lack of distinction that previously confused many people
who did not understand why "ps" often showed so multiple Java
processes on Linux.

/gordon

Signature

[  do not email me copies of your followups  ]
g o r d o n + n e w s @  b a l d e r 1 3 . s e

Roedy Green - 13 Feb 2006 09:56 GMT
>On Linux, the kernel only knows about "tasks". They can share none,
>part, or all of their address space with one another. When they don't
[quoted text clipped - 7 lines]
>who did not understand why "ps" often showed so multiple Java
>processes on Linux.

I presume it tracks task groups so you can kill all related tasks at
once.

But still it takes a heck of a lot longer to start a task from a raw
executable than from code already in RAM. The task may look the same
after they are born, but the threads have a much easier birth. Perhaps
Linux conceptually  separates out loading from starting the task. If
that is so then indeed tasks and threads could look identical, even in
the way they are started.

What we should be aiming for is MOST of the time, to start a program
all you do is map its code into virtual memory and start executing,
causing an immediate page fault.  If something happens to disturb the
memory maps, then they should be automatically rebuilt, perhaps in the
background when the system in idle, not built from scratch on every
execution. You could get very close to the idea that programs never
stop, just sleep.

see http://mindprod.com/projects/gespe