Java Forum / General / February 2006
Multithreading / Scalability
Philipp Kayser - 05 Feb 2006 18:28 GMT Hi,
I wrote a small program to test scalability in a multiprocessor environment (in my case an Athlon 64 X2). I included the source below.
To my surprise the calculation does not run faster if I use 2 threads (which should be the case if I have 2 processors) but it runs 5 times slower (e.g. 15s instead of 3s)! Everything else seems to be okay: If i use 1 thread, I have a total CPU usage of 50%, if use two threads I get 100%. The best thing is: if I limit the JVM to one processor by setting the CPU affinity for the process, but again take 2 threads, the calculation runs only 2 times slower (6s instead of 3s for one thread).
My current current diagnosis is that it may have something to do with the CPU cache. I searched the Internet for similar problems and found the terms "CPU Cache trashing" and "Ping-Pong-Effect": the processors always switch between the two threads and by doing so their CPU cache gets flushed every time.
Does anyone have an idea?
Best regards, Philipp Kayser.
public class Test { private final int number_of_threads = 2; private static int number_of_finished_threads; private Thread threads[] = new Thread[number_of_threads]; double result[] = new double[number_of_threads]; private Thread main_thread;
private class CalculationThread extends Thread { int thread_number;
CalculationThread(int n) { super(); thread_number = n; }
public void run() { try { synchronized (this) { while (true) { wait();
int n = 0;
for (n = 0; n < 600000000; n++) if (n % number_of_threads == thread_number) result[thread_number] += Math.sqrt(n);
synchronized (main_thread) { number_of_finished_threads++; main_thread.notify(); } } } } catch (InterruptedException e) { } } }
private void multithreaded_calculation() { synchronized (main_thread) { number_of_finished_threads = 0;
int i; for (i = 0; i < number_of_threads; i++) { synchronized (threads[i]) { threads[i].notify(); } }
do { try { main_thread.wait(); } catch (InterruptedException e) { } } while (number_of_finished_threads < number_of_threads);
double total_result = 0; for (i = 0; i < number_of_threads; i++) total_result += result[i];
System.out.println(total_result); } }
private void test() { main_thread = Thread.currentThread();
int i; for (i = 0; i < number_of_threads; i++) { threads[i] = new CalculationThread(i); threads[i].setPriority(Thread.NORM_PRIORITY); threads[i].setDaemon(true); threads[i].start(); }
try { Thread.sleep(1000); } catch (InterruptedException e) { }
long t0 = new Date().getTime();
multithreaded_calculation();
long t1 = new Date().getTime();
System.out.println(((double)t1 - t0)/1000); }
public static void main(String[] args) { new Test().test(); } }
Thomas Hawtin - 05 Feb 2006 20:20 GMT > I wrote a small program to test scalability in a multiprocessor > environment (in my case an Athlon 64 X2). I included the source below. [quoted text clipped - 7 lines] > CPU affinity for the process, but again take 2 threads, the calculation > runs only 2 times slower (6s instead of 3s for one thread).
> My current current diagnosis is that it may have something to do with > the CPU cache. I searched the Internet for similar problems and found > the terms "CPU Cache trashing" and "Ping-Pong-Effect": the processors > always switch between the two threads and by doing so their CPU cache > gets flushed every time. Highly unlikely. Your program actively uses very little memory, so it isn't going to be a problem with cache capacity.
> private class CalculationThread extends Thread It is rarely a good idea to override Thread. There is no need for inheritance.
> public void run() > { > try > { > synchronized (this) > { Using a complex object like Thread as a lock can be a bad idea. Perhaps the class itself uses it as a lock. In the case of Thread, Sun JRE does.
> while (true) > { > wait(); When they say always put wait in a while loop, don't do it like this. You are not guaranteed that the wait wont wake up early. Check a condition in an enclosing while loop.
> int n = 0; > > for (n = 0; n < 600000000; n++) > if (n % number_of_threads == thread_number) > result[thread_number] += Math.sqrt(n); Although you would expect the square root to dominate, this loop is simpler if number_of_threads is one. Exactly how it will get optimised, I have no idea.
Some multicore chips share FPUs. Niagra does, but I don't think Athlons do. Such code is obviously not going to work so well on shared FPU processors.
> threads[i].setDaemon(true); Making the threads daemon only masks your bug above, where threads do not exit.
> threads[i].start(); > } [quoted text clipped - 6 lines] > { > } Looks odd. I guess you don't want to include the thread start up time. The other issue is that you have a potential deadlock.
> long t0 = new Date().getTime(); System.currentTimeMillis() is more conventional. Or System.nanoTime() from 1.5.
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Philipp Kayser - 05 Feb 2006 21:46 GMT Hi Thomas,
thanks for your response.
>> private class CalculationThread extends Thread > It is rarely a good idea to override Thread. There is no need for > inheritance. > ... > Using a complex object like Thread as a lock can be a bad idea. Perhaps > the class itself uses it as a lock. In the case of Thread, Sun JRE does. I changed class CalculationThread to be of type Runnable and also changed main_thread to be of type Object. So i removed all locks on Thread objects. Same result as before.
>> for (n = 0; n < 600000000; n++) >> if (n % number_of_threads == thread_number) >> result[thread_number] += Math.sqrt(n); > Although you would expect the square root to dominate, this loop is > simpler if number_of_threads is one. Exactly how it will get optimised, > I have no idea. I also considered this. I changed the loop to:
for (n = 0; n < 30000; n++) if (n % number_of_threads == thread_number) { double a = 2; for (int p = 0; p < n; p++) { a += Math.sqrt(a); result[thread_number] += a; } }
I dont think the inner loop can be optimized. The result changed a bit:
2 CPUs / 1 Threads : 8.75s 2 CPUs / 2 Threads : 9.35s
Although the result is better than before, 2 CPUs are again slower than 1 CPU. I would expect 2 CPUs too be nearly 2x faster than 1 CPU in such a simple algorithm.
Best regards, Philipp Kayser.
Daniel Dyer - 05 Feb 2006 22:12 GMT > I dont think the inner loop can be optimized. The result changed a bit: > [quoted text clipped - 7 lines] > Best regards, > Philipp Kayser. Are you running your app with the -server option? Is there any significant difference between performance with the server and client VMs?
Dan.
 Signature Daniel Dyer http://www.dandyer.co.uk
Philipp Kayser - 05 Feb 2006 22:56 GMT Hi,
> Are you running your app with the -server option? Is there any > significant difference between performance with the server and client VMs? I think there is no server JVM for Windows. The "-server" option gives me an error.
Best regards, Philipp.
Daniel Dyer - 05 Feb 2006 23:51 GMT > Hi, > [quoted text clipped - 3 lines] > I think there is no server JVM for Windows. The "-server" option gives > me an error. For some reason the server VM is only included with the JDK, not the JRE. So if you use the JRE java.exe it will give you an error with "-server".
Dan.
 Signature Daniel Dyer http://www.dandyer.co.uk
Philipp Kayser - 06 Feb 2006 20:28 GMT Hi,
> For some reason the server VM is only included with the JDK, not the > JRE. So if you use the JRE java.exe it will give you an error with > "-server". Ok, this works. But I noticed no better performance for the "false-sharing" problem. However, the server JVM seems to be significant faster (about 25%) than the client JVM. Thanks for this hint.
Best regards, Philipp.
Roedy Green - 06 Feb 2006 00:51 GMT >I think there is no server JVM for Windows. The "-server" option gives >me an error. You have to use the java.exe in the JDK not the JRE. It understands -server. I have been using it for the great signum bakeoff.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Philipp Kayser - 05 Feb 2006 22:14 GMT Hi again,
I think I found the problem. I changed the loop again to:
for (n = 0; n < 10000; n++) if (n % number_of_threads == thread_number) { double a = 2; for (int p = 0; p < n; p++) { a += Math.sin(a); } result[thread_number] += a; }
Now I get satisfying resuls:
2 CPUs / 1 Thread : 10.7s 2 CPUs / 2 Threads : 5.469s
I think the problem is the read/write-access to the result-array. If the result-array is changed on one CPU, the cache on the second CPU gets invalid. By pulling out the result-assignment of the inner loop, the amount of cache invalidations are reduced greatly.
Best regards, Philipp Kayser.
blmblm@myrealbox.com - 05 Feb 2006 22:24 GMT >Hi again, > [quoted text clipped - 20 lines] >invalid. By pulling out the result-assignment of the inner loop, the >amount of cache invalidations are reduced greatly. "False sharing"! wish I'd noticed this post before composing my reply of a few minutes ago.
But the comment (in the other post) about improving the loop might still be worthwhile.
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. blmblm@myrealbox.com - 06 Feb 2006 09:25 GMT >>Hi again, >> [quoted text clipped - 23 lines] >"False sharing"! wish I'd noticed this post before composing my reply >of a few minutes ago. Following up a little, after doing some experiments on a four-processor machine at work ....:
My idea for getting rid of the "false sharing" cache problem was to do the original calculation summing into a local variable, and then add into result[thread_number] at the end. That should produce similar results to what you're doing above ....
And it does help -- performance changes from "the more threads, the slower the program" to reasonable speedups with 2 and 4 threads, compared to 1.
But ....
>But the comment (in the other post) about improving the loop might >still be worthwhile. By accident I discovered that making this change to the original calculation, *instead of* making the switch to summing into a local variable *actually produces a faster program*, with good speedups for 2 and 4 threads.
I don't understand this at all. Maybe I've made some stupid blunder. Otherwise -- hm, I don't know!
Below is my modified version of your code. I did make some other changes -- simplified to have the main thread use "join" to wait for the calculation threads (did you know you could do that?) and to move all thread activity into the timed part of the code so I don't have to do the slightly complicated stuff to wait until all threads are started before starting the timed part ...
public class Test { private static int number_of_threads; private Thread threads[] = new Thread[number_of_threads]; double result[] = new double[number_of_threads];
private class CalculationThread implements Runnable { int thread_number;
CalculationThread(int n) { thread_number = n; }
public void run() { /* double local_result = 0.0; for (int n = 0; n < 600000000; n++) if (n % number_of_threads == thread_number) local_result += Math.sqrt(n); result[thread_number] += local_result; */ // this actually is faster! ???? for (int n = thread_number; n < 600000000; n+=number_of_threads) result[thread_number] += Math.sqrt(n); } }
private void multithreaded_calculation() { for (int i = 0; i < number_of_threads; i++) { threads[i] = new Thread(new CalculationThread(i)); //threads[i].setPriority(Thread.NORM_PRIORITY); //threads[i].setDaemon(true); threads[i].start(); }
try { for (int i = 0; i < number_of_threads; i++) threads[i].join(); } catch (InterruptedException e) { }
double total_result = 0; for (int i = 0; i < number_of_threads; i++) total_result += result[i];
System.out.println(total_result); }
private void test() { long t0 = System.currentTimeMillis();
multithreaded_calculation();
long t1 = System.currentTimeMillis();
System.out.println(((double)t1 - t0)/1000); }
public static void main(String[] args) { number_of_threads = Integer.parseInt(args[0]); new Test().test(); } }
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. Philipp Kayser - 06 Feb 2006 10:13 GMT Hi,
> My idea for getting rid of the "false sharing" cache problem was > to do the original calculation summing into a local variable, [quoted text clipped - 4 lines] > slower the program" to reasonable speedups with 2 and 4 threads, > compared to 1. Yes, I can duplicate this results. The two Thread stacks (where local variables are being held) seem to be far enough away from each other to avoid the "false sharing"-problem.
> But .... > By accident I discovered that making this change to the original [quoted text clipped - 3 lines] > I don't understand this at all. Maybe I've made some stupid > blunder. Otherwise -- hm, I don't know! I can also duplicate this result. And I also don't have an explanation for it. Maybe it has to do with code optimization because of the now simpler code: the JVM holds "result[thread_number]" in a machine register and later writes the value back to memory when the calculation has finished.
> Below is my modified version of your code. I did make some other > changes -- simplified to have the main thread use "join" to wait > for the calculation threads (did you know you could do that?)... My intention was to avoid unnecessary thread creations. In this case it is no problem because the threads are only started once. But I am currently working on parallelising a bigger program, where this "multithreaded_calculation" is started over and over again. But maybe thread creation is not a big topic for an OS so this optimization is not really needed. To wait 1s for the threads to reach the wait() is very hacky I know, but I found no better solution (except of the join-solution).
Best regards, Philipp.
blmblm@myrealbox.com - 06 Feb 2006 11:11 GMT [ snip ]
>> But .... >> By accident I discovered that making this change to the original [quoted text clipped - 8 lines] >register and later writes the value back to memory when the calculation >has finished. That's a very plausible explanation -- it's maybe a little puzzling that this optimization wouldn't have also been done with the original version, but .... <shrug>.
>> Below is my modified version of your code. I did make some other >> changes -- simplified to have the main thread use "join" to wait [quoted text clipped - 7 lines] >To wait 1s for the threads to reach the wait() is very hacky I know, >but I found no better solution (except of the join-solution). There's some nice stuff in the java.util.concurrent package, added in Java 1.5 (5.0), for creating "thread pools", which sounds like what you want.
Doing some quick Googling, here's a tutorial that seems reasonable, with some examples:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/concurr encyTools.html
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. blmblm@myrealbox.com - 05 Feb 2006 22:22 GMT >> I wrote a small program to test scalability in a multiprocessor >> environment (in my case an Athlon 64 X2). I included the source below. [quoted text clipped - 16 lines] >Highly unlikely. Your program actively uses very little memory, so it >isn't going to be a problem with cache capacity. But there might be a problem with something called "false sharing", which happens when different threads access variables that are close together in memory -- as happens here with the "results" array.
I'm not sure I want to try to explain it (don't have time and might get it wrong anyway), but a Google search for "cache" and "false sharing" brings up some useful links.
[ snip ]
>> for (n = 0; n < 600000000; n++) >> if (n % number_of_threads == thread_number) You could simplify this to
for (n = number_of_threads; n < 600000000; n+=number_of_threads)
which I'd think would help a little.
[ snip ]
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. Roedy Green - 06 Feb 2006 00:49 GMT On Sun, 05 Feb 2006 20:33:26 +0000, Thomas Hawtin <usenet@tackline.plus.com> wrote, quoted or indirectly quoted someone who said :
>Some multicore chips share FPUs. Niagra does, but I don't think Athlons >do. Such code is obviously not going to work so well on shared FPU >processors. Is your Athlon a hyperthread or a true dual cpu or something between?
For cpu bound stuff, hyperthreading won't help since you have no true extra computing horsepower.
We are going to need some new sort of number to measure CPU speed now that raw clock speed or even aggregate MIPs has become meaningless.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Knute Johnson - 06 Feb 2006 06:06 GMT > Hi, > [quoted text clipped - 138 lines] > } > } Saw your post and thought it was interesting. I too duplicated your results but I don't have a clue as to what the problem is. I wrote a couple of tests of my own to try and got more predictable results. Although not quite.
The first program below times the calculations as yours did but I think it is still not optimum for this test.
The second program calculates the number of calculation cycles per ms which I think is more to the point. I got very similar results to the first test though and that is that one thread is definitely slower than two threads but more than that don't really improve performance. On my single processor machine, every increase in threads reduced the number of calculations that could be performed although again not as dramatically as I expected with the increase in number of threads.
I tested these on my 1.6Ghz P4 and on a dual Xeon 2.8Ghz. The Xeons are dual core and as I added threads I could see the additional cores being used. Another interesting thing, if I used one thread the processor usage showed 25%, two threads 50% and so on. The P4 is running Windows XP Home and the Xeon is running 32bit Windows XP Pro.
I do think that the comment about the floating point processor could be significant too.
Anyway a very interesting post.
public class test2 implements Runnable { int numberOfThreads; int calculations = 100000000; Thread[] thread;
public test2(String[] args) { numberOfThreads = Integer.parseInt(args[0]); thread = new Thread[numberOfThreads];
long then = System.currentTimeMillis();
for (int i=0; i<numberOfThreads; i++) { thread[i] = new Thread(this); thread[i].start(); }
try { for (int i=0; i<numberOfThreads; i++) thread[i].join(); } catch (InterruptedException ie) { ie.printStackTrace(); }
System.out.println(System.currentTimeMillis() - then); }
public void run() { double d; int n = calculations / numberOfThreads; for (int i=1; i<=n; i++) { d = Math.sqrt(i*1.234); double t = d / i; } }
public static void main(String[] args) { new test2(args); } }
import java.util.concurrent.*;
public class test3 implements Runnable { volatile long calculations; volatile boolean runFlag = true; Object o = new Object(); Semaphore sem;
public test3(String[] args) { int numberOfThreads = Integer.parseInt(args[0]); Thread[] thread = new Thread[numberOfThreads]; sem = new Semaphore(numberOfThreads); try { sem.acquire(numberOfThreads);
for (int i=0; i<numberOfThreads; i++) { thread[i] = new Thread(this); thread[i].start(); } Thread.sleep(2000); // wait till all threads are created
long then = System.currentTimeMillis(); sem.release(numberOfThreads); Thread.sleep(20000); runFlag = false; long now = System.currentTimeMillis();
System.out.println(calculations/(now-then)); } catch (InterruptedException ie) { ie.printStackTrace(); } }
public void run() { try { sem.acquire(); while (runFlag) { double d = Math.sqrt(1234.56789); double t = Math.tan(d); ++calculations; } } catch (InterruptedException ie) { ie.printStackTrace(); } }
public static void main(String[] args) { new test3(args); } }
 Signature Knute Johnson email s/nospam/knute/
blmblm@myrealbox.com - 06 Feb 2006 09:43 GMT [ snip ]
>I wrote a >couple of tests of my own to try and got more predictable results. [quoted text clipped - 10 lines] >of calculations that could be performed although again not as >dramatically as I expected with the increase in number of threads. [ snip ]
Interesting second test program (test3 below). Some questions, though:
Why do you have this?
Thread.sleep(2000); // wait till all threads are created
This seems like an ugly hack to avoid writing proper code to wait until the threads are created? ugly hacks are not terrible in quick-and-dirty code, but still.
Also, "calculations" is shared among threads, but you're not ensuring one-at-a-time access with "synchronized" or some other mechanism. Why do you think this will work? You do declare it "volatile", but as I understand it, this only ensures atomic loads and stores, while you also have a "++calculations". I would have said this was not guaranteed to be atomic on all processors. No?
>import java.util.concurrent.*; > [quoted text clipped - 46 lines] > } >}
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. Knute Johnson - 06 Feb 2006 16:46 GMT > Interesting second test program (test3 below). Some questions, though: > [quoted text clipped - 5 lines] > until the threads are created? ugly hacks are not terrible in > quick-and-dirty code, but still. Instead of the answer I want to give (which is bite me) I'll tell you why that is in there. I wanted to make sure that I wasn't using up a bunch of time in the thread creation. Turns out it doesn't really make any significant difference. And I was in a hurry and I know it was an ugly hack and it was simple(ton) :-).
> Also, "calculations" is shared among threads, but you're not ensuring > one-at-a-time access with "synchronized" or some other mechanism. > Why do you think this will work? You do declare it "volatile", > but as I understand it, this only ensures atomic loads and stores, > while you also have a "++calculations". I would have said this > was not guaranteed to be atomic on all processors. No? The Java Language Specification, Third Edition 17.4.4 Synchronization Order Every execution has a synchronization order. A synchronization order is a total order over all of the synchronization actions of an execution. For each thread t, the synchronization order of the synchronization actions (§17.4.2) in t is consistent with the program order (§17.4.3) of t.
Synchronization actions induce the synchronized-with relation on actions, defined as follows:
... A write to a volatile variable (§8.3.1.4) v synchronizes-with all subsequent reads of v by any thread (where subsequent is defined according to the synchronization order).
Looks synchronized to me.
 Signature Knute Johnson email s/nospam/knute/
Thomas Hawtin - 06 Feb 2006 17:39 GMT > A write to a volatile variable (§8.3.1.4) v synchronizes-with all > subsequent reads of v by any thread (where subsequent is defined > according to the synchronization order). > > Looks synchronized to me. But increment has a read and a separate write. Read the value. Increment. Write the new value.
Best to use java.util.concurrent.AtomicLong (from 1.5) for these sort of things.
http://download.java.net/jdk6/docs/api/java/util/concurrent/atomic/AtomicLong.ht ml#incrementAndGet()
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Knute Johnson - 06 Feb 2006 21:29 GMT >> A write to a volatile variable (§8.3.1.4) v synchronizes-with all >> subsequent reads of v by any thread (where subsequent is defined [quoted text clipped - 11 lines] > > Tom Hawtin Tom:
Yes, I see what you are saying. I'll rerun those tests with it synchronized and see what happens. If they were in fact passing between the read and write then the results would show lower performance for more threads.
 Signature Knute Johnson email s/nospam/knute/
Knute Johnson - 07 Feb 2006 00:35 GMT I tried synchronizing the access to the calculations variable but that made it run very poorly, it was interrupting the calculations. So I rewrote it in a different way and discovered some more interesting things. First this program is faster even on a single processor machine when more threads are used. I don't think that my XP Home can make a 100 threads but if I watch on the dual Xeon machine, it shows a 100 threads on the Task Manager.
I would be curious to know if anybody else duplicates my results. And if they have an idea why it would do more calculations on a single processor machine with more threads.
1.6Ghz XP Home SP2
Threads Calculations per ms 1 1155 2 1326 3 1367 4 1401 10 1444
2.8Ghz dual Xeon XP Pro SP2
Threads Calculations per ms 1 2402 2 4650 3 6141 4 7187 10 7608 100 8224 200 8281
import java.util.concurrent.*;
public class test3 implements Runnable { public static volatile boolean runFlag = true; public long calculations;
public test3() { new Thread(this).start(); }
public void run() { while (runFlag) { double d = Math.sqrt(1234.56789); double t = Math.tan(d); ++calculations; } }
public static void main(String[] args) { long totalCalcs = 0; int numberOfThreads = Integer.parseInt(args[0]); test3[] tests = new test3[numberOfThreads];
long then = System.currentTimeMillis(); for (int i=0; i<numberOfThreads; i++) { tests[i] = new test3(); } try { Thread.sleep(10000); } catch (InterruptedException ie) { System.out.println(ie); } tests[0].runFlag = false; long now = System.currentTimeMillis();
for (int i=0; i<numberOfThreads; i++) totalCalcs += tests[i].calculations;
System.out.println(totalCalcs/(now-then)); } }
 Signature Knute Johnson email s/nospam/knute/
Alex Buell - 07 Feb 2006 01:00 GMT On Mon, 06 Feb 2006 16:35:31 -0800 Knute Johnson <nospam@ljr-2.frazmtn.com> waved a wand and this message magically appeared:
> 1.6Ghz XP Home SP2 > [quoted text clipped - 4 lines] > 4 1401 > 10 1444 1.13Ghz Pentium III, Linux 2.6.15
1 2442 2 2433 3 2430 4 2409 10 2288
Is that useful for you?
 Signature http://www.munted.org.uk
"Honestly, what can I possibly say to get you into my bed?" - Anon.
Knute Johnson - 07 Feb 2006 22:52 GMT > On Mon, 06 Feb 2006 16:35:31 -0800 Knute Johnson > <nospam@ljr-2.frazmtn.com> waved a wand and this message magically [quoted text clipped - 18 lines] > > Is that useful for you? Thanks. That's more like what I expected.
 Signature Knute Johnson email s/nospam/knute/
Thomas Hawtin - 07 Feb 2006 02:25 GMT > I would be curious to know if anybody else duplicates my results. And > if they have an idea why it would do more calculations on a single > processor machine with more threads. > > 1.6Ghz XP Home SP2 I can't reproduce that result. On both a 2.26 GHz Celeron D and 266 MHz PII, running Fedora Core 4. Odd. Perhaps there is another program eating cycles and Windows allocates more to the program with most threads...
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Philipp Kayser - 07 Feb 2006 09:10 GMT Hi,
> I would be curious to know if anybody else duplicates my results. And > if they have an idea why it would do more calculations on a single > processor machine with more threads. I can reproduce the results:
AMD 64 X2 3800+ XP Pro SP2
Threads Calculations per ms 1 3650 2 6784 3 6780 4 6800 10 7010
It is interesting that I do not get 2x3650=7300c/ms for 2 Threads. Maybe there are still some conflicts here due to "false-sharing" between some Thread stacks and/or Thread objects. And when there are more threads, the probability for cache invalidation is reduced -> more calculations per second.
Best regards, Philipp.
Philipp Kayser - 07 Feb 2006 09:25 GMT Addition to my previous post:
AMD 64 X2 3800+ XP Pro SP2, Affinity to 1. CPU
Threads Calculations per ms 1 3660 2 3708 3 3717 4 3771 10 3775
There is also an increase here, but not as big as in your test. Maybe because of the massive Thread creation: the OS cannot hold the Sleep(10000), but instead holds 11s or so.
Best regards, Philipp.
Knute Johnson - 07 Feb 2006 23:02 GMT > Addition to my previous post: > [quoted text clipped - 13 lines] > Best regards, > Philipp. Those are both interesting, thanks. That AMD looks like it zips right along too.
 Signature Knute Johnson email s/nospam/knute/
Luc The Perverse - 07 Feb 2006 11:26 GMT > Hi, > [quoted text clipped - 18 lines] > the probability for cache invalidation is reduced -> more calculations > per second. I will tell you this. If ANYTHING is being swapped to memory then your two cores are going to have to share a memory controller/and sharing is always slower than running alone.
I take it you don't expect this to be happening, but if you are running XP PRO then windows will force your process into the background to do OS related stuff, even if just briefly. This could easily account for the slowing.
-- LTP
:) Roedy Green - 07 Feb 2006 12:46 GMT On Tue, 7 Feb 2006 04:26:15 -0700, "Luc The Perverse" <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly quoted someone who said :
>I will tell you this. If ANYTHING is being swapped to memory then your two >cores are going to have to share a memory controller/and sharing is always >slower than running alone. Do this multicore chips, do they typically share the same SRAM cache or are they busily worrying about stale copies in each other's caches?
Do they share a port to the SRAM or is the SRAM multiported?
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Luc The Perverse - 07 Feb 2006 23:46 GMT > On Tue, 7 Feb 2006 04:26:15 -0700, "Luc The Perverse" > <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly [quoted text clipped - 9 lines] > > Do they share a port to the SRAM or is the SRAM multiported? According to a randomly selected product information page they appear to each have completely seperate L1 and L2 caches. I think it is truly two chips on one die, with an integrated memory controller independant of the two.
Although I will admit that I don't really understand your last sentence.
 Signature LTP
:) Roedy Green - 08 Feb 2006 07:55 GMT On Tue, 7 Feb 2006 16:46:25 -0700, "Luc The Perverse" <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly quoted someone who said :
>Although I will admit that I don't really understand your last sentence. In the olden days Univac and Burroughs mainframes had multiple CPUs, a novelty at the time. You could pay extra for multiported "core" (which might have been iron cores, plated wire or RAM) so that each CPU could access core without waiting for the others. The feature was called "multiported memory". I wondered if the equivalent existed today in the access to SRAM and RAM.
I believe IBM mainframes also had a form of multiported memory so that I/O channels had a back door into memory that did not block the CPU.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Chris Uppal - 07 Feb 2006 09:36 GMT > I would be curious to know if anybody else duplicates my results. And > if they have an idea why it would do more calculations on a single [quoted text clipped - 8 lines] > 4 1401 > 10 1444 Odd. I find the same thing (I changed your code to create the threads outside the timed section, but to start() them inside it):
WinXP Pro, sp1. 1.5 GHz uniprocessor. JDK 1.5.0-b64: 1 2794 2 2816 3 2841 4 2834 10 2845
Win2K, sp4. 2.5 GHz uniprocessor. JDK 1..5.0_05-b05: 1 2078 2 2107 3 2104 4 2115 10 2123
On same Win2K box as above but using VMWare to run SuSE 9 + JDK 1.4.2-b28: 1 1715 2 1705 3 1783 4 1815 10 2029
Note that the last set of figures show that it cannot be an artefact of Windows scheduling since Linux processes running inside VMWare are not visible to the Windows scheduler, nor is memory use, etc.
I don't have an explanation (yet?) I rather suspect that at least part of it is to do with when/how the JIT gets around to doing optimisation. I'll probably investigate that later.
-- chris
Chris Uppal - 07 Feb 2006 10:15 GMT I wrote:
> I don't have an explanation (yet?) I rather suspect that at least part > of it is to do with when/how the JIT gets around to doing optimisation. > I'll probably investigate that later. Me, back again sooner than I'd planned.
First off, I realised that my modified code accidentally left the old constructor in, and so started a new thread in each constructor /as well/ as the ones I was using for testing -- stupid. Still, it doesn't seem to have affected the results significantly.
However, putting a loop around the test loop, to allow the JITer more time to do its thing, produces the following (all on the same 1.5 GHz WinXP box as before):
Threads Run1 Run2 .... 1 2757 2906 2910 2919 2907 2 2806 2907 2892 2901 2904 3 2815 2907 2909 2909 2905 4 2826 2902 2908 2903 2913 10 2848 2910 2911 2909 2915
As you'll see, by the third time around the loop, the apparent corellation between the calculations/sec and the number of threads had vanished into the noise. So I think the effect must be purely a question of how early the JITer kicks in.
-- chris
Roedy Green - 07 Feb 2006 12:38 GMT On Tue, 7 Feb 2006 10:15:50 -0000, "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly quoted someone who said :
>As you'll see, by the third time around the loop, the apparent corellation >between the calculations/sec and the number of threads had vanished into the >noise. So I think the effect must be purely a question of how early the JITer >kicks in. See http://mindprod.com/jgloss/benchmark.html
I did a warm up "conformance test" partly whose job was to ensure the jit optimisations had been done before I starting timing anything.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Gordon Beaton - 07 Feb 2006 10:37 GMT > I would be curious to know if anybody else duplicates my results. > And if they have an idea why it would do more calculations on a > single processor machine with more threads. I don't see that here, these curves are *flat*:
n A B C 1 2983 2313 361 2 2983 2307 356 3 2966 2338 348 4 3008 2332 355 5 2992 2337 351 6 2989 2327 339 7 3002 2329 356 8 2986 2327 358 9 2983 2336 356 10 2976 2334 350 100 2983 2330 358 A: 3.2 GHz Pentium, jdk 1.5.0, Fedora Linux 2 B: 1.8 GHz Opteron, jdk 1.4.2, Fedora Linux 3 C: 140 MHz UltraSparc, Blackdown Java 1.4.1 beta, Aurora Linux 1.0
Interesting to note is that the Pentium manages only 8 times the work, despite a nearly 23x "speed" advantage over the Ultra 1!
/gordon
 Signature [ do not email me copies of your followups ] g o r d o n + n e w s @ b a l d e r 1 3 . s e
Chris Smith - 07 Feb 2006 16:28 GMT > A: 3.2 GHz Pentium, jdk 1.5.0, Fedora Linux 2 > B: 1.8 GHz Opteron, jdk 1.4.2, Fedora Linux 3 > C: 140 MHz UltraSparc, Blackdown Java 1.4.1 beta, Aurora Linux 1.0 > > Interesting to note is that the Pentium manages only 8 times the work, > despite a nearly 23x "speed" advantage over the Ultra 1! Not too surprising, though. Clock speed wars in modern day are basically marketing nonsense. Performance improvements come mainly from cache architectures, pipelining strategies, etc. Since average computer users (and even average high-tech gamers) don't know enough to compare those things, manufacturers keep up the charade by bumping up the clock speeds, as well. In truth, the CPU spends most of those blazingly fast clock cycles waiting on RAM.
 Signature www.designacourse.com The Easiest Way To Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer MindIQ Corporation
Knute Johnson - 07 Feb 2006 22:57 GMT >> I would be curious to know if anybody else duplicates my results. >> And if they have an idea why it would do more calculations on a [quoted text clipped - 23 lines] > > /gordon I was surprised that my 2.8Ghz Xeon running the test with only one thread wasn't that much faster than the 1.6Ghz P4.
 Signature Knute Johnson email s/nospam/knute/
blmblm@myrealbox.com - 07 Feb 2006 10:28 GMT >> A write to a volatile variable (§8.3.1.4) v synchronizes-with all >> subsequent reads of v by any thread (where subsequent is defined [quoted text clipped - 4 lines] >But increment has a read and a separate write. Read the value. >Increment. Write the new value. Exactly the point I was trying to make earlier. Thanks.
>Best to use java.util.concurrent.AtomicLong (from 1.5) for these sort of >things. > >http://download.java.net/jdk6/docs/api/java/util/concurrent/atomic/AtomicLong.ht ml#incrementAndGet()
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. blmblm@myrealbox.com - 07 Feb 2006 10:23 GMT >> Interesting second test program (test3 below). Some questions, though: >> [quoted text clipped - 8 lines] >Instead of the answer I want to give (which is bite me) I'll tell you >why that is in there. Well, I was in too much of a hurry to send that post out, and didn't phrase the question as diplomatically as I might have. Sorry about that. The question I wanted to ask was whether there was some subtle reason for the apparently ugly hack (one that I wasn't getting), or whether it was just a quick-and-dirty thing. NTTAWWT, in some contexts, and this is one of them.
>I wanted to make sure that I wasn't using up a >bunch of time in the thread creation. Meaning that you didn't want to start timing until all the threads were created, right? Yes, that's what I figured.
>Turns out it doesn't really make >any significant difference. And I was in a hurry and I know it was an >ugly hack and it was simple(ton) :-). [ snip ]
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. Nigel Wade - 06 Feb 2006 11:25 GMT > Hi, > [quoted text clipped - 138 lines] > } > } I would say it's almost certainly a dirty cache problem due to SMP access to shared memory.
Just look at what's happening in the tightest part of the loop. When one thread writes a value in the array the entire cache line containing that particular element of the array is marked dirty (don't forget that in a SMP system this affects the cache line in other processors). This means that before any other thread can either read or write the same cache line that the cache line must be re-read from main memory. Since each thread is writing to consecutive elements of the array, they are almost guaranteed to all be writing to the same cache line.
This is a standard problem in multi-threaded SMP applications. Get each thread to keep the updating value in thread local storage, and only fill in the array when the computation is complete. Compare run times.
 Signature Nigel Wade, System Administrator, Space Plasma Physics Group, University of Leicester, Leicester, LE1 7RH, UK E-mail : nmw@ion.le.ac.uk Phone : +44 (0)116 2523548, Fax : +44 (0)116 2523555
slippymississippi@yahoo.com - 08 Feb 2006 01:01 GMT I think you've got the wrong idea of the benefits of threads.
Most markedly, threads will reduce your memory usage. To put it very simply, a thread is like a lightweight process, so you have reduced memory overhead when compared to processes. In other words, it's a lot less memory intensive to have one application supporting 20 threads reading input over sockets than configuring 20 applications to read that data.
As a result of these memory benefits, you get performance benefits as a byproduct, because the OS has to page swap a lot less often.
I'm coming from a C++ background where I simply created my own threads, established a mutex for my critical data and/or resources (queue), and let fly. So I'm not real fond of Java's screwy implementation of threads, where every object on the face of the planet has its own mutex, and Java gives you a lot of rope to hang yourself waiting for the mutex availability. But one thing I can see looking at your code is that you're using a lot of thread setup to calculate a very simple result. The correct analogue is to perform the same task with processes (does Java even support fork?).
A thread is a very useful tool. A hammer is a very useful tool, too, but you wouldn't want to cut a tree down with it. In the same manner, you want to use threads for what they're designed: server-type processes that read data from multiple inputs or perform other asynchronous tasks with less overhead than processes.
slippymississippi@yahoo.com - 08 Feb 2006 04:07 GMT Another thing I see in your code that sends up red flags is "synchronized(this)" on your thread class ... the correct usage for mutex protection is something like a library card system. You have a central resource (book) that multiple threads (customers) want to use, so you have a system of tracking when the resource is checked out (library card). One person takes the book, the librarian takes the card from the book. The next customer in line has to wait for the librarian to return the card to the book and call "next!" (notify/notifyAll) before the customer can begin reading. Because you want to limit the number of people waiting for books, you assign the library card to the smallest atomic entity possible (book) rather than larger objects (shelf, room, floor, library).
So, given my brief experience with Java, I believe you should be using synchronize on a very small static singleton bean that encapsulates your data and does very little else ... or simply enforce synchronization (because you never know how someone else is going to abuse your classes) by setting the data put/get methods to synchronized on your bean. And yes, unless the data is remarkably atomic, you have to protect your getters as well as your setters, because there's nothing stopping the OS from swapping out a thread in the middle of a data update.
In general, the more atomic the data being protected by the mutex, and the smaller the amount of code updating this data, the more efficient will be your code.
blmblm@myrealbox.com - 09 Feb 2006 10:22 GMT >Another thing I see in your code that sends up red flags is >"synchronized(this)" on your thread class ... It's possible that this was being used not so much to guarantee mutual exclusion as to allow use of wait/notify, which can only be done from within a synchronized block. (The OP's use of wait and notify seemed a little confused to me, but I haven't thought through the problem in enough detail to be sure.)
>the correct usage for >mutex protection is something like a library card system. You have a [quoted text clipped - 21 lines] >the smaller the amount of code updating this data, the more efficient >will be your code. Some good points here, but it might be as well to keep in mind in critiquing the OP's code that it appears to have been a toy program intended to find out how much performance improvement can be achieved with threads and multiple processors, so some advice that would be relevant in other contexts might not apply.
(In fact, I think the OP's program could be rewritten to not use any explicit synchronization except join() -- create the threads, start timing, start the threads, wait for them to finish with join(), combine results, stop timing. If each thread writes results into a separate element of an array, no data synchronization is needed. Some caution is needed, though, to avoid "false sharing" of cache.)
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. blmblm@myrealbox.com - 09 Feb 2006 10:09 GMT >I think you've got the wrong idea of the benefits of threads. > [quoted text clipped - 14 lines] >mutex, and Java gives you a lot of rope to hang yourself waiting for >the mutex availability. Maybe it would make more sense if you regarded it not as a "screwy implementation of threads" but as an implementation of (some of) the "monitors" synchronization mechanism first proposed (AFAICT) by Per Brinch Hansen and C.A.R. Hoare in the 1970s.
One thing Java gives you that thread libraries may not (I'm basing this on imperfect recollections of using the Pthreads library) is a reasonable mechanism for doing kinds of synchronization other than simple mutual exclusion -- e.g., what you would need for a simple producer/consumer setup (in which one thread produces data for another to consume, and you would like the consumer thread to wait until there is data to consume, preferably without busy-waiting).
>But one thing I can see looking at your code >is that you're using a lot of thread setup to calculate a very simple >result. The correct analogue is to perform the same task with >processes (does Java even support fork?). Are you saying that, if the goal is improve performance by splitting computation among processors, this is better done with processes than threads? Why? In some ways processes communicating by message-passing is a less error-prone way to "parallelize" code (exploit multiple processors) than threads communicating via shared memory, and of course if you don't *have* shared memory it's the only choice. But if you can do either one and want maximum performance, it seems to me that threads are the way to go. Do you disagree?
>A thread is a very useful tool. A hammer is a very useful tool, too, >but you wouldn't want to cut a tree down with it. In the same manner, >you want to use threads for what they're designed: server-type >processes that read data from multiple inputs or perform other >asynchronous tasks with less overhead than processes. Or to improve performance if you have more than one processor. (Or even if you don't, and your code can be split up into threads in a way that allows one thread to make progress while another thread waits for something to happen, I/O for example -- i.e., if threads can be used to "mask latency".)
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. Dimitri Maziuk - 09 Feb 2006 17:56 GMT blmblm@myrealbox.com sez:
> Are you saying that, if the goal is improve performance by splitting > computation among processors, this is better done with processes > than threads? Why? Threads (as in "lightweight processes") were originally introduced to avoid the overhead of process creation. Threads are not the only solution: Plan 9 folks came up with low-overhead process creation mechanism, Linux picked it up quickly. In a modern OS with low-cost process creation mechanism the benefits of threads (vs. processes) are small, and the overhead is an extra layer of complexity.
This is a moot point in Java, of course, since Java only has threads. There Is No Fork().
... But if you can do either one and want maximum performance,
> it seems to me that threads are the way to go. Do you disagree? Two main sources of overhead are initial creation and synchronization. If they are fully independent (no synchronization) and long-running (creation cost amortized to close to 0 over runtime), there should be no difference between threads and processes performance-wise.
>>you want to use threads for what they're designed: server-type >>processes that read data from multiple inputs or perform other [quoted text clipped - 5 lines] > for something to happen, I/O for example -- i.e., if threads can be > used to "mask latency".) Only if you have several execution streams that can run asynchronously.
Anyway, the real deal is that threads don't economically scale past 4 CPUs (4-way being the cusp of bang-per-buck curve) and thus 4-16 threads: e.g. Sun will happily sell you a 24-CPU cupboard for a cool million bucks, or 24 1-CPU pizza boxen for about $18,000. Which means that if you're doing it for scalability, you're way better off going fully distributed (hence, independent processes) or with some specialized grid/cluster setup (probably different programming techniques altogether).
Dima
 Signature We're sysadmins. Sanity happens to other people. -- Chris King
blmblm@myrealbox.com - 11 Feb 2006 18:51 GMT >blmblm@myrealbox.com sez: >> [quoted text clipped - 9 lines] >(vs. processes) are small, and the overhead is an extra layer of >complexity. Threads introduce an extra layer of complexity? Hm. I'm not sure what you mean there ....:
I can believe that in current operating systems it's almost as fast to create processes as to create threads. Context switch times might also need to be taken into account, but maybe there too the difference is not really significant.
But my understanding is that a key difference between threads and processes is that the former share an address space and the latter do not, and this leads to rather different programming models -- communication via message-passing versus communication via shared variables. Shared variables in some ways seem easier and more natural, but the need for synchronization means that there are lots of potential pitfalls. Message passing is less easy/natural but avoid some of the potential pitfalls.
All of this is of course moot if the threads/processes are totally independent. I'm coming to this discussion from a background in parallel computing, though, where the threads/processes are rarely *completely* independent.
>This is a moot point in Java, of course, since Java only has >threads. There Is No Fork(). Um, no, but doesn't the "exec" method in Runtime do something along those lines? I have zero experience working with this method; I'm going by the API and by vague recollections of discussions in this group.
>... But if you can do either one and want maximum performance, >> it seems to me that threads are the way to go. Do you disagree? [quoted text clipped - 4 lines] >to close to 0 over runtime), there should be no difference >between threads and processes performance-wise. Probably so. I wasn't really thinking of totally independent threads/processes.
Also, if you're using the additional threads/processes to mask latency, then context switch times need to be taken into account. Maybe there too the difference is less than I think -- threads would seem to win out, but perhaps not by much.
>>>you want to use threads for what they're designed: server-type >>>processes that read data from multiple inputs or perform other [quoted text clipped - 8 lines] >Only if you have several execution streams that can run >asynchronously. Sure. I do enough of this multithreading-for-performance (as a research interest more than for practical reasons) that I'm apt to forget that not everyone thinks in those terms, maybe.
>Anyway, the real deal is that threads don't economically scale >past 4 CPUs (4-way being the cusp of bang-per-buck curve) and [quoted text clipped - 4 lines] >processes) or with some specialized grid/cluster setup (probably >different programming techniques altogether). Traditionally this has been true, yes. I wonder if it will continue to be true in the next few years -- dual-core chips are not the rare beasts they were a few years ago, and one hears so much about multicore chips in general that I wonder whether in a few years there won't be a lot more 4-way and 8-way systems .... But I'm not a hardware designer, and it may be that there's still no way to make shared-memory systems scalable.
In any case, it seems to me that arguments about scalability, while interesting and important, miss the point that if what you have is a 2-processor or 4-processor system, a program that exploits the extra processors is a good thing to have, even if it doesn't scale to more than 4 processors.
Just some thoughts / my two cents' worth.
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. Roedy Green - 12 Feb 2006 05:33 GMT >I can believe that in current operating systems it's almost as fast >to create processes as to create threads. nowhere near as fast, especially if you start several threads using the same classes. The classes are already loaded so there is little to do other than allocate some ram for a stack.
With an external process you must load a partly compressed exe file, and relocate all the addresses in it, and hook up all the DLLs, then you can think about loading the DLLs.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
blmblm@myrealbox.com - 12 Feb 2006 14:45 GMT >>I can believe that in current operating systems it's almost as fast >>to create processes as to create threads. > >nowhere near as fast, especially if you start several threads using >the same classes. The classes are already loaded so there is little to >do other than allocate some ram for a stack. Well, what you're saying here is that *for Java programs* it's a lot faster to start a new thread than a new process, right? which is more relevant in this newsgroup than a general discussion of processes versus threads -- but in the post you're replying to, the discussion had (I thought) become more general.
>With an external process you must load a partly compressed exe file, >and relocate all the addresses in it, and hook up all the DLLs, then >you can think about loading the DLLs. The terminology here is somewhat Windows-specific, but .... Okay, okay, similar ideas (exe files and DLLs) exist in other operating systems. I'm a little skeptical that everything you describe has to be done, however:
>load a partly compressed exe file, I was under the impression that "smart" operating systems can allow multiple processes running the same code to share a copy in memory. But maybe that only works if the processes do something special?
>and relocate all the addresses in it, Relocate addresses? I thought this was something that had been necessary in the not-so-good old days, before the advent of dynamic address translation, but didn't need to be done now? I may be misunderstanding what you're saying though, or more confused than I think about how things work under the hood.
>and hook up all the DLLs, then >you can think about loading the DLLs. Which .... If you're starting a new process to execute the same code as an existing process, that shouldn't be necessary, right? Shouldn't the relevant DLLs already be loaded? Do I not understand how DLLs work? (I would have thought that two processes using the same one could share an in-memory copy.) And if it's executing new code, then you have the same issues with starting a new thread, no?
I'm not really sure about all of this, but notice that the person to whom I replied said this:
>>Threads (as in "lightweight processes") were originally introduced >>to avoid the overhead of process creation. Threads are not the [quoted text clipped - 3 lines] >>(vs. processes) are small, and the overhead is an extra layer of >>complexity. So it may depend on the operating system?
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. Roedy Green - 12 Feb 2006 16:27 GMT >The terminology here is somewhat Windows-specific, but .... Okay, >okay, similar ideas (exe files and DLLs) exist in other operating >systems. I'm a little skeptical that everything you describe has >to be done, however: IIRC one of the old DEC oses RSTS-11? allowed you to pre-relocate programs for fast load. It then only had to load them without processing.
IIRC the ancient Multics OS just memory mapped your program into the address space to start it. Instant on.
Univac-1100 has hardware relocation so load was trivial too.
IBM 360 and Windows made program load into a great production.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Roedy Green - 12 Feb 2006 16:29 GMT >I was under the impression that "smart" operating systems can allow >multiple processes running the same code to share a copy in memory. >But maybe that only works if the processes do something special? With windows DLLs get shared. I don't know about ordinary code. It would have to load at the same spot each person's address space to be shared. Seems to me with NT they made a partial attempt to ensure that happens.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Roedy Green - 12 Feb 2006 16:31 GMT >Relocate addresses? I thought this was something that had been >necessary in the not-so-good old days, before the advent of dynamic >address translation, but didn't need to be done now? I may be >misunderstanding what you're saying though, or more confused than >I think about how things work under the hood. loot at the exe header. All the relocation information is still there. I think what they do is set them up so that IF the program loads at a certain location, there is less fixup to do.
Even then there are a zillion hooks to dlls etc to arrange.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
blmblm@myrealbox.com - 13 Feb 2006 10:30 GMT >>Relocate addresses? I thought this was something that had been >>necessary in the not-so-good old days, before the advent of dynamic [quoted text clipped - 5 lines] >I think what they do is set them up so that IF the program loads at a >certain location, there is less fixup to do. Seems like with every process having its own address space it usually *would* be the case that the program loads at the same location (in the address space) every time?
(Good suggestion about looking at the format of executable files. Something else I really must do sometime.)
>Even then there are a zillion hooks to dlls etc to arrange. Yeah ....
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. Roedy Green - 12 Feb 2006 16:32 GMT >Which .... If you're starting a new process to execute the same >code as an existing process, that shouldn't be necessary, right? >Shouldn't the relevant DLLs already be loaded? the DLLS may be loaded, but the executable contains only symbolic links into the DLL, those all have to be resolved and patched.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
blmblm@myrealbox.com - 13 Feb 2006 10:34 GMT >>Which .... If you're starting a new process to execute the same >>code as an existing process, that shouldn't be necessary, right? >>Shouldn't the relevant DLLs already be loaded? > >the DLLS may be loaded, but the executable contains only symbolic >links into the DLL, those all have to be resolved and patched. Well, you did say, in the post to which I was responding:
>>>then you can think about loading the DLLs.
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. Roedy Green - 12 Feb 2006 16:33 GMT > Do I not understand >how DLLs work? (I would have thought that two processes using the >same one could share an in-memory copy.) And if it's executing new >code, then you have the same issues with starting a new thread, no? DLLs used to share code and memory. Now they just share code. You will notice how java.exe is more spritely if you have recently used it. That is because its DLLs are still hanging around in RAM and don't have to be reloaded.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
blmblm@myrealbox.com - 13 Feb 2006 10:40 GMT >> Do I not understand >>how DLLs work? (I would have thought that two processes using the [quoted text clipped - 4 lines] >You will notice how java.exe is more spritely if you have recently >used it. The name of the executable I use is java, not java.exe. (Said with sort of a :-), because from another of your posts it's apparent that you do know, at least as well as I do, that there are many operating systems.)
>That is because its DLLs are still hanging around in RAM and >don't have to be reloaded. But, but, didn't you just say DLLs just share code, not memory ....
| B. L. Massingill | ObDisclaimer: I don't speak for my employers; they return the favor. Roedy Green - 13 Feb 2006 14:37 GMT >But, but, didn't you just say DLLs just share code, not memory .... In the old days there was no easy mechanism for each user of a dll to have his own instance of the data. The DLL had to manage its pool of users allocating each some workspace from its own pool. Today separate instances in the default I believe each client of the DLL addresses its data at the same virtual address. Presumably there are features to share read-only or read-write areas of RAM too. Those are common OS features.
I don't know Just how clever has Sun been about avoiding literally loading all the standard class files, jitting them and optimising them on every java.exe launch. I would hope they have some scheme to prebuild the standard class set so that the digested classes looks to the OS like read-only code, or a hunk of memory mapped data, or a hunk or read-only memory mapped data or ...
Back in 1985 I invented a scheme called Gespenstering for capturing ram images of a program in flight and turning them into a relocatable executable snapshot. That saved a huge amount of complicated initialisation, similar to loading class files, in my Forth/Abundance language. I did this for DOS. Presumably you could do something similar for Windows. I discuss this at http://mindprod.com/projects/gespenster.html
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Gordon Beaton - 12 Feb 2006 15:35 GMT > nowhere near as fast, especially if you start several threads using > the same classes. The classes are already loaded so there is little > to do other than allocate some ram for a stack. On Linux, the kernel makes no distinction between threads and processes, so they are identically fast to create and schedule. This has nothing to do with classes.
> With an external process you must load a partly compressed exe file, > and relocate all the addresses in it, and hook up all the DLLs, then > you can think about loading the DLLs. On Unix (and Linux), common shared libraries (most notably libc) will already be loaded before the process starts, so they don't need to be loaded again for every process. Also the kernel normally caches recently used code, so it is faster to load the same program a second time.
/gordon
 Signature [ do not email me copies of your followups ] g o r d o n + n e w s @ b a l d e r 1 3 . s e
Roedy Green - 12 Feb 2006 16:34 GMT >On Linux, the kernel makes no distinction between threads and >processes, so they are identically fast to create and schedule. This >has nothing to do with classes. Perhaps we are using a term differently. by Thread I presume shared address space. By process I presume a new address space, e.g. via exec.
What do you mean by the terms?
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Gordon Beaton - 12 Feb 2006 17:06 GMT > Perhaps we are using a term differently. by Thread I presume shared > address space. By process I presume a new address space, e.g. via > exec. > > What do you mean by the terms? On Linux, the kernel only knows about "tasks". They can share none, part, or all of their address space with one another. When they don't share any address space, they look like traditional processes. When they share all of their address space, they look like threads (even though they each task has its own pid). But the kernel does not treat them any differently, so it cannot be claimed that one is faster or cheaper to create than the other.
It is this lack of distinction that previously confused many people who did not understand why "ps" often showed so multiple Java processes on Linux.
/gordon
 Signature [ do not email me copies of your followups ] g o r d o n + n e w s @ b a l d e r 1 3 . s e
Roedy Green - 13 Feb 2006 09:56 GMT >On Linux, the kernel only knows about "tasks". They can share none, >part, or all of their address space with one another. When they don't [quoted text clipped - 7 lines] >who did not understand why "ps" often showed so multiple Java >processes on Linux. I presume it tracks task groups so you can kill all related tasks at once.
But still it takes a heck of a lot longer to start a task from a raw executable than from code already in RAM. The task may look the same after they are born, but the threads have a much easier birth. Perhaps Linux conceptually separates out loading from starting the task. If that is so then indeed tasks and threads could look identical, even in the way they are started.
What we should be aiming for is MOST of the time, to start a program all you do is map its code into virtual memory and start executing, causing an immediate page fault. If something happens to disturb the memory maps, then they should be automatically rebuilt, perhaps in the background when the system in idle, not built from scratch on every execution. You could get very close to the idea that programs never stop, just sleep.
see http://mindprod.com/projects/gespe
|
|