Java Forum / General / July 2006

Effective Multi Core Thread Programming

jonasforssell@yahoo.se - 05 Jul 2006 23:13 GMT
Hello Experts.

This program does not give the expected performance boost on my P4 with
HyperThreading under Linux.

/****************************************************************************/
public class ThreadTester implements Runnable {
   int id;

   public ThreadTester(int tid) {
       this.id = tid;
   }

   public void run() {
       System.out.println("Starting thread: " + id);

       long stime = System.currentTimeMillis();

       for (int i = 0; i < 20000; i++)
           for (int j = 0; j < 20000; j++) {
               double p = i * j;
               p = Math.sqrt(p);
           }

       stime = System.currentTimeMillis() - stime;

       System.out.println("Solution for thread " + id + " took " +
               stime + " ms");
   }

   public static void main(String[] args) {

       if (args.length != 1) throw new IllegalArgumentException(
               "\n\nSyntax is 'java ThreadTester x' where x is number of threads \n");

       int cpu = Integer.parseInt(args[0]);
       Thread[] t = new Thread[cpu];

       for (int i = 0; i < cpu; i++)
           t[i] = new Thread(new ThreadTester(i));

       for (int i = 0; i < cpu; i++)
           t[i].start();

       try {
           for (int i = 0; i < cpu; i++) {
               t[i].join();
           }
       } catch (InterruptedException e) {
           e.printStackTrace();
       }
   }
}
/**************************************************************************/

And this is my output

pc58410@gustav Impact $ java ThreadTester 1
Starting thread: 0
Solution for thread 0 took 5381 ms

pc58410@gustav Impact $ java ThreadTester 1
Starting thread: 0
Solution for thread 0 took 5379 ms

pc58410@gustav Impact $ java ThreadTester 2
Starting thread: 1
Starting thread: 0
Solution for thread 1 took 11247 ms
Solution for thread 0 took 11321 ms

pc58410@gustav Impact $ java ThreadTester 2
Starting thread: 1
Starting thread: 0
Solution for thread 1 took 11241 ms
Solution for thread 0 took 11325 ms

With the threads distributed effectively across the cores, each thread
should take about as long as in the first runs (< 6000 ms).

What have I done wrong? I thought the JVM would distribute the threads
across the CPUs automatically?

Many thanks
/Jonas Forssell, Gothenburg, Sweden
jonasforssell@yahoo.se - 05 Jul 2006 23:31 GMT
Additional input:

My machine has SMP support enabled in the Linux kernel. The system sees
two CPUs.

I'm running JVM 1.4.2 Blackdown, which is based on the Sun source.

/Jonas
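
A quick way to see what the JVM itself makes of "the system sees two CPUs" is to ask it. The sketch below is not from the thread; the class and method names are hypothetical. It uses only `Runtime.availableProcessors()`, which reports logical processors, so a single hyperthreaded P4 would be expected to report 2 even though it has one physical core.

```java
// Hypothetical helper, not part of the original post.
class CpuCount {

    /** Number of logical processors the JVM reports; never less than 1. */
    static int logicalProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        // Logical processors include hyperthreads, not just physical cores.
        System.out.println("Logical processors seen by the JVM: "
                + logicalProcessors());
    }
}
```

A result of 2 therefore says nothing about whether there are two full sets of execution units behind those two logical CPUs.
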
blmblm@myrealbox.com - 05 Jul 2006 23:55 GMT
>Additional input:
>
>My machine has SMP support enabled in the Linux kernel. The system sees
>two CPUs.
>
>I'm running JVM 1.4.2 Blackdown, which is based on the Sun source.

Is this a dual-core machine, or a machine with a single hyperthreaded
processor?  If the latter, be advised that speedups produced by
hyperthreading apparently range from zero to about 30%, with pure
number-crunching (such as what you're doing) apt to *not* take
advantage of the hyperthreading magic.

I admit that I *am* surprised that you got what seems like a
significant slowdown, rather than just a lack of improvement.

Signature

B. L. Massingill
ObDisclaimer:  I don't speak for my employers; they return the favor.

Eric Sosman - 06 Jul 2006 15:51 GMT
>>Additional input:
>>
[quoted text clipped - 11 lines]
> I admit that I *am* surprised that you got what seems like a
> significant slowdown, rather than just a lack of improvement.

    Looks like roughly a 5% slowdown -- not great, but not
terrible.  Keep in mind that the two-thread version does
twice as much work, and still has only one FPU to use for
all those square roots.

Signature

Eric Sosman
esosman@acm-dot-org.invalid
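
The 5% figure can be checked against the timings in the original post. This is a minimal sketch with hypothetical names; it assumes roughly 5380 ms for one thread (the average of the two single-thread runs) and 11325 ms as the wall-clock time of the slower thread in the two-thread run.

```java
// Hypothetical helper, not part of the original thread.
class SlowdownEstimate {

    /**
     * Ratio of the measured wall-clock time to the time the same total
     * work would take if the threads simply ran one after another.
     * 1.0 means no gain and no loss; 0.5 would be perfect two-way scaling.
     */
    static double slowdown(double singleThreadMs, int threads, double wallClockMs) {
        return wallClockMs / (singleThreadMs * threads);
    }

    public static void main(String[] args) {
        // Timings taken from the original post (approximate).
        double ratio = slowdown(5380.0, 2, 11325.0);
        System.out.println("Slowdown factor: " + ratio);
    }
}
```

With those inputs the factor comes out at roughly 1.05: two threads take about 5% longer than running the same two workloads back to back, rather than twice as long.
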

jonasforssell@yahoo.se - 06 Jul 2006 16:05 GMT
Chris does not state his configuration, but surely there must be two
FPUs here?

/Jonas

Eric Sosman wrote:

> >>Additional input:
> >>
[quoted text clipped - 16 lines]
> twice as much work, and still has only one FPU to use for
> all those square roots.
Chris Smith - 06 Jul 2006 19:00 GMT
> Chris does not state his configuration, but surely there must be two
> FPUs here?

I don't know my configuration.  It is a dual-core system; I know that.  
Dell Inspiron E1405.  You can probably look up info as well as I can.

Signature

Chris Smith - Lead Software Developer / Technical Trainer
MindIQ Corporation

jonasforssell@yahoo.se - 06 Jul 2006 07:01 GMT
Could anyone run this on their multi Core/CPU machine and show me some
evidence it works properly in that environment

Thanks
/Jonas

jonasforssell@yahoo.se wrote:

> Additional input:
>
[quoted text clipped - 4 lines]
>
> /Jonas
Chris Smith - 06 Jul 2006 07:17 GMT
> Could anyone run this on their multi Core/CPU machine and show me some
> evidence it works properly in that environment

Starting thread: 0
Solution for thread 0 took 14734 ms

---------------------

Starting thread: 0
Starting thread: 1
Solution for thread 1 took 14641 ms
Solution for thread 0 took 14922 ms

Is that what you wanted?  Looks good to me.

Signature

Chris Smith - Lead Software Developer / Technical Trainer
MindIQ Corporation

hiwa - 06 Jul 2006 00:19 GMT
jonasforssell@yahoo.se :

> Hello Experts.
>
[quoted text clipped - 82 lines]
> Many thanks
> /Jonas Forssell, Gothenburg, Sweden
If it is *practically* a single processor machine, the result is only
natural.
Chris Uppal - 06 Jul 2006 11:47 GMT
>         for (int i = 0; i < 20000; i++)
>             for (int j = 0; j < 20000; j++) {
>                 double p = i * j;
>                 p = Math.sqrt(p);
>             }

How many independent floating-point units does a hyperthreaded P4 have?  I'm
not certain, but I /think/ that HT processors share all their actual execution
units, and that there is only one FP-capable unit.  If so (and it's a fairly
big if) then this loop will saturate the FP unit even if only one thread is
running.  Running more threads will necessarily increase overheads without
providing any benefit.

If that's correct, then the only exploitable parallelism here is that one
thread can be doing the integer arithmetic for the loops while the other is
doing the FP calculations.  But since, with pipelining and whatnot, the
processor could be doing that "in parallel" anyway (even with just one thread),
again, it seems that the two threads will be competing for a resource that
either one of them could saturate.

Another point, and probably a lot more important, is that this test code is not
testing anything useful.  The JITer will still be attempting to optimise the
loops while you are taking your single measurement (per thread).  So, (a) the
performance reported will not be at all representative, and (b) the JITer
itself will be competing for processor time (and cache space, etc) with the
benchmark threads.  Please note that the effects are large, and cannot simply
be ignored  -- if you fail to take account of the JITer's behaviour then the
results are quite likely not even to be indicative.

   -- chris
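
One common way to reduce the JIT effects described above is to repeat the measurement several times in the same JVM and look only at the later rounds. The following is a minimal sketch under stated assumptions, not the original poster's code: the class name, the `sink` field and the round count are hypothetical, `System.nanoTime()` requires Java 5 or later (the thread uses 1.4.2), and accumulating the square roots into a result is an addition here so that the loop cannot be discarded as dead code.

```java
// Hypothetical variant of ThreadTester that times several rounds per thread.
class WarmupBenchmark implements Runnable {

    // Results are written here so the JIT cannot treat the loop as dead code.
    static volatile double sink;

    private final int id;
    private final int size;
    private final int rounds;

    WarmupBenchmark(int id, int size, int rounds) {
        this.id = id;
        this.size = size;
        this.rounds = rounds;
    }

    /** Same nested loop as the original post, but the results are summed. */
    static double work(int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                sum += Math.sqrt((double) i * j);
            }
        }
        return sum;
    }

    public void run() {
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            sink = work(size);
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            // Early rounds include interpretation and JIT compilation time.
            System.out.println("thread " + id + " round " + r + ": "
                    + elapsedMs + " ms");
        }
    }

    /** Starts the given number of threads and waits for all of them. */
    static void runThreads(int threads, int size, int rounds) {
        Thread[] t = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            t[i] = new Thread(new WarmupBenchmark(i, size, rounds));
        }
        for (int i = 0; i < threads; i++) {
            t[i].start();
        }
        try {
            for (int i = 0; i < threads; i++) {
                t[i].join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
    }

    public static void main(String[] args) {
        int threads = args.length > 0 ? Integer.parseInt(args[0]) : 2;
        runThreads(threads, 20000, 5);
    }
}
```

If the first round or two are noticeably slower than the rest, that difference is a rough indication of how much of the single measurement in the original program went to warm-up rather than to the loop itself. This does not remove the contention for a shared FP unit; it only makes the numbers being compared more stable.
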
jonasforssell@yahoo.se - 06 Jul 2006 12:11 GMT
You are most probably correct. Hyperthreading does not double the
floating-point units and, as the previous post shows, there are examples
where this will truly execute in parallel. An HT P4 is not one of them.

Thanks for all the feedback!
/Jonas

Chris Uppal wrote:

> >         for (int i = 0; i < 20000; i++)
> >             for (int j = 0; j < 20000; j++) {
[quoted text clipped - 26 lines]
>
>     -- chris
Thomas Hawtin - 06 Jul 2006 16:10 GMT
> You are most probably correct. Hyperthreading does not double the
> floating-point units and, as the previous post shows, there are examples
> where this will truly execute in parallel. An HT P4 is not one of them.

Funnily enough, the eight-core, thirty-two-thread Sun Niagara will
presumably show the same results too. It has only one floating-point
unit (shared across the cache crossbar). Sun don't recommend it for
applications with more than 1% floating point.

It seems that where multiple hardware threads are most useful is in doing
work during memory latency. Intel's talk of sharing functional units
within the same cycle is good marketing, but apparently not that
significant in practice.

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.