Java Forum / First Aid / February 2005
Seen java -server optimization vary 50%?
Don Taylor - 26 Feb 2005 06:28 GMT Hotspot 1.4.1-01 running as java -server appname running on a 2+ghz AMD part 512 meg of memory, installed straight off th Knoppix 3.2/Debian CD, no other users, no other activity, no networking, just run this from the command line and do println().
Wrote some code that gets an expression from the user and then beats itself to death doing bigInteger multiply, mod, gcd, etc.
I time and print this every hundred iterations.
The odd part, every hundred iterations of a run will take the same time to better than 1/2% accuracy. But if I kill it and start it over the time can be substantially different, but still each hundred is the same to better than 1/2%. These hundred iterations can take anywhere from ten to fifteen minutes, depending on the run. Not recompiling between the runs, doing nothing but killing it and starting it over. What I do now is run and kill a few times, until I get one of the faster times, and then leave it running for days.
Ever seen anything like this? The only thing I can imagine is that the runtime optimizer is doing a better or worse job for some unknown reason, at startup, and then not changing it when the thousands of iterations follow.
Any idea how I might get it to choose the fast time every time? Thanks
dar7yl - 27 Feb 2005 07:39 GMT > Hotspot 1.4.1-01 running as java -server appname > running on a 2+ghz AMD part 512 meg of memory, [quoted text clipped - 24 lines] > Any idea how I might get it to choose the fast time every time? > Thanks This looks suspiciously like a garbage-collection problem. The garbage collector can run at any time, and sometimes can take polynomial time to do it's dubious magic.
You can experiment with System.gc() to force garbage collection at your convenience instead of letting it freewheel. I have had significant improvements in my app by forcing gc() at checkpoints, especially when I am aware of considerable object creation and deletion activity.
regards, Dar7yl
Don Taylor - 27 Feb 2005 18:09 GMT ...
>> I time and print this every hundred iterations. >> [quoted text clipped - 7 lines] >> What I do now is run and kill a few times, until I get one >> of the faster times, and then leave it running for days.
>This looks suspiciously like a garbage-collection problem. The >garbage collector can run at any time, and sometimes can take >polynomial time to do it's dubious magic. If it were garbage collection then I would tend to think that the variation from one iteration of 100 cycles to the next 100 cycles would vary more. I have tens of thousands of 100 cycle iterations, with times varying from 596301 to 596335 milliseconds. But if I kill this and restart it I can easily discover the next tens of thousands of 100 cycle iterations will vary from 967208 to 967241 milliseconds. Kill it again and find any other number between 600 and 1000 seconds/100 iterations, repeatably.
That does not sound like the garbage collector to me.
>You can experiment with System.gc() to force garbage collection >at your convenience instead of letting it freewheel. I have had >significant improvements in my app by forcing gc() at checkpoints, >especially when I am aware of considerable object creation and >deletion activity. Do you still think this sounds like gc? I'm skeptical.
thanks
Patricia Shanahan - 27 Feb 2005 12:33 GMT > Hotspot 1.4.1-01 running as java -server appname > running on a 2+ghz AMD part 512 meg of memory, installed [quoted text clipped - 27 lines] > Any idea how I might get it to choose the fast time every > time? Thanks I have seen this sort of behavior in C and Fortran programs whose most frequently executed loops do a lot of accesses to data structures that were all allocated once, at the start of the run.
Modern processors use a lot of indexes, accessed by virtual or physical address, to control caching and address translation.
See http://www.pcguide.com/ref/mbsys/cache/funcMapping-c.html for an explanation of some important indexing approaches. Unfortunately, fully associative indexing is VERY expensive, so most indexing is direct mapped or set associative.
For both direct mapped and set associative with reasonably small set size, combinations of addresses can lead to replacement due to collisions, even when there is plenty of space. Other combinations of addresses don't have those collisions.
Patricia
Don Taylor - 27 Feb 2005 18:15 GMT >> Hotspot 1.4.1-01 running as java -server appname >> running on a 2+ghz AMD part 512 meg of memory, installed [quoted text clipped - 27 lines] >> Any idea how I might get it to choose the fast time every >> time? Thanks
>I have seen this sort of behavior in C and Fortran programs >whose most frequently executed loops do a lot of accesses to >data structures that were all allocated once, at the start >of the run. If the first 100 passes took substantially longer I would not be at all surprised. (This is a tiny little loop, a couple of lines literally, that takes all the time but there is setup before it gets to this) But every 100 loops take between 596301 and 596335 milliseconds, thousands and thousands of times it takes this long. But if I do nothing but kill it and restart it I might find it takes between 967328 and 967365 milliseconds, thousands and thousands of times it will take that long.
>Modern processors use a lot of indexes, accessed by virtual >or physical address, to control caching and address translation.
>See >http://www.pcguide.com/ref/mbsys/cache/funcMapping-c.html >for an explanation of some important indexing approaches. >Unfortunately, fully associative indexing is VERY expensive, >so most indexing is direct mapped or set associative. understood
>For both direct mapped and set associative with reasonably >small set size, combinations of addresses can lead to >replacement due to collisions, even when there is plenty of >space. Other combinations of addresses don't have those >collisions. I'm still skeptical why tens of thousands of iterations can all take the same time, and that time varies by more than 50% if I happen to kill and restart the exact same binary with the exact same (nearly zero) load on the machine.
Does it still sound like your explanation is plausible? Why the huge variation from run to run and almost no variation inside a run?
thanks
Patricia Shanahan - 27 Feb 2005 18:36 GMT > Patricia Shanahan <patricia_shanahan@earthlink.net> > writes: [quoted text clipped - 77 lines] > > thanks I know the explanation is plausible in general, because in the case of the C and Fortran programs I had access to special operating system hooks and a hardware logic analyzer on the main bus. I could see exactly what was going on, including the differences in mappings that caused the differences in performance. They exhibited exactly the behavior you describe. As the program loaded and allocated its data structures the address mappings were established, and that controlled the performance for the rest of that run. Each new run rolled the memory mapping dice again, and might get different inner loop performance.
To decide whether this explanation applies to your program, you need to look at your most heavily executed loops. Are most references to a fixed set of data structures, all allocated in the first few iterations? If so, it is at least a possible explanation. Conversely, if you are using data structures that are created and garbage collected repeatedly throughout the run, it is less likely to apply.
Paradoxically, this problem is more likely on a lightly loaded machine, especially one that is dedicated to a single application. The less contention there is for resources such as physical memory, the more likely the program is to go on running with physical addresses, as well as virtual addresses, that were established early in the run.
Patricia
Don Taylor - 27 Feb 2005 21:05 GMT >> ... >> If the first 100 passes took substantially longer I would [quoted text clipped - 18 lines] >> >> thanks
>To decide whether this explanation applies to your program, >you need to look at your most heavily executed loops. Are [quoted text clipped - 3 lines] >structures that are created and garbage collected repeatedly >throughout the run, it is less likely to apply. This is the loop. Everything, except 'd', is a bigInteger. The only 'new' grabs the date/time, once every 10-15 minutes.
while(a.multiply(a).compareTo(n)<=0&&(T.compareTo(one)==0||T.compareTo(n)== 0)){ b = a.modPow(w, n); T = n.gcd(b.modPow(n, n).subtract(b)); if (a.mod(hundred).compareTo(zero) == 0) { d = new Date(); finishtime = d.getTime(); duration = finishtime-starttime; System.out.println("a = " + a.toString() + ", duration=" + duration); starttime = finishtime; } a = a.add(one); }
>Paradoxically, this problem is more likely on a lightly >loaded machine, especially one that is dedicated to a single >application. The less contention there is for resources such >as physical memory, the more likely the program is to go on >running with physical addresses, as well as virtual >addresses, that were established early in the run. Is there any diagnostic information I could get to confirm what the underlying behavior really is? Or any guidelines you might have to force the behavior to be towards the fast execution, as opposed to the slow?
Thanks
Patricia Shanahan - 27 Feb 2005 21:34 GMT > Patricia Shanahan <patricia_shanahan@earthlink.net> > writes: ...
>> To decide whether this explanation applies to your >> program, you need to look at your most heavily executed [quoted text clipped - 21 lines] > a = a.add(one); > } I don't know, because the BigInteger arithmetic could contain a lot of hidden object creation. For example, if neither of the gcd operands is zero, gdc will create two new MutableBigInteger instances, do a gcd using them, and return a new BigInteger, at least three object creations.
>> Paradoxically, this problem is more likely on a lightly >> loaded machine, especially one that is dedicated to a [quoted text clipped - 10 lines] > > Thanks Diagnosis depends on the performance tools you have available for your operating system. Do you have access to e.g. cache miss counts for any caches? Look for consistent differences between good and bad runs.
If you have hardware specs, look for caches and translation tables with low associativity. Direct mapped is the most suspect, but I've seen it happen with two way set associative.
My main approach was to talk to the operating system developers about their memory allocation strategies. However, there is one slightly insane user-implementable technique that has sometimes helped. Before you try to run your program, deliberately overload the memory. Run a bunch of memory hog jobs whose total memory exceeds the size of the machine and let them fight over the memory. That had the effect of randomizing some mapping structures. Of course, make sure you clean them all out before trying your program.
Otherwise, if it is memory mapping, your current technique of seeing whether you have a good or bad mapping and restarting until you get an empirically good mapping is about the best that can be done.
Patricia
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|