Java Forum / Virtual Machine / March 2006
Optimal x86-32 Sun Hotspot code generation?
Adam Warner - 24 Mar 2006 02:14 GMT
Hi all,
I'm trying to create the fastest way to cast a Java long to an int while preserving array bounds checking. This is the approach I suspect should be optimal:
public final static int toIntIndex(long index) {
    int high = (int) (index >>> 32);
    if (high != 0) throw new ArrayIndexOutOfBoundsException();
    return (int) index;
}
If the long index is non-negative and in int range, high will be 0. If the long index is negative, high will be non-zero. If the long index is at least 2^31 but below 2^32, it will pass this test, but the cast then yields a negative int, so it will still be caught by Java's int bounds checking.
I believe the unsigned right shift by 32 should permit the test to be conducted upon the 32 most significant bits of the 64-bit value; that is, no shift should actually be performed on 32-bit platforms.
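The three cases described above can be checked with a small sketch. The class name, the `rejected` helper and the sample values are assumptions made for illustration; only `toIntIndex` itself comes from the post.

```java
// Illustrative sketch: class name, helper and sample values are assumptions.
final class ToIntIndexCases {
    // Same check as in the post: reject any index whose upper 32 bits are non-zero.
    static int toIntIndex(long index) {
        int high = (int) (index >>> 32);
        if (high != 0) throw new ArrayIndexOutOfBoundsException();
        return (int) index;
    }

    // Hypothetical helper: true if toIntIndex itself rejects the index.
    static boolean rejected(long index) {
        try {
            toIntIndex(index);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long[] samples = {
            0L, Integer.MAX_VALUE,        // in int range: pass through unchanged
            1L << 31, (1L << 32) - 1,     // pass the test but come out as negative ints
            1L << 32, -1L                 // upper 32 bits non-zero: rejected here
        };
        for (long s : samples) {
            if (rejected(s)) {
                System.out.println(s + " -> rejected by toIntIndex");
            } else {
                System.out.println(s + " -> " + toIntIndex(s));
            }
        }
    }
}
```

Values in the range 2^31 to 2^32 - 1 are not rejected by the method itself; they come back as negative ints, which is what lets the JVM's ordinary array bounds check catch them.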
I've induced the Sun Mustang b75 HotSpot debug server JIT to compile toIntIndex and this is the generated assembly:
{method}
 - klass: {other class}
 - method holder: 'LongIndex'
 - constants: 0x0693c688{constant pool}
 - access: 0x81000019 public static final
 - name: 'toIntIndex'
 - signature: '(J)I'
 - max stack: 3
 - max locals: 3
 - size of params: 2
 - method size: 22
 - vtable index: -2
 - code size: 21
 - code start: 0xb0e45210
 - code end (excl): 0xb0e45225
 - method data: 0xb0e47a58
 - checked ex length: 0
 - linenumber start: 0xb0e45225
 - localvar length: 0
#
#  int ( long, half )
#
#r063 ESP+20: parm 0: long
#r062 ESP+16: parm 0: long
# -- Old ESP -- Framesize: 16 --
#r061 ESP+12: return address
#r060 ESP+ 8: pad2, in_preserve
#r059 ESP+ 4: pad2, in_preserve
#r058 ESP+ 0: pad2, in_preserve
#
abababab   N1: #  B1 <- B3 B2  Freq: 6.66667
abababab
000   B1: #  B3 B2 <- BLOCK HEAD IS JUNK  Freq: 6.66667
000     # stack bang
        PUSHL  EBP
        SUB    ESP,8   # Create frame
00e     MOV    ECX,[ESP + #16]
        MOV    EBX,[ESP + #20]
016     MOV    ECX.lo,ECX.hi
        SHR    ECX.lo,#32-32
        XOR    ECX.hi,ECX.hi
01a     MOV    ECX,ECX.lo
01a     TEST   ECX,ECX
01c     Jne,s  B3  P=0.000000 C=4.466667
01c
01e   B2: #  N1 <- B1  Freq: 4.46666
01e     MOV    ECX,[ESP + #16]
        MOV    EBX,[ESP + #20]
026     MOV    EAX,ECX.lo
028     ADD    ESP,8   # Destroy frame
        POPL   EBP
        TEST   PollPage,EAX    ! Poll Safepoint
032     RET
032
033   B3: #  N1 <- B1  Freq: 1e-06
033     MOV    ECX,#-67
038     NOP    # Pad for loops and calls
039     NOP    # Pad for loops and calls
03a     NOP    # Pad for loops and calls
03b     CALL,static  wrapper for: uncommon_trap
        # LongIndex::toIntIndex @ bci:10  L0=_  L1=_  L2=_
        #
040     INT3   ; ShouldNotReachHere
040
This of course is the non-inlined version of toIntIndex. I don't understand some of the disassembly syntax (.hi, .lo?) but it at least appears clear that a redundant "SHR ECX.lo,#32-32" instruction is being generated. I'd appreciate confirmation my reasoning is correct/this is an actual inefficiency before filing any report with Sun.
Regards, Adam
Brendan - 24 Mar 2006 11:22 GMT
Hi,
Does this thing have an optimizer that you forgot to turn on?
The stack frame is a waste of time, they've inserted padding in code that should never run, the branch prediction is wrong (forward branches are assumed to be taken), the register usage and chosen instructions are a joke, etc.
<some alignment here if you like>
convertSignedLongToUnsignedInt:
    cmp dword [esp+8],0          ; high 32 bits of the long must be zero
    je .withinBounds
    MOV ECX,#-67
    CALL,static wrapper for: uncommon_trap
    INT3                         ; ShouldNotReachHere
<some alignment here if you like>
.withinBounds:
    mov eax,[esp+4]              ; low 32 bits are the result
    TEST PollPage,EAX            ! Poll Safepoint ;Don't know what this is meant to do! :-)
    ret
Cheers,
Brendan
Adam Warner - 25 Mar 2006 00:21 GMT
> Hi,
> [quoted text clipped - 4 lines]
> are assumed to be taken), the register usage and chosen instructions are
> a joke, etc.
I now realise it's a Catch 22. The undocumented option -XX:+PrintOptoAssembly "is not final ASM code but it's very close": <http://www.javalobby.org/java/forums/m91938827.html>
But this undocumented option is only available in the fastdebug builds. I remember reading somewhere that Sun does not have legal permission to distribute the disassembler with their release products. Thus one can only disassemble code generated by these builds: <http://blogs.sun.com/roller/page/kto?entry=mustang_jdk_6_0_fastdebug>
"So using a fastdebug build might provide some information you wouldn't get from running a product build. It is slower, but no where near as slow as a debug build. The optimization isn't as high as with the product build, but since the assert checking and debug code exists in these builds, the code isn't the same anyway."
This explains the redundant stack frame and likely invalidates any inference one can make about the quality of release build assembly code. I apologise for not appreciating this earlier.
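For anyone wanting to reproduce the dump, a warm-up loop along the following lines is one plausible way to get the server JIT to compile the method. This is only a sketch: the class name, the `warmUp` helper and the iteration count are assumptions, and the JIT may well inline toIntIndex into the loop rather than compile it standalone as in the listing above.

```java
// Sketch of a warm-up harness; names and iteration count are assumptions.
// Per the thread, the flag below is only honoured by fastdebug builds:
//   java -server -XX:+PrintOptoAssembly ToIntIndexWarmup
final class ToIntIndexWarmup {
    // The method from the original post.
    static int toIntIndex(long index) {
        int high = (int) (index >>> 32);
        if (high != 0) throw new ArrayIndexOutOfBoundsException();
        return (int) index;
    }

    // Calls toIntIndex often enough that the JIT is likely to treat it as hot.
    // The running sum keeps the calls from being optimised away as dead code.
    static long warmUp(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += toIntIndex(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(warmUp(1_000_000));
    }
}
```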
Regards, Adam
Chris Uppal - 24 Mar 2006 11:58 GMT
> I believe the unsigned right shift by 32 should permit the test to be
> conducted upon the 32 most significant bits of the 64-bit value, that is
> no shift should actually be performed on 32-bit platforms.
I'm somewhat puzzled by this sentence. I may well be misunderstanding you but it sounds as if you assume that an int is 64-bit on a 64-bit platform or possibly that a long is 32-bit on a 32-bit platform. That's not the case: ints are 32-bit, and longs 64-bit, on every platform.
-- chris
Adam Warner - 24 Mar 2006 23:56 GMT
>> I believe the unsigned right shift by 32 should permit the test to be
>> conducted upon the 32 most significant bits of the 64-bit value, that
[quoted text clipped - 4 lines]
> platform or possibly that a long is 32-bit on a 32-bit platform. That's
> not the case: ints are 32-bit, and longs 64-bit, on every platform.
I had a mental model of the long being transferred in two 32-bit registers on a 32-bit platform. Let's call the registers H and L and write the long as HL. In a higher level language to obtain the 32 most significant bits of the long HL one could unsigned shift the long right by 32 and perhaps cast the result to 32 bits. But at the lower level I was hoping the compiler would say "let's just return the value of H".
Whether the value of H could be returned without shifting on a 64-bit platform could depend upon whether the architecture permits 64-bit registers to be accessed as independent 32-bit registers (which is why I made that qualification).
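The H/L model can be written out at the Java level roughly as follows. The class and method names are assumptions for illustration; whether a given JIT actually reduces `high` to a plain register read is exactly the open question in this thread.

```java
// Sketch of the HL mental model; all names here are hypothetical.
final class LongHalves {
    // H: the 32 most significant bits, obtained by unsigned shift and narrowing cast.
    static int high(long value) {
        return (int) (value >>> 32);
    }

    // L: the 32 least significant bits; the narrowing cast simply discards H.
    static int low(long value) {
        return (int) value;
    }

    // Reassembles HL. The mask stops a negative L from sign-extending into H.
    static long join(int high, int low) {
        return ((long) high << 32) | (low & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long hl = 0x0000000180000000L;
        System.out.println("H = 0x" + Integer.toHexString(high(hl)));
        System.out.println("L = 0x" + Integer.toHexString(low(hl)));
        System.out.println("round trip ok: " + (join(high(hl), low(hl)) == hl));
    }
}
```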
Regards, Adam
Roedy Green - 25 Mar 2006 01:17 GMT
>I had a mental model of the long being transferred in two 32-bit registers
>on a 32-bit platform. Let's call the registers H and L and write the long
>as HL. In a higher level language to obtain the 32 most significant bits
>of the long HL one could unsigned shift the long right by 32 and perhaps
>cast the result to 32 bits. But at the lower level I was hoping the
>compiler would say "let's just return the value of H".
Yes, at least Jet does just that.
Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Adam Warner - 25 Mar 2006 04:23 GMT
>>I had a mental model of the long being transferred in two 32-bit registers
>>on a 32-bit platform. Let's call the registers H and L and write the long
[quoted text clipped - 4 lines]
>
> Yes, at least Jet does just that.
Thanks Roedy, that's great to know! It looks like I will be able to build relatively efficient long index bounds checking upon the JVM. By only checking that the H bits are zero, the L check remains with the JVM (it's not duplicated).
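A minimal sketch of that division of labour, assuming a hypothetical accessor named `get` over a plain byte array: toIntIndex only rejects a non-zero H, and the JVM's own array bounds check deals with whatever L turns out to be.

```java
// Sketch only: the class name and the get accessor are hypothetical.
final class LongIndexedArray {
    // The method from the original post: checks H only.
    static int toIntIndex(long index) {
        int high = (int) (index >>> 32);
        if (high != 0) throw new ArrayIndexOutOfBoundsException();
        return (int) index;
    }

    // No explicit range check on L here. The array access below performs it,
    // including for L values that became negative after the narrowing cast.
    static byte get(byte[] data, long index) {
        return data[toIntIndex(index)];
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30};
        System.out.println(get(data, 1L));
        long[] bad = {3L, 1L << 31, 1L << 32, -1L};
        for (long index : bad) {
            try {
                get(data, index);
                System.out.println(index + " unexpectedly accepted");
            } catch (ArrayIndexOutOfBoundsException e) {
                System.out.println(index + " rejected");
            }
        }
    }
}
```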
Regards, Adam
Grumble - 24 Mar 2006 17:17 GMT
> I'm trying to create the fastest way to cast a Java long to an int while
> preserving array bounds checking. This is the approach I suspect should be
[quoted text clipped - 14 lines]
> conducted upon the 32 most significant bits of the 64-bit value, that is
> no shift should actually be performed on 32-bit platforms.
For what it's worth, out of curiosity, I wrote a similar function in C.
#include <stdint.h>
void abort(void);
int32_t foo(int64_t index)
{
    int32_t high = (uint64_t)index >> 32;
    if (high != 0) abort();
    return index;
}
for which gcc-3.4.4 -O2 generates the following code.
_foo:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        /*
           What for? Stack alignment?
           Why won't it go away with -mpreferred-stack-boundary=4 ??
        */
        movl    12(%ebp), %edx
        movl    8(%ebp), %eax
        testl   %edx, %edx
        jne     L4
        leave
        ret
L4:
        call    _abort
and gcc-3.4.4 -Os -fomit-frame-pointer generates the following code.
_foo:
        cmpl    $0, 8(%esp)
        movl    4(%esp), %eax
        je      L2
        call    _abort
L2:
        ret
(I'd switch je to jne and exchange call _abort and ret.)
Skarmander - 24 Mar 2006 19:04 GMT
>> I'm trying to create the fastest way to cast a Java long to an int while
>> preserving array bounds checking. This is the approach I suspect should be
[quoted text clipped - 34 lines]
> /*
> What for? Stack alignment?
Yes. In particular, the Pentiums and in particular SSE do not like data that's not royally aligned.
> Why won't it go away with -mpreferred-stack-boundary=4 ??
Because -mpreferred-stack-boundary is the base 2 logarithm of the number of bytes to align to, not the actual number of bytes. In this case, you've asked for a stack alignment of 16 bytes, which is the default. Try -mpreferred-stack-boundary=2.
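The arithmetic behind that option can be spelled out in a few lines. This is only an illustration of the base 2 logarithm relationship; the class and method names are made up and nothing here is part of gcc.

```java
// Illustration of the exponent-to-bytes relationship; names are hypothetical.
final class StackBoundary {
    // An option value n means the stack is kept aligned to 2^n bytes.
    static int alignmentBytes(int preferredStackBoundary) {
        return 1 << preferredStackBoundary;
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 4; n++) {
            System.out.println("-mpreferred-stack-boundary=" + n
                    + " aligns to " + alignmentBytes(n) + " bytes");
        }
    }
}
```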
S.
Eric Albert - 25 Mar 2006 11:03 GMT
> >> I'm trying to create the fastest way to cast a Java long to an int while
> >> preserving array bounds checking. This is the approach I suspect should be
[quoted text clipped - 44 lines]
> asked for a stack alignment of 16 bytes, which is the default. Try
> -mpreferred-stack-boundary=2.
As far as I know, Mac OS X is the only widely used x86 operating system to use 16-byte stack alignment by default for 32-bit. Everyone else uses 4-byte alignment. For 64-bit, though, the AMD64 ABI requires 16-byte stack alignment.
-Eric
Eric Albert ejalbert@cs.stanford.edu http://outofcheese.org/
Skarmander - 25 Mar 2006 20:58 GMT <snip>
>>> _foo:
>>> pushl %ebp
[quoted text clipped - 14 lines]
> to use 16-byte stack alignment by default for 32-bit. Everyone else
> uses 4-byte alignment.
Well, it's true that, say, Windows doesn't *need* 16-byte alignment, but recent gccs use 16-byte alignment by default for x86-32. This does often raise eyebrows, but there seems to be some truth to the defense that those extra bytes are a small price to pay for avoiding the risk of performance loss when the alignment is necessary (for SSE and friends). The Pentium 3 and 4 allegedly like 16-byte alignment better as well, even without SSE (I've never tested any of this, mind you).
S.
Eric Albert - 26 Mar 2006 09:24 GMT
> <snip>
> >>> _foo:
[quoted text clipped - 24 lines]
> and 4 allegedly like 16-byte alignment better as well, even without SSE
> (I've never tested any of this, mind you).
Ah; you're completely right about gcc. I'd missed that it used -mpreferred-stack-boundary=4 by default when not using -Os. The difference in Apple's gcc is that -mpreferred-stack-boundary=4 is also set for -Os, since the system's ABI requires it.
-Eric