Java Forum / General / May 2007
MD5 algorithm
Christoph Burschka - 14 May 2007 11:38 GMT I'm trying to implement the MD5 algorithm. (Note: I'm not looking for an existing MD5 library; this is an exercise). It compiles and returns something that looks like a hash superficially, but that's clearly the wrong result. Even if there were a Unicode/Ascii problem, the empty string should return the correct result (since it is always padded to one 1-bit and 511 0-bits), but it doesn't.
Also, several strings (say, "" and "The") return the same incorrect hash - but by dumping the 128-bit hash in each iteration, I've found only very little in common between the intermediate steps, only the final one is completely identical.
This is the function, as implemented from the pseudocode on Wikipedia. http://en.wikipedia.org/wiki/MD5#Pseudocode
--------------------------------------------------------
Constants:
int[4] s = { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476 }; int[64] shift= { 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21 };
int[64] k;
for (int i=0;i<64;i++) { k[i]=(int)(Math.abs(Math.sin(i+1))*1073741824*4); }
private static int[] processFin(int[] message) { int[] result=s.clone(); // initialize // message is already padded, just split: int blocks=message.length/16; for (int i=0;i<blocks;i++) { // for each 16-int block int[] state=result.clone(); // copy state for (int j=0;j<64;j++) { int f,g,temp; if (j<16) { f= state[1] & state[2] | ~state[1] & state[3]; g=j; } else if (j<32) { f=state[3] & state[1] | ~state[3] & state[2]; g=(5*j+1)%16; } else if (j<48) { f=state[1]^state[2]^state[3]; g=(3*j+5)%16; } else { f=state[2]^ (state[1] | ~state[3]); g=(7*j)%16; } temp=state[3]; state[3]=state[2]; state[2]=state[1]; state[1]=(state[0]+f+k[j]+message[i*16+g]) << shift[j]; state[0]=temp; } for(int j=0;j<4;j++) result[i]+=state[i]; } return result; }
--------------------------------------------- The hex string returned for both "" and "The" is 3c00a101efcdab8998badcfe10325476.
I just can't see what's wrong with the code - all parts of the pseudocode description seem to be implemented exactly as they should be... unless the signed ints don't wrap modulo 2^32 as unsigned ones should. But they do for all examples I could think of, and anyway, that shouldn't cause those hashes to be exactly the same...
--cb
sebbaer - 14 May 2007 12:03 GMT Her i have found some intressting webpages about it. Maybe it helps.
http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html
Christoph Burschka - 14 May 2007 12:40 GMT sebbaer schrieb:
> Her i have found some intressting webpages about it. Maybe it helps. > > http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html If I had found something that helped me in the first 10 Google results, I wouldn't have had to ask here, now would I? :)
The only thing I could use would be some kind of step-by-step walkthrough for the MD5 calculation of a short string, so I could check it against the intermediate steps in my own calculation
--cb
Roedy Green - 14 May 2007 14:21 GMT On Mon, 14 May 2007 13:40:38 +0200, Christoph Burschka <christoph.burschka@rwth-aachen.de> wrote, quoted or indirectly quoted someone who said :
>The only thing I could use would be some kind of step-by-step >walkthrough for the MD5 calculation of a short string, so I could check >it against the intermediate steps in my own calculation Can you find Sun's code? Do you have an assembler level bebugger than will let you single step through the code? --
Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Roedy Green - 14 May 2007 15:14 GMT On Mon, 14 May 2007 13:40:38 +0200, Christoph Burschka <christoph.burschka@rwth-aachen.de> wrote, quoted or indirectly quoted someone who said :
>The only thing I could use would be some kind of step-by-step >walkthrough for the MD5 calculation of a short string, so I could check >it against the intermediate steps in my own calculation --
Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Oliver Wong - 14 May 2007 17:29 GMT > sebbaer schrieb: >> Her i have found some intressting webpages about it. Maybe it helps. [quoted text clipped - 3 lines] > If I had found something that helped me in the first 10 Google results, > I wouldn't have had to ask here, now would I? :) You'd be surprised...
- Oliver
Filip Larsen - 14 May 2007 16:29 GMT > private static int[] processFin(int[] message) > { [quoted text clipped - 3 lines] > for (int i=0;i<blocks;i++) > { // for each 16-int block It seems you are skipping the last 16 element in messages. Perhaps the "i < blocks" condition should have been "i <= blocks"?
Regards,
 Signature Filip Larsen
Christoph Burschka - 14 May 2007 22:24 GMT >> private static int[] processFin(int[] message) >> { [quoted text clipped - 8 lines] > > Regards, Mh... you had me there for a few moments. But no, the message is already padded to a multiple of 16, so blocks is exactly the right number of blocks (if the message has 32 elements, blocks will be 2, and the loop will go from block #0 to block #1).
If i is the number of the block, the messages in the block are message[16*i .. 16*i+15]. Since blocks is exactly the message length divided by 16, the loop I have now goes from message[0] to message[(message.length / 16 -1) * 16 +15], or message[message.length-1]. If I changed the boundary condition to i<=blocks, I'd be running out of the array boundary.
--cb
Filip Larsen - 14 May 2007 23:12 GMT > Mh... you had me there for a few moments. But no, the message is already padded > to a multiple of 16, so blocks is exactly the right number of blocks (if the > message has 32 elements, blocks will be 2, and the loop will go from block #0 to > block #1). Ok, hard to discount that without seeing the padding code.
> for(int j=0;j<4;j++) result[i]+=state[i]; Next bug candidate is this line, where it looks like the arrays should have been indexed by j and not i.
Regards,
 Signature Filip Larsen
Christoph Burschka - 15 May 2007 01:06 GMT >> Mh... you had me there for a few moments. But no, the message is >> already padded [quoted text clipped - 12 lines] > > Regards, Wow... that thing is literally crawling with bugs. Unfortunately, that still didn't do the trick - now the result is different again, but the hashes are still wrong and also nearly identical:
[]: 0112a34159cdab8981f7dcfea9eb2476 [ ]: 0112a34159cdab8981f7dcfea9eb2476 [The]: 8112a34159cdab8981f7dcfea9eb2476 [ttr]: 8112a34159cdab8981f7dcfea9eb2476 [ t t]: 7512a34159cdab8981f7dcfea9eb2476 [The quick brown fox jumps over the lazy dog]: f3b76b417dcdab89cf30dcfe97eb2476
I think I'll sleep on it and look at the code again when I'm more awake. Thanks very much for the help! :)
-- cb
rossum - 14 May 2007 16:44 GMT >I'm trying to implement the MD5 algorithm. (Note: I'm not looking for an >existing MD5 library; this is an exercise). It compiles and returns [quoted text clipped - 12 lines] > >-------------------------------------------------------- [snip]
>private static int[] processFin(int[] message) >{ > int[] result=s.clone(); // initialize > // message is already padded, just split: Have you implemented the padding correctly? To quote RFC 1321: "3.1 Step 1. Append Padding Bits
The message is "padded" (extended) so that its length (in bits) is congruent to 448, modulo 512. That is, the message is extended so that it is just 64 bits shy of being a multiple of 512 bits long. Padding is always performed, even if the length of the message is already congruent to 448, modulo 512.
Padding is performed as follows: a single "1" bit is appended to the message, and then "0" bits are appended so that the length in bits of the padded message becomes congruent to 448, modulo 512. In all, at least one bit and at most 512 bits are appended.
3.2 Step 2. Append Length
A 64-bit representation of b (the length of the message before the padding bits were added) is appended to the result of the previous step. In the unlikely event that b is greater than 2^64, then only the low-order 64 bits of b are used. (These bits are appended as two 32-bit words and appended low-order word first in accordance with the previous conventions.)
At this point the resulting message (after padding with bits and with b) has a length that is an exact multiple of 512 bits. Equivalently, this message has a length that is an exact multiple of 16 (32-bit) words. Let M[0 ... N-1] denote the words of the resulting message, where N is a multiple of 16."
Note the endianness of the 64 bit message length field, which is not clearly explained in the Wikipedia article - two big-endian 32 bit words in little-endian order. This is not a standard byte order and you will need to implement it carefully.
It is also necessary to get all endian issues resolved correctly within the algorithm itself - they can often be a source of problems when implementing cryptographic algorithms.
RFC 1321 (http://tools.ietf.org/html/rfc1321) also includes a canonical implementation of MD5 in C, which might be helpful to read. It also contains a few more Test Vectors than the Wikipedia article.
rossum
Christoph Burschka - 14 May 2007 22:37 GMT > Note the endianness of the 64 bit message length field, which is not > clearly explained in the Wikipedia article - two big-endian 32 bit [quoted text clipped - 10 lines] > > rossum Whoops. I didn't get the little-endian thing. So far I just padded my int[] array to a length of 14 mod 16, then added the length value (long) as two int words in normal big-endian order.
So what I should do is just swap the order of the two ints that I split the long value into, leaving the ints themselves unchanged.
Mh. But that can't be my only mistake. Consider the zero-length example - "" will always be padded to 0x8000[...]00, and the unpadded length, 0, is the same regardless of endianness. Still, this message gives me the wrong result.
Not even mentioning that I'm somewhat skeptic of the possibility that the order of two 32bit words at the end of the message could so badly break the hashing function as to give it the same hash for two random short messages... but I'll fix this and see what happens.
--cb
Christoph Burschka - 14 May 2007 23:25 GMT >> Note the endianness of the 64 bit message length field, which is not >> clearly explained in the Wikipedia article - two big-endian 32 bit [quoted text clipped - 28 lines] > > --cb Whoops again. There's a lot more wrong with my padding than I thought. I appear to be adding the message length some 8 words before the actual end, and I don't even know how I managed to do that.
Also, the initialization of my sine table is messed up - casting from double to int will apparently just set too-high values to INT_MAX_VALUE. But if I cast the sine to int before multiplying it by 2^32, well... it just ends up at 0, of course. I'll need to do something else here...
Aha! Casting first to long to get a whole number, then to int, appears to give me something closer to what I want - the long->int conversion wraps instead of setting to INT_MAX_VALUE. Whether the result is also the correct one remains to be seen...
--cb
Christoph Burschka - 15 May 2007 10:02 GMT Mh... several more problems fixed. For example, I missed another Endianness reversal: The words of the message itself are processed with the lowest byte first.
My fix doesn't look too pretty, but I think it's the fastest it gets without heavily optimizing:
private static int littleEndian(int in) { int out=0; if (in<0) { in-=Integer.MIN_VALUE; out+=128; } for (int i=0;i<4;i++) { out+=in%256*Math.pow(256,3-i); in/=256; } return out; }
Also, I missed a crucial part of the algorithm:
b := ((a + f + k[i] + w[g]) leftrotate r[i]) + b
And finally, it appears that the bitwise shift operator isn't doing what I thought it would - no rotation; bits to the left are removed, 0s are added to the right. Too bad - I'd hoped I could use the built-in << operator; now I need a new function for it. (i=i*2+(i<0?1:0)) is reasonably elegant for this, however.
Oh, and the final problem: I forgot to reverse the Endianness of the hash result after calculation.
With all this, I am now finally getting the correct result for the zero-length message (d41d8cd98f00b204e9800998ecf8427e). I'm still getting incorrect results for actual messages, but they have nothing in common and I can be reasonably sure that's because of incorrect padding or character encoding, not the algorithm.
-- cb
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|