Java Forum / General / January 2006
Slow parsing of byte array to String
Eric Razny - 24 Jan 2006 02:33 GMT Hi. I've got a problem parsing pre-saved datas (bytes storing UTF-16BE string) as String.
A part of the code is as :
byte buf[]=new byte[L_LENGTH]; //[snap] ... buf is filled with raw datas, some parts are UTF-16BE strings String S=new String(buf, 20, 54, "UTF-16BE");
Of course it works well but it's painfully slow and usenet and web search was unsuccessfull.
Does someone has a trick to do the job faster?
Thanks,
Eric
John C. Bollinger - 24 Jan 2006 04:33 GMT > I've got a problem parsing pre-saved datas (bytes storing UTF-16BE string) > as String. [quoted text clipped - 9 lines] > > Does someone has a trick to do the job faster? It depends on how you define "the job". I have trouble believing that one invocation of the constructor you are using takes a human-discernible amount of time, so you must be doing many such decodings. That being the case, it is unclear how you are certain that your problem is in that constructor, as opposed to anywhere else in the data processing procedure.
Have you profiled the application? It is not worth your time or mine to speculate on how to speed it up without knowing precisely where all the cycles are going. If you *have* profiled it then we can guide you better based on the full results, whether or not it turns out to be the String constructor that is eating all the time.
 Signature John Bollinger jobollin@indiana.edu
RiCaRdO - 24 Jan 2006 06:09 GMT Are you using the + operator on Strings? Lots of string concatenation causes the Java garbage collector to work over time. try using a StringBuffer instead?
> > I've got a problem parsing pre-saved datas (bytes storing UTF-16BE string) > > as String. [quoted text clipped - 22 lines] > better based on the full results, whether or not it turns out to be the > String constructor that is eating all the time. Chris Uppal - 24 Jan 2006 09:30 GMT > It depends on how you define "the job". I have trouble believing that > one invocation of the constructor you are using takes a > human-discernible amount of time, [...] Especially since the conversion from UTF16 (BE or LE) is as near trivial as charset decoding can possibly get.
I second John's request for more information. If you don't have profiling data then you must have some other timing information. What is it ?
-- chris
Eric Razny - 24 Jan 2006 12:28 GMT Le Tue, 24 Jan 2006 09:30:27 +0000, Chris Uppal a écrit :
>> It depends on how you define "the job". I have trouble believing that >> one invocation of the constructor you are using takes a [quoted text clipped - 5 lines] > I second John's request for more information. If you don't have profiling > data then you must have some other timing information. What is it ? John, Chris, thanks for your replies.
I have parsed the application the more basically I can do : difference between getTimeInMillis() from end to begin of prog.
I stripped out all the real stuff to keep only the parts I suspected to be the bottleneck of the application. The remaining code is :
/////// CODE //////
import java.io.FileInputStream; import java.nio.MappedByteBuffer; import java.nio.ByteBuffer; import java.nio.channels.FileChannel; import java.util.GregorianCalendar;
public class Test{ public static void main(String args[]){
int sum = 0; String fileName="/tmp/test.bin"; //file size is 22MB : 22645840 GregorianCalendar start=new GregorianCalendar(); try {
FileInputStream fs = new FileInputStream(fileName); FileChannel fchan = null; fchan = fs.getChannel(); int sz = (int)fchan.size(); MappedByteBuffer mbuf = null; mbuf = fchan.map(FileChannel.MapMode.READ_ONLY, 0, sz); byte tpb = 0; ByteBuffer ibuf = mbuf.asReadOnlyBuffer(); byte tmpBuf[]=new byte[212]; while (ibuf.hasRemaining()) { ibuf.get(tmpBuf); String Part1=new String(tmpBuf, 20, 54, "UTF-16BE"); // comment this line out for profiling String Part2=new String(tmpBuf, 74, 132, "UTF-16BE"); // comment this line out for profiling }
}catch(Exception e){e.printStackTrace();} System.out.println("Duration in ms : "+(new GregorianCalendar().getTimeInMillis()-start.getTimeInMillis())); } } /////// END OF CODE //////
Platform : java version "1.5.0_01" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_01-b08) Java HotSpot(TM) Client VM (build 1.5.0_01-b08, mixed mode, sharing) Linux 2.6, debian sarge, via processor & chipset.
If i launch the prog the result is : Duration in ms : 4618 (no significant differences between tests)
Then, when i comment out the two lines flagged above the result is : Duration in ms : 383 (no significant differences between tests)
The 106820 loops are processed faster on... faster computers :) but I must keep this platform. Since the 4+ seconds must be added to uncompressible other ones and a human is waiting for a result i'll appreciate any help to reduce this bottleneck.
Eric
PS : sorry for the 3 lines of more than 80 chars.
Chris Uppal - 24 Jan 2006 15:24 GMT > If i launch the prog the result is : > Duration in ms : 4618 (no significant differences between tests) > > Then, when i comment out the two lines flagged above the result is : > Duration in ms : 383 (no significant differences between tests) That's odd. I've just created a 22,645,840 byte test file which is just the string 0123456789 repeated enough times (as UTF16BE) to fill the file (since the conversion from UTF16BE is trivial it shouldn't matter what the actual characters are).
With that, and using JDK 1.5.0, the test runs on this 1.5 GHz Win XP machine in about 1.1 seconds. If I comment out the string creation lines then it runs in about 0.46 seconds. (Or, if I run under JDK 1.4.2, in about 0.2 seconds)
Trying again using JDK 1.4.2 on a 2.4 GHz SuSE 8 box takes around 1.3 and 0.2 seconds respectively. Pretty much the same as 1..4.2 on the Windows box.
The performance I'm seeing seems quite reasonable to me. I have no idea at all why the decoding is taking so much longer on your machine.
If the choice of character encoding cannot change, and if the data is under your control (so you know it is always valid without checking) then you /might/ be able to avoid the strange bottleneck by hardcoding your own UTF16-BE decoder. Something like the following:
// hardwired UTF16-BE decoding // Note: // This DOES NOT CHECK for invalid UTF16 sequences // and so may allow security breaches. // This does not check that byteCount makes sense. // This HAS NOT BEEN TESTED on non-ASCII input. private String decodeUTF16BE(byte[] bytes, int startByte, int byteCount) { int charCount = byteCount / 2; StringBuilder builder = new StringBuilder(charCount); for (int i = 0; i < charCount; i++) { byte low = bytes[startByte++]; byte high = bytes[startByte++]; int ch = ((low & 0xFF) << 8 ) | (high & 0xFF); builder.append((char)ch); }
return builder.toString(); }
For me, that's about 2x faster than going through the system-supplied code (in part, I assume, because of the missing checks, and possibly also because I've got it wrong).
Oh, BTW, you can get the current time in milliseconds straight from System.currentTimeMillis() without having to mess around with GregorianCalendars. You can also (since 1.5) get a finer-grained timer with System.nanoTime()
-- chris
Eric Razny - 24 Jan 2006 17:33 GMT Le Tue, 24 Jan 2006 15:25:15 +0000, Chris Uppal a écrit :
[Snap Speed is better on your box than on mine]
The box has a Via chipset at 600 Mhz :) Yours is simply faster, but the ratio still sounds bad to me.
> If the choice of character encoding cannot change, and if the data is > under your control (so you know it is always valid without checking) That's the case. I guess the checking burns out most of the time.
> then > you /might/ be able to avoid the strange bottleneck by hardcoding your own [quoted text clipped - 22 lines] > (in part, I assume, because of the missing checks, and possibly also > because I've got it wrong). Argh! This is the kind of thing I'm searching for... but it must still be usable on 1.4.2 JVM :(
Many thanks anyway.
> Oh, BTW, you can get the current time in milliseconds straight from > System.currentTimeMillis() without having to mess around with > GregorianCalendars. You can also (since 1.5) get a finer-grained timer > with System.nanoTime() Yep, still a sequel of an old, old test with an automatic paste from my editor :)
Eric.
Chris Uppal - 24 Jan 2006 19:38 GMT > Argh! This is the kind of thing I'm searching for... but it must still be > usable on 1.4.2 JVM :( Well, you could just change StringBuilder to StringBuffer, but -- at least for my quick test -- the synchronised append() method is slow enough to loose most of the gains from do-it-yourself UTF16 decoding.
Probably the easiest thing, at least to try, is to build up the data in a char[] array instead:
... int charCount = byteCount / 2; char[] buffer = new char[charCount]; for (int i = 0; i < charCount; i++) { byte low = bytes[startByte++]; byte high = bytes[startByte++]; int ch = ((low & 0xFF) << 8 ) | (high & 0xFF); buffer[i] = (char)ch; }
return new String(buffer);
Somewhat to my surprise, that works (on my machine) rather faster than the version which uses a StringBuilder -- despite the extra array allocation and copy. I also tried using a statically allocated buffer, to avoid the extra allocation (but not the extra copy), but there didn't seem to be any performance edge over the simple version (and mutable static variables are not a good idea anyway, unless you are willing to synchronise all access to them).
I don't know whether that depends (for it's performance) on optimisations that only a 1.5 JVM performs, but it'll work on 1.4 and is probably worth a quick test. BTW, are you using the -server flag ?
-- chris
Eric Razny - 24 Jan 2006 21:49 GMT Le Tue, 24 Jan 2006 19:38:08 +0000, Chris Uppal a écrit :
> Well, you could just change StringBuilder to StringBuffer, but -- at least > for my quick test -- the synchronised append() method is slow enough to > loose most of the gains from do-it-yourself UTF16 decoding. I lost more than that in my case! The resulting program was slower than the original one.
> Probably the easiest thing, at least to try, is to build up the data in a > char[] array instead: [quoted text clipped - 10 lines] > } > return new String(buffer); Great! Wonderfull! Marvel.. ahem, ok, it works! :)
Total time is now 1395, not anymore 4618. If we substract the fixed 383 ms (Yes, i know, there's a 100ms resolution timer) this give : 1012 vs 4235. A 4x gain.
> Somewhat to my surprise, that works (on my machine) rather faster than the > version which uses a StringBuilder -- despite the extra array allocation > and copy. I also tried using a statically allocated buffer, to avoid the > extra allocation (but not the extra copy), but there didn't seem to be any > performance edge over the simple version Tried the same thing on my computer with same result, little if any improvment. I didn't try side effects with "uncommon" UTF-16 chars but it's perfect for me as I know what can be stored in the file.
> (and mutable static variables are > not a good idea anyway, unless you are willing to synchronise all access > to them). And loose even more time? :)
> I don't know whether that depends (for it's performance) on optimisations > that only a 1.5 JVM performs, but it'll work on 1.4 and is probably worth > a quick test. BTW, are you using the -server flag ? For some distribution reasons no. But for this short code the -server flag results in a 100ms penality! It seems I need to "unlearn" some obvious things about java :)
Many Thanks for you help.
Eric
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|