Java Forum / General / January 2006

Slow parsing of byte array to String

Eric Razny - 24 Jan 2006 02:33 GMT
Hi.
I've got a problem parsing pre-saved data (bytes storing UTF-16BE strings)
as String.

Part of the code is:

byte buf[]=new byte[L_LENGTH];
// [snip] ... buf is filled with raw data, some parts of which are UTF-16BE strings
String S=new String(buf, 20, 54, "UTF-16BE");

Of course it works, but it's painfully slow, and my Usenet and web searches
were unsuccessful.

Does anyone have a trick to do the job faster?

Thanks,

Eric
John C. Bollinger - 24 Jan 2006 04:33 GMT
> I've got a problem parsing pre-saved data (bytes storing UTF-16BE strings)
> as String.
[quoted text clipped - 9 lines]
>
> Does anyone have a trick to do the job faster?

It depends on how you define "the job".  I have trouble believing that
one invocation of the constructor you are using takes a
human-discernible amount of time, so you must be doing many such
decodings.  That being the case, it is unclear how you are certain that
your problem is in that constructor, as opposed to anywhere else in the
data processing procedure.

Have you profiled the application?  It is not worth your time or mine to
speculate on how to speed it up without knowing precisely where all the
cycles are going.  If you *have* profiled it then we can guide you
better based on the full results, whether or not it turns out to be the
String constructor that is eating all the time.

John Bollinger
jobollin@indiana.edu

RiCaRdO - 24 Jan 2006 06:09 GMT
Are you using the + operator on Strings?
Lots of string concatenation makes the Java garbage collector work
overtime.
Try using a StringBuffer instead.
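
A minimal sketch of what this suggestion means in practice. The class and method names are hypothetical, and nothing here establishes that concatenation is the actual cause of the slowdown being discussed. StringBuilder needs Java 5 or later; on a 1.4 JVM, StringBuffer is the equivalent.

```java
final class ConcatSketch {
    // Repeated += allocates a new String on every iteration and copies
    // everything accumulated so far, which creates a lot of garbage.
    static String joinWithPlus(String[] parts) {
        String result = "";
        for (String part : parts) {
            result += part;
        }
        return result;
    }

    // A builder appends into a single growable buffer instead.
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"ab", "cd", "ef"};
        System.out.println(joinWithPlus(parts));
        System.out.println(joinWithBuilder(parts));
    }
}
```

Both methods return the same string; only the amount of intermediate allocation differs.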

> > I've got a problem parsing pre-saved data (bytes storing UTF-16BE strings)
> > as String.
[quoted text clipped - 22 lines]
> better based on the full results, whether or not it turns out to be the
> String constructor that is eating all the time.
Chris Uppal - 24 Jan 2006 09:30 GMT
> It depends on how you define "the job".  I have trouble believing that
> one invocation of the constructor you are using takes a
> human-discernible amount of time, [...]

Especially since the conversion from UTF16 (BE or LE) is as near trivial as
charset decoding can possibly get.

I second John's request for more information.  If you don't have profiling data
then you must have some other timing information.  What is it ?

   -- chris
Eric Razny - 24 Jan 2006 12:28 GMT
On Tue, 24 Jan 2006 09:30:27 +0000, Chris Uppal wrote:

>> It depends on how you define "the job".  I have trouble believing that
>> one invocation of the constructor you are using takes a
[quoted text clipped - 5 lines]
> I second John's request for more information.  If you don't have profiling
> data then you must have some other timing information.  What is it ?

John, Chris, thanks for your replies.

I have profiled the application in the most basic way I can: the difference
between getTimeInMillis() at the end and at the start of the program.

I stripped out all the real stuff to keep only the parts I suspected to
be the bottleneck of the application. The remaining code is :

/////// CODE //////

import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.GregorianCalendar;

public class Test {
    public static void main(String args[]) {
        String fileName = "/tmp/test.bin"; // file size is 22MB: 22645840 bytes
        GregorianCalendar start = new GregorianCalendar();
        try {
            FileInputStream fs = new FileInputStream(fileName);
            FileChannel fchan = fs.getChannel();
            int sz = (int) fchan.size();
            MappedByteBuffer mbuf =
                fchan.map(FileChannel.MapMode.READ_ONLY, 0, sz);

            ByteBuffer ibuf = mbuf.asReadOnlyBuffer();
            byte tmpBuf[] = new byte[212]; // one fixed-size record per iteration
            while (ibuf.hasRemaining()) {
                ibuf.get(tmpBuf);
                String Part1 = new String(tmpBuf, 20, 54, "UTF-16BE");  // comment this line out for profiling
                String Part2 = new String(tmpBuf, 74, 132, "UTF-16BE"); // comment this line out for profiling
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("Duration in ms : "
            + (new GregorianCalendar().getTimeInMillis() - start.getTimeInMillis()));
    }
}
/////// END OF CODE //////

Platform :
java version "1.5.0_01"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_01-b08)
Java HotSpot(TM) Client VM (build 1.5.0_01-b08, mixed mode, sharing)
Linux 2.6, debian sarge, via processor & chipset.

If I launch the program, the result is:
Duration in ms : 4618  (no significant differences between tests)

Then, when I comment out the two lines flagged above, the result is:
Duration in ms : 383 (no significant differences between tests)

The 106820 loops are processed faster on... faster computers :) but I must
keep this platform. Since these 4+ seconds come on top of other delays that
cannot be reduced, and a human is waiting for the result, I'd appreciate any
help reducing this bottleneck.

Eric

PS : sorry for the 3 lines of more than 80 chars.
Chris Uppal - 24 Jan 2006 15:24 GMT
> If I launch the program, the result is:
> Duration in ms : 4618  (no significant differences between tests)
>
> Then, when I comment out the two lines flagged above, the result is:
> Duration in ms : 383 (no significant differences between tests)

That's odd.  I've just created a 22,645,840 byte test file which is just the
string 0123456789 repeated enough times (as UTF16BE) to fill the file (since
the conversion from UTF16BE is trivial it shouldn't matter what the actual
characters are).
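
A test file like the one described could be produced along these lines. This is only a sketch: the class name, the helper repeatedUtf16Be, and the temp-file location are hypothetical, StandardCharsets assumes Java 7 or later, and the unit string must not be empty.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

final class MakeTestFile {
    // Returns totalBytes bytes consisting of `unit` encoded as UTF-16BE,
    // repeated as often as needed (the last repetition may be cut short).
    static byte[] repeatedUtf16Be(String unit, int totalBytes) {
        byte[] pattern = unit.getBytes(StandardCharsets.UTF_16BE);
        byte[] out = new byte[totalBytes];
        for (int i = 0; i < totalBytes; i++) {
            out[i] = pattern[i % pattern.length];
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location; the thread's test program reads /tmp/test.bin.
        File file = File.createTempFile("utf16be-test", ".bin");
        OutputStream os = new FileOutputStream(file);
        try {
            os.write(repeatedUtf16Be("0123456789", 22645840));
        } finally {
            os.close();
        }
        System.out.println(file + " : " + file.length() + " bytes");
    }
}
```

Ten characters encode to 20 bytes, and 22,645,840 is a multiple of both 20 and the 212-byte record size, so every record starts on a character boundary.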

With that, and using JDK 1.5.0, the test runs on this 1.5 GHz Win XP machine in
about 1.1 seconds.  If I comment out the string creation lines then it runs in
about 0.46 seconds.  (Or, if I run under JDK 1.4.2, in about 0.2 seconds)

Trying again using JDK 1.4.2 on a 2.4 GHz SuSE 8 box takes around 1.3 and 0.2
seconds respectively.  Pretty much the same as 1.4.2 on the Windows box.

The performance I'm seeing seems quite reasonable to me.  I have no idea at all
why the decoding is taking so much longer on your machine.

If the choice of character encoding cannot change, and if the data is under
your control (so you know it is always valid without checking) then you /might/
be able to avoid the strange bottleneck by hardcoding your own UTF16-BE
decoder.  Something like the following:

// hardwired UTF16-BE decoding
// Note:
// This DOES NOT CHECK for invalid UTF16 sequences
//    and so may allow security breaches.
//  This does not check that byteCount makes sense.
//  This HAS NOT BEEN TESTED on non-ASCII input.
private String
decodeUTF16BE(byte[] bytes, int startByte, int byteCount)
{
 int charCount = byteCount / 2;
 StringBuilder builder = new StringBuilder(charCount);
 for (int i = 0; i < charCount; i++)
 {
  byte high = bytes[startByte++]; // big-endian: the first byte is the high-order byte
  byte low = bytes[startByte++];
  int ch = ((high & 0xFF) << 8 ) | (low & 0xFF);
  builder.append((char)ch);
 }

 return builder.toString();
}

For me, that's about 2x faster than going through the system-supplied code (in
part, I assume, because of the missing checks, and possibly also because I've
got it wrong).
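
Since the decoder is flagged as untested on non-ASCII input, one way to check it is to compare its output with the built-in decoder on a few samples. The following is a sketch only: the class name, the helper matchesBuiltIn, and the sample strings are hypothetical, and StandardCharsets assumes Java 7 or later (on the 1.4 and 1.5 JVMs discussed here you would pass the charset name "UTF-16BE" instead).

```java
import java.nio.charset.StandardCharsets;

final class DecoderCheck {
    // Same unchecked approach as the decoder above: each pair of bytes
    // becomes one char, with the first byte as the high-order byte.
    static String decodeUtf16Be(byte[] bytes, int startByte, int byteCount) {
        int charCount = byteCount / 2;
        StringBuilder builder = new StringBuilder(charCount);
        for (int i = 0; i < charCount; i++) {
            int high = bytes[startByte++] & 0xFF;
            int low = bytes[startByte++] & 0xFF;
            builder.append((char) ((high << 8) | low));
        }
        return builder.toString();
    }

    // Encodes the sample, decodes it both ways, and compares the results.
    static boolean matchesBuiltIn(String sample) {
        byte[] encoded = sample.getBytes(StandardCharsets.UTF_16BE);
        String expected =
            new String(encoded, 0, encoded.length, StandardCharsets.UTF_16BE);
        return expected.equals(decodeUtf16Be(encoded, 0, encoded.length));
    }

    public static void main(String[] args) {
        // ASCII, Latin-1 plus the euro sign, and a surrogate pair.
        String[] samples = {"0123456789", "caf\u00e9 \u20ac", "\uD83D\uDE00"};
        for (String sample : samples) {
            System.out.println(matchesBuiltIn(sample));
        }
    }
}
```

This only covers well-formed input. The two decoders can still differ on malformed data such as unpaired surrogates, which the built-in decoder replaces and the hand-rolled one passes through.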

Oh, BTW, you can get the current time in milliseconds straight from
System.currentTimeMillis() without having to mess around with
GregorianCalendars.  You can also (since 1.5) get a finer-grained timer with
System.nanoTime()
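
A minimal sketch of timing with System.nanoTime(), assuming Java 5 or later. The class name, the helper elapsedMillis, and the dummy workload are hypothetical.

```java
final class TimingSketch {
    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long elapsedMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1000000L;
    }

    public static void main(String[] args) {
        long ms = elapsedMillis(new Runnable() {
            public void run() {
                long sum = 0;
                for (int i = 0; i < 1000000; i++) {
                    sum += i;
                }
                // Use the result so the loop is not trivially removable.
                if (sum < 0) {
                    System.out.println(sum);
                }
            }
        });
        System.out.println("Duration in ms : " + ms);
    }
}
```

nanoTime() is only meaningful as a difference between two readings; it is not related to the time of day.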

   -- chris
Eric Razny - 24 Jan 2006 17:33 GMT
On Tue, 24 Jan 2006 15:25:15 +0000, Chris Uppal wrote:

[Snip: speed is better on your box than on mine]

The box has a Via chipset at 600 MHz :)
Yours is simply faster, but the ratio still sounds bad to me.

> If the choice of character encoding cannot change, and if the data is
> under your control (so you know it is always valid without checking)

That's the case. I guess the checking burns most of the time.

> then
> you /might/ be able to avoid the strange bottleneck by hardcoding your own
[quoted text clipped - 22 lines]
> (in part, I assume, because of the missing checks, and possibly also
> because I've got it wrong).

Argh! This is the kind of thing I'm searching for... but it must still be
usable on a 1.4.2 JVM :(

Many thanks anyway.

> Oh, BTW, you can get the current time in milliseconds straight from
> System.currentTimeMillis() without having to mess around with
> GregorianCalendars.  You can also (since 1.5) get a finer-grained timer
> with System.nanoTime()

    Yep, that's a leftover from an old, old test, pasted in automatically
    from my editor :)

Eric.
Chris Uppal - 24 Jan 2006 19:38 GMT
> Argh! This is the kind of thing I'm searching for... but it must still be
> usable on a 1.4.2 JVM :(

Well, you could just change StringBuilder to StringBuffer, but -- at least for
my quick test -- the synchronised append() method is slow enough to lose most
of the gains from do-it-yourself UTF16 decoding.

Probably the easiest thing, at least to try, is to build up the data in a
char[] array instead:

 ...
 int charCount = byteCount / 2;
 char[] buffer = new char[charCount];
 for (int i = 0; i < charCount; i++)
 {
  byte high = bytes[startByte++]; // big-endian: the first byte is the high-order byte
  byte low = bytes[startByte++];
  int ch = ((high & 0xFF) << 8 ) | (low & 0xFF);
  buffer[i] = (char)ch;
 }

 return new String(buffer);

Somewhat to my surprise, that works (on my machine) rather faster than the
version which uses a StringBuilder -- despite the extra array allocation and
copy.  I also tried using a statically allocated buffer, to avoid the extra
allocation (but not the extra copy), but there didn't seem to be any
performance edge over the simple version (and mutable static variables are not
a good idea anyway, unless you are willing to synchronise all access to them).

I don't know whether that depends (for its performance) on optimisations that
only a 1.5 JVM performs, but it'll work on 1.4 and is probably worth a quick
test.  BTW, are you using the -server flag?
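
Another option that is not proposed in this thread, and whose speed has not been measured here, is to let java.nio do the byte pairing. ByteBuffer defaults to big-endian byte order, so a char view over the bytes reads UTF-16BE code units directly, again without any validation. The class and method names in this sketch are hypothetical; the calls used exist since Java 1.4.

```java
import java.nio.ByteBuffer;

final class CharViewSketch {
    // Interprets byteCount bytes starting at startByte as big-endian
    // 16-bit chars. No validation of surrogates is performed.
    static String decodeUtf16Be(byte[] bytes, int startByte, int byteCount) {
        return ByteBuffer.wrap(bytes, startByte, byteCount)
                         .asCharBuffer()
                         .toString();
    }

    public static void main(String[] args) {
        // 'H', 'i', and U+20AC (euro sign) encoded as UTF-16BE.
        byte[] record = {0, 'H', 0, 'i', 0x20, (byte) 0xAC};
        System.out.println(decodeUtf16Be(record, 0, 6));
        System.out.println(decodeUtf16Be(record, 2, 4));
    }
}
```

Whether this beats the explicit loop would need measuring on the target machine, since it allocates two buffer objects per call.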

   -- chris
Eric Razny - 24 Jan 2006 21:49 GMT
On Tue, 24 Jan 2006 19:38:08 +0000, Chris Uppal wrote:

> Well, you could just change StringBuilder to StringBuffer, but -- at least
> for my quick test -- the synchronised append() method is slow enough to
> lose most of the gains from do-it-yourself UTF16 decoding.

I lost more than that in my case! The resulting program was slower than
the original one.
   

> Probably the easiest thing, at least to try, is to build up the data in a
> char[] array instead:
[quoted text clipped - 10 lines]
>   }
>   return new String(buffer);

Great! Wonderful! Marvel.. ahem, ok, it works! :)

Total time is now 1395 ms, down from 4618 ms. If we subtract the fixed 383 ms
(yes, I know, there's a 100 ms resolution timer) this gives
1012 ms vs 4235 ms: roughly a 4x gain.
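
For reference, the record loop combined with the char[] decoder might look roughly like this. It is a sketch, not the actual program: the class name, the RECORD_SIZE constant, and the decodeAll helper are hypothetical, and an in-memory buffer stands in for the mapped file. The field offsets and lengths are the ones from the test program.

```java
import java.nio.ByteBuffer;

final class RecordLoopSketch {
    static final int RECORD_SIZE = 212;

    // Unchecked UTF-16BE decoding into a char[], as suggested in the thread.
    static String decodeUtf16Be(byte[] bytes, int startByte, int byteCount) {
        int charCount = byteCount / 2;
        char[] buffer = new char[charCount];
        for (int i = 0; i < charCount; i++) {
            int high = bytes[startByte++] & 0xFF;
            int low = bytes[startByte++] & 0xFF;
            buffer[i] = (char) ((high << 8) | low);
        }
        return new String(buffer);
    }

    // Walks the fixed-size records, decodes both string fields of each,
    // and returns the total number of chars decoded.
    static int decodeAll(ByteBuffer data) {
        byte[] record = new byte[RECORD_SIZE];
        int totalChars = 0;
        // Unlike hasRemaining(), this skips a trailing partial record
        // instead of throwing BufferUnderflowException.
        while (data.remaining() >= RECORD_SIZE) {
            data.get(record);
            String part1 = decodeUtf16Be(record, 20, 54);
            String part2 = decodeUtf16Be(record, 74, 132);
            totalChars += part1.length() + part2.length();
        }
        return totalChars;
    }

    public static void main(String[] args) {
        ByteBuffer data = ByteBuffer.allocate(3 * RECORD_SIZE);
        System.out.println(decodeAll(data));
    }
}
```

Each record yields 27 + 66 = 93 chars, so the returned total doubles as a quick sanity check on the number of records processed.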

> Somewhat to my surprise, that works (on my machine) rather faster than the
> version which uses a StringBuilder -- despite the extra array allocation
> and copy.  I also tried using a statically allocated buffer, to avoid the
> extra allocation (but not the extra copy), but there didn't seem to be any
> performance edge over the simple version

I tried the same thing on my computer with the same result: little if any
improvement. I didn't test for side effects with "uncommon" UTF-16 characters,
but it's perfect for me as I know what can be stored in the file.

> (and mutable static variables are
> not a good idea anyway, unless you are willing to synchronise all access
> to them).

And lose even more time? :)

> I don't know whether that depends (for its performance) on optimisations
> that only a 1.5 JVM performs, but it'll work on 1.4 and is probably worth
> a quick test.  BTW, are you using the -server flag?

No, for distribution reasons. But for this short code the -server
flag results in a 100 ms penalty! It seems I need to "unlearn" some
obvious things about Java :)

Many thanks for your help.

Eric


©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.