Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / February 2006

Tip: Looking for answers? Try searching our database.

Why my java code is THAT slow compared too C++?

Thread view: 
Kevin - 11 Feb 2006 06:50 GMT
C++: 94 seconds.
my java: 227 seconds.
Time including all (CPU, IO, etc).

Task: read in data line by line (required), and find the word frequency
count from a large file (1.2G).
A c++ code (I do not have its source), it takes total 94 seconds, using
constant 600k memory as reported from Win TaskManager.

My code (I comment out some lines and run the code several times to get
these statistics):

1) Scan the data file line by line and convert each line to a customed
"raw string" class, total use 60 seconds. I use InputStream and read
byte[] of 32k each time (the data file is plain ascii file). And do my
own "NextLine()" function from the byte[]. In the follow code, this is
done by comment out lines from Line A to Line C.

2) on top of 1), adding splitting each line into words, use another 40
seconds. (total 100 seconds). In the follow code, this is done by
comment out lines from Line B to Line C.

3) on top of 2), adding a hashtable (hashmap similar result) frequency
count, use another 117 seconds (so total is 227 seconds).

Reading 1.2G plain text file line by line using 60 seconds seems not
that bad.

Splitting each line (byte[]) into words, which are many user-defined
"raw string" class, using 40 seconds, not that bad.

However, hash counting those words needs another 127 seconds,
hum..........

So:
Number 1 issue: counting the words using hashtable takes soooo long
time. What is the problem?
Number 2 issue: the split() function to split a byte[] seems not good
enough.

I list the codes below. Suggestions greatly appreciate. :-)

Have a nice weekend!

///================================================================
Several major classes are:

My own raw string, which is basically a byte[] array.

public class KDRawString
{
 private
     byte[] Data = null;
 public int hashCode()
 {
   int c = 1;
   int v = 31;
   for (int i=0; i<Data.length; i++)
   {
     c = c + Data[i]*v;
     v = v*31;
   };
   return c;
 }
 public boolean equals(Object o)
 {
   if (o instanceof KDRawString)
   {
     KDRawString K = (KDRawString) o;
     if (K.Data.length != this.Data.length)
       return false;
     for (int i=0; i<this.Data.length; i++)
     {
       if (this.Data[i] != K.Data[i])
         return false;
     };
     return true;
   }
   else
     return false;
 }
 public KDRawString(final byte[] data)
 {
   Data = Utility.Duplicate(data);
  // Utility.Duplicate is just like System.arraycopy.
 }
 public KDRawString(byte[] data, final boolean ReUse)
 {
   if (ReUse) // sometimes, we do not need to create new array, if we
can use the input one.
     Data = data;
   else
     Data = Utility.Duplicate(data);
    // Utility.Duplicate is just like System.arraycopy.
 }
 //////////
 public KDRawString[] split(final char c)
 {
   // basically just call the static split, and convert each byte[]
into KDRawString.
   byte[][] b = split(Data, c);
   KDRawString[] R = new KDRawString[b.length];
   for (int i=0; i<b.length; i++)
   {
     R[i] = new KDRawString(b[i], true);
     // we can reuse the byte[] here since they are newly created and
not used any where else. This saved some time in my test run.
   };
   return R;
 }
 //
 public static byte[][] split(final byte[] data, final char c)
 {
   // split it by char c.
   // first, decide how many segment we need.
   int count = 0;
   for (int i=0; i<data.length; i++)
     if (data[i] == c)
       count ++;
   //
   byte[][] result = new byte[count+1][];
   // then split it by char c.
   int resultIndex = 0;
   int prePos = 0;
   for (int i=0; i<data.length; i++)
   {
     if (data[i] == c)
     {
       byte[] b = Utility.Duplicate(data, prePos, i-prePos);
       prePos = i+1;
       result[resultIndex] = b;
       resultIndex ++;
     };
   };
   byte[] b = Utility.Duplicate(data, prePos, data.length-prePos);
   result[resultIndex] = b;
   //
   return result;
 }
 ////////
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Another class is KDInt, which is just a int, used as counter (so I do
not need to new many Integer in the counting process).
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
for the main function, it is:

   Hashtable ht = new Hashtable(10000); // total number of words is
about 4000.
   // Using HashMap can save 1-2 seconds, as I have tested. So not a
major problem.
   KDASCIITextFileReader kr = new KDASCIITextFileReader(fileName);
   // This is a class to read in plain text file line by line as
byte[].
   byte[] b = kr.NextLine_AsBytes();
   int lineCount = 0;
   while (b != null)
   {
     lineCount++;
     //
     KDRawString ks = new KDRawString(b);
     //< -------------------------------------------------  Line A
     KDRawString[] K = ks.split('\t');
     //< -------------------------------------------------  Line B
     for (int i=1; i<K.length; i++)
     {
       KDInt c = (KDInt) ht.get(K[i]);
       if (c == null)
       {
         c = new KDInt(0);
         ht.put(K[i], c);
       };
       c.IncreaseValue();
     };
     //< -------------------------------------------------  Line C
     b = kr.NextLine_AsBytes();
   };

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// KDASCIITextFileReader is a class to read in text file line by line
into byte[].
public class KDASCIITextFileReader
{
 private
     boolean hasError = false;
 private
     InputStream is = null;
 private
     byte[] DataBuffer = null;
 private
     int DataBufferReadCount = -1; // bytes in the DataBuffer,
 private
     int DataBufferPointer = -1; // file seek pointer.
 private
     int initialBufferSize = 32768; // 32K.
 public byte[] NextLine_AsBytes()
 {
   byte[] b = NextLine_PossibleBlank_AsBytes();
   while ((b != null) && (b.length ==0) )
     b = NextLine_PossibleBlank_AsBytes();
   //
   return b;
 }

 public byte[] NextLine_PossibleBlank_AsBytes()
 {
   // break at both \r and \n
   // UNIX is \n, Win is \r\n, Apple is \r
   //
   if (is == null)
     return null;
   //
   byte[] R = null;  // bytes to return.
   while (true)
   {
     int seekPos = DataBufferPointer;
     while (seekPos < DataBufferReadCount)
     {
       if ( (DataBuffer[seekPos] == '\r') || (DataBuffer[seekPos] ==
'\n')  )
       {
         byte[] b = Utility.Append(R, DataBuffer, DataBufferPointer,
seekPos-DataBufferPointer);
         DataBufferPointer = seekPos+1;
         return b;
       };
       seekPos++;
     };
     // so need to refill. add current to R before that.
     R = Utility.Append(R, DataBuffer, DataBufferPointer,
DataBufferReadCount-DataBufferPointer);
     fillBuffer();
     if (DataBufferReadCount <0)
     {
       // end of file.
       close();
       return R;
     };
   }
 }
 public KDASCIITextFileReader(final String FileName)
 {
   initFile(FileName);
 }
 ////////////////////
 private void initFile(final String FileName)
 {
   hasError = false;
   try
   {
     File f = new File(FileName);
     FileInputStream fis = new FileInputStream(f);
     //
     String fnUpper = FileName.toLowerCase();
     is = fis;
     DataBuffer = new byte[initialBufferSize];
     //
     fillBuffer();
   }
   catch (Exception e)
   {
     e.printStackTrace();
     setError(e.toString());
   };
 }
 ////////////////////
 private void fillBuffer()
 {
   try
   {
     DataBufferReadCount = is.read(DataBuffer);
     DataBufferPointer = 0;
   }
   catch (Exception e)
   {
     e.printStackTrace();
     setError(e.toString());
   }
 }
 public void close()
 {
   try
   {
     if (is != null)
       is.close();
     is = null;
     DataBufferReadCount = -1;
     DataBufferPointer = -1;
   }
   catch (Exception e)
   {
     e.printStackTrace();
     setError(e.toString());
   };
 }
}
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// End of post.
Richard Wheeldon - 11 Feb 2006 11:17 GMT
> C++: 94 seconds.
> my java: 227 seconds.
> Time including all (CPU, IO, etc).

A few points:

1. Java is slow. Get used to it. You can save so much in
development time compared to C that the performance
difference pales by comparison.

2. Get a profiler. http://mindprod.com/jgloss/profiler.html
This will answer your specific issues here.

3. By convention Java class names start with Capital letters - LikeThis
methods start with small ones - likeThis().

4. Class and methods comments are best done as Javadoc comments.

Richard
Jeffrey Schwab - 11 Feb 2006 12:43 GMT
>>C++: 94 seconds.
>>my java: 227 seconds.
[quoted text clipped - 3 lines]
>
> 1. Java is slow. Get used to it.

That depends on the application.  For something as small as the OP's
example (which you snipped), Java's startup time is significant.  For a
long-running, I/O bound, multi-tiered application, I think the
performance difference would dwindle.

> You can save so much in
> development time compared to C that the performance
> difference pales by comparison.

Who said anything about C?  The OP is using C++.  These are closely
related but very different languages.

One of my pet peeves is the way people still compare languages to C to
show how much better they are for high-level development.  C is not the
standard for developer productivity, and has not been for quite some
time.  On the other hand, try integrating Java into a 1-MB PC BIOS
written mostly in assembly language.

> 2. Get a profiler. http://mindprod.com/jgloss/profiler.html
> This will answer your specific issues here.
[quoted text clipped - 3 lines]
>
> 4. Class and methods comments are best done as Javadoc comments.

Agree with all three of these points. :)
Roedy Green - 11 Feb 2006 18:20 GMT
On Sat, 11 Feb 2006 11:17:50 +0000, Richard Wheeldon
<richard@rswheeldon.com> wrote, quoted or indirectly quoted someone
who said :

>1. Java is slow. Get used to it. You can save so much in
>development time compared to C that the performance
>difference pales by comparison.

People who code in Java as if it were C write slow code.  People who
insist on using and talking about 10 year old JVMs experience slow
code. People who write code with their eyes glued shut as to what is
quick and what is not will write slow code, at least until they take
time to optimise it. Try the latest JVMs and AOTs.   Learn to do I/O
in big chunks or a file at a time efficiently.  You will be surprised
how snappy Java can be.

See http://mindprod.comjgloss/jdk.html
http://mindprod.com/jgloss/aot.html

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

tom fredriksen - 11 Feb 2006 13:24 GMT
Hi, here are some comments I have:

You can not time an operation by just commenting out a step and then
rerun the program. A lot of the time taken is from other/external
effects which are not relevant to the step. What you have to do is read
the processors current millisecond counter at the start of the step you
want to test, System.getTimeMillis() should do it. Then you read the
counter again at the end of the step. Calculate the time difference and
print it out, then you have the real time of the step.

Also be ware of using operating system tools for a quick reading on the
memory usage etc. of a program, the interpretation of the numbers are
usually wrong. In linux for example, the memory number given in the ps
command represents the total of the program including all shared
libraries. That is not the number you are looking for, but rather the
size of the *your* programs allocated memory, not necessarily  all the
memory used by all libraries aswell.

As suggested by others, use a profiling tool before you start doing any
optimisations. Time and again, it has been shown that people spend time
optimising the wrong part of the program.

Further on, if you think the hashmap is the major problem, use a
different map library than what is provided in the jdk. I can suggest
several: fastutil, trove, javolution (its up to 5 times faster than the
jdk version)

here is a link to a list of different 3rd party library implementations
(open source) which you might want to use.

http://www.manageability.org/blog/stuff/open-source-java-data-structures-and-alg
orithms/view


PS.
I would also recommend trying to make your code a little more readable
for others. It does not contain comments (javadoc style), the java style
is not used and I found the code a bit difficult in it self to read.
Perhaps because there is no empty lines in between methods and blocks, I
generally use two empty lines for that.

yours sincerely

tom

> C++: 94 seconds.
> my java: 227 seconds.
[quoted text clipped - 295 lines]
> ///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> // End of post.
Chris Uppal - 11 Feb 2006 15:27 GMT
> You can not time an operation by just commenting out a step and then
> rerun the program. A lot of the time taken is from other/external
> effects which are not relevant to the step.

And Jeffrey Schwab wrote:
> [...]  For something as small as the OP's
> example (which you snipped), Java's startup time is significant.  For a
> long-running, I/O bound, multi-tiered application, I think the
> performance difference would dwindle.

Your points are valid in general, but you both seem to have missed the
/numbers/ from Kevin's post:

Kevin wrote:
> C++: 94 seconds.
> my java: 227 seconds.

When we are talking about numbers like that -- specifically the /very large/
numbers (for both versions of the program) -- then the JVM startup time is
essentially irrelevant, and the "noise" in the figures from OS effects will be
negligible.

And given that the C++ version took 94 seconds, we must assume that it had to
process a great deal of data (like Kevin's other examples).  In which case it
would be almost completely disk-bound -- I/O bandwidth would dominate the
execution time.  Under the circumstances, there is no excuse for a Java program
not to be similarly limited by I/O time.  Fortunately we don't need to scrabble
around for an explanation for the embarrassing observation that it is /not/ so
limited, since Kevin has already (partially) explained it ;-)

   -- chris
Kevin - 11 Feb 2006 17:11 GMT
#######################  problem solved #####################
#######################  problem solved #####################
#######################  problem solved #####################
#######################  problem solved #####################
Thanks all for the reply. I posted another topic and explained it. I
brief it again here.

It is not my java code's prolem.
I found the detailed guide for the c++ program.
It turned out that the data file is in a format that favor that c++
code: it encodes the hash value of each word into the file already
(which I previously thought was just some part of the data).
So the c code does not need to do the hash step, only the count step.
After I took that special data format into my code, my java code runs
as fast as the c++ code.
Thank you all for the reply and advice!~ I really learned a lot from
this group. Thank you.

Below, I briefly state how I do it in case someone will find useful.
For fast file (plain ascii in my case) IO:
1) read in the file using InputStream, each time ready in 32K data into
a byte[] buffer.
2) write your own "readLine()" method, which scan in the byte[] buffer,
and return a new byte[] as a line.
3) write your own "split(char c)" method, which break one byte[] into
many byte[].

#######################  problem solved #####################
#######################  problem solved #####################
#######################  problem solved #####################
Thomas Hawtin - 11 Feb 2006 20:13 GMT
> It is not my java code's prolem.

I disagree.

> I found the detailed guide for the c++ program.
> It turned out that the data file is in a format that favor that c++
> code: it encodes the hash value of each word into the file already
> (which I previously thought was just some part of the data).

My initial reaction to reading that was: "Oh my god, you don't want to
do that."

This really should be a I/O bound problem. Adding extra data is not the
way to may it faster.

> So the c code does not need to do the hash step, only the count step.

That is not necessarily why the C code is faster.

> After I took that special data format into my code, my java code runs
> as fast as the c++ code.

You are changing more than you think you are.

I didn't look through your code first time around because it had been
made unnecessarily difficult to read.

Looking at it again my first thought was that you were repeatedly
calculating the hash code every time. When you changed the program to
read the hash code from the file, you presumably stopped doing that.
Even so, the method should not be executed frequently.

Looking at the hash algorithm, it doesn't look conventional. Effectively
what you have is:

    Sum s[i] * ((2^5 - 1) ^ i+1)

The binary representation of powers of one less than a power of two are
not as "random" as you might think, particularly in the last few binary
digits. For a hash map/table this may result in poor distribution of
entries and hence very poor performance.

The exact impact will depend upon the implementation of the hash map.
Using HashMap on Sun's 1.4.1 may well turn out to be much slower than in
1.4.0 or 1.4.2.

The importance of reading the hash code from the file was that it is
calculated with a better algorithm. That appears to have transformed the
I/O bound problem into a CPU bound solutions.

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

Roedy Green - 12 Feb 2006 05:03 GMT
On Sat, 11 Feb 2006 20:27:26 +0000, Thomas Hawtin
<usenet@tackline.plus.com> wrote, quoted or indirectly quoted someone
who said :

>Looking at the hash algorithm, it doesn't look conventional. Effectively
>what you have is:
>
>     Sum s[i] * ((2^5 - 1) ^ i+1)

see some fast hashCode algorithms at
http://mindprod.com/jgloss/hashcode.html
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Luc The Perverse - 12 Feb 2006 06:16 GMT
> That is not necessarily why the C code is faster.

My initial gut reaction, and his solution both seem to hint at the idea that
he was using an inefficient algorithm.

I have coded a word frequency calculator in C++ before, and used N^2
efficiency sorting algorithms and they were VERY slow.

So I disagree - he changed one thing and now they run in about the same
amount of time as opposed to the java running more than double. To me that
seems pretty conclusive.

Signature

LTP

:)
Thomas Hawtin - 12 Feb 2006 10:59 GMT
>> That is not necessarily why the C code is faster.
>
> My initial gut reaction, and his solution both seem to hint at the idea that
> he was using an inefficient algorithm.

It is vastly more important that the hash code produce a good answer
than that it be ultra quick.

With two multiplies, the hash code computation wasn't as cheap as it
might have been. Sun's JVM seems poor at replacing constant multiplies
with shifts and adds/subtracts, although that is less important with
faster multipliers. If you replace v = v*31; with v = (v<<5) - v; and
perhaps look at two together (v = (v<<10) - 2*(v<<5) + v;) you might
understand that the low order bits are not going to be fairly distributed.

> I have coded a word frequency calculator in C++ before, and used N^2
> efficiency sorting algorithms and they were VERY slow.

Doing n operation on a hash map take O(n) time, apparently. Screw your
hash code up, and it becomes O(n^2) and your caches burn for an extra
constant time slow down.

> So I disagree - he changed one thing and now they run in about the same
> amount of time as opposed to the java running more than double. To me that
> seems pretty conclusive.

How the hash code was calculation was performed changed (almost
certainly to a slower technique). Checking the implementation of
HashMap, it will only calculate the hash map once when inserting (or
retrieving) an entry. There is no way that the time taken to calculate
the hash code is going to make any impact on performance.

The original poster didn't know about the embedded hash code at the time
of writing the original code. It seems unlikely that the file format
happened to get the same hash value. So that was a separate change.
Given the likelihood of collisions with the first algorithm, that is a
very good candidate for causing the increased performance.

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

Luc The Perverse - 12 Feb 2006 11:18 GMT
>>> That is not necessarily why the C code is faster.
>>
[quoted text clipped - 33 lines]
> the likelihood of collisions with the first algorithm, that is a very good
> candidate for causing the increased performance.

I see your point - I must have misunderstood the first time round.

And I agree - I don't think there is any reason to hash at all.  Just be
careful so you don't make redundant copies of strings and you should be
fine.

I used a red black tree, but any balanced tree should be ok.  I searched for
entries, if I found them I incremented their count - if I didn't find them
then I incremented nothing.

--
LTP

:)
Chris Uppal - 12 Feb 2006 11:51 GMT
> The original poster didn't know about the embedded hash code at the time
> of writing the original code. It seems unlikely that the file format
> happened to get the same hash value. So that was a separate change.
> Given the likelihood of collisions with the first algorithm, that is a
> very good candidate for causing the increased performance.

There's also the marginal effect that the OP was probably interpreting the hash
code strings as words and adding them to the table too.  Not only would that
double the number of words in the table, the words themselves would have a much
reduced distribution (all decimal digits, or some such) which might further
exacerbate any problem caused by weak hashing.

   -- chris
tom fredriksen - 11 Feb 2006 17:28 GMT
> When we are talking about numbers like that -- specifically the /very large/
> numbers (for both versions of the program) -- then the JVM startup time is
[quoted text clipped - 5 lines]
> would be almost completely disk-bound -- I/O bandwidth would dominate the
> execution time.  

This is exactly my point, plus, that if the test is performed several
times and the numbers are simply added together the then IO time is
multiplied by as many times as the numbers are added, which suddenly
makes the result exponentially larger and therefore horrendously wrong.

Measuring the time to read the entire file is one thing, measuring that
plus other steps as an accumulation of each other is a completely
different matter and would produce the wrong result.
What I said earlier still stands if one wants correct results of
performance measurements.

/tom
Roedy Green - 11 Feb 2006 18:23 GMT
On Sat, 11 Feb 2006 15:27:56 -0000, "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>And given that the C++ version took 94 seconds,
most of these times you are comparing an single person writing both
programs. Usually they are expert in one language and inept in the
other. You must compare implementations by competent people, not a
hack port of one language to another.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green - 11 Feb 2006 18:21 GMT
On Sat, 11 Feb 2006 14:24:05 +0100, tom fredriksen
<tom@spam-me-not.net> wrote, quoted or indirectly quoted someone who
said :

>. What you have to do is read
>the processors current millisecond counter

With 1.5 there is an ever more accurate nanosecond timer. See
http://mindprod.com/jgloss/time.html
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

tom fredriksen - 11 Feb 2006 22:33 GMT
> On Sat, 11 Feb 2006 14:24:05 +0100, tom fredriksen
> <tom@spam-me-not.net> wrote, quoted or indirectly quoted someone who
[quoted text clipped - 5 lines]
> With 1.5 there is an ever more accurate nanosecond timer. See
> http://mindprod.com/jgloss/time.html

I know. But not really the point, as you are dealing with an operation
that is measured in minutes and seconds.

/tom


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.