Java Forum / General / February 2006
Why my java code is THAT slow compared too C++?
Kevin - 11 Feb 2006 06:50 GMT C++: 94 seconds. my java: 227 seconds. Time including all (CPU, IO, etc).
Task: read in data line by line (required), and find the word frequency count from a large file (1.2G). A c++ code (I do not have its source), it takes total 94 seconds, using constant 600k memory as reported from Win TaskManager.
My code (I comment out some lines and run the code several times to get these statistics):
1) Scan the data file line by line and convert each line to a customed "raw string" class, total use 60 seconds. I use InputStream and read byte[] of 32k each time (the data file is plain ascii file). And do my own "NextLine()" function from the byte[]. In the follow code, this is done by comment out lines from Line A to Line C.
2) on top of 1), adding splitting each line into words, use another 40 seconds. (total 100 seconds). In the follow code, this is done by comment out lines from Line B to Line C.
3) on top of 2), adding a hashtable (hashmap similar result) frequency count, use another 117 seconds (so total is 227 seconds).
Reading 1.2G plain text file line by line using 60 seconds seems not that bad.
Splitting each line (byte[]) into words, which are many user-defined "raw string" class, using 40 seconds, not that bad.
However, hash counting those words needs another 127 seconds, hum..........
So: Number 1 issue: counting the words using hashtable takes soooo long time. What is the problem? Number 2 issue: the split() function to split a byte[] seems not good enough.
I list the codes below. Suggestions greatly appreciate. :-)
Have a nice weekend!
///================================================================ Several major classes are:
My own raw string, which is basically a byte[] array.
public class KDRawString { private byte[] Data = null; public int hashCode() { int c = 1; int v = 31; for (int i=0; i<Data.length; i++) { c = c + Data[i]*v; v = v*31; }; return c; } public boolean equals(Object o) { if (o instanceof KDRawString) { KDRawString K = (KDRawString) o; if (K.Data.length != this.Data.length) return false; for (int i=0; i<this.Data.length; i++) { if (this.Data[i] != K.Data[i]) return false; }; return true; } else return false; } public KDRawString(final byte[] data) { Data = Utility.Duplicate(data); // Utility.Duplicate is just like System.arraycopy. } public KDRawString(byte[] data, final boolean ReUse) { if (ReUse) // sometimes, we do not need to create new array, if we can use the input one. Data = data; else Data = Utility.Duplicate(data); // Utility.Duplicate is just like System.arraycopy. } ////////// public KDRawString[] split(final char c) { // basically just call the static split, and convert each byte[] into KDRawString. byte[][] b = split(Data, c); KDRawString[] R = new KDRawString[b.length]; for (int i=0; i<b.length; i++) { R[i] = new KDRawString(b[i], true); // we can reuse the byte[] here since they are newly created and not used any where else. This saved some time in my test run. }; return R; } // public static byte[][] split(final byte[] data, final char c) { // split it by char c. // first, decide how many segment we need. int count = 0; for (int i=0; i<data.length; i++) if (data[i] == c) count ++; // byte[][] result = new byte[count+1][]; // then split it by char c. int resultIndex = 0; int prePos = 0; for (int i=0; i<data.length; i++) { if (data[i] == c) { byte[] b = Utility.Duplicate(data, prePos, i-prePos); prePos = i+1; result[resultIndex] = b; resultIndex ++; }; }; byte[] b = Utility.Duplicate(data, prePos, data.length-prePos); result[resultIndex] = b; // return result; } //////// } //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// Another class is KDInt, which is just a int, used as counter (so I do not need to new many Integer in the counting process). //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// for the main function, it is:
Hashtable ht = new Hashtable(10000); // total number of words is about 4000. // Using HashMap can save 1-2 seconds, as I have tested. So not a major problem. KDASCIITextFileReader kr = new KDASCIITextFileReader(fileName); // This is a class to read in plain text file line by line as byte[]. byte[] b = kr.NextLine_AsBytes(); int lineCount = 0; while (b != null) { lineCount++; // KDRawString ks = new KDRawString(b); //< ------------------------------------------------- Line A KDRawString[] K = ks.split('\t'); //< ------------------------------------------------- Line B for (int i=1; i<K.length; i++) { KDInt c = (KDInt) ht.get(K[i]); if (c == null) { c = new KDInt(0); ht.put(K[i], c); }; c.IncreaseValue(); }; //< ------------------------------------------------- Line C b = kr.NextLine_AsBytes(); };
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // KDASCIITextFileReader is a class to read in text file line by line into byte[]. public class KDASCIITextFileReader { private boolean hasError = false; private InputStream is = null; private byte[] DataBuffer = null; private int DataBufferReadCount = -1; // bytes in the DataBuffer, private int DataBufferPointer = -1; // file seek pointer. private int initialBufferSize = 32768; // 32K. public byte[] NextLine_AsBytes() { byte[] b = NextLine_PossibleBlank_AsBytes(); while ((b != null) && (b.length ==0) ) b = NextLine_PossibleBlank_AsBytes(); // return b; }
public byte[] NextLine_PossibleBlank_AsBytes() { // break at both \r and \n // UNIX is \n, Win is \r\n, Apple is \r // if (is == null) return null; // byte[] R = null; // bytes to return. while (true) { int seekPos = DataBufferPointer; while (seekPos < DataBufferReadCount) { if ( (DataBuffer[seekPos] == '\r') || (DataBuffer[seekPos] == '\n') ) { byte[] b = Utility.Append(R, DataBuffer, DataBufferPointer, seekPos-DataBufferPointer); DataBufferPointer = seekPos+1; return b; }; seekPos++; }; // so need to refill. add current to R before that. R = Utility.Append(R, DataBuffer, DataBufferPointer, DataBufferReadCount-DataBufferPointer); fillBuffer(); if (DataBufferReadCount <0) { // end of file. close(); return R; }; } } public KDASCIITextFileReader(final String FileName) { initFile(FileName); } //////////////////// private void initFile(final String FileName) { hasError = false; try { File f = new File(FileName); FileInputStream fis = new FileInputStream(f); // String fnUpper = FileName.toLowerCase(); is = fis; DataBuffer = new byte[initialBufferSize]; // fillBuffer(); } catch (Exception e) { e.printStackTrace(); setError(e.toString()); }; } //////////////////// private void fillBuffer() { try { DataBufferReadCount = is.read(DataBuffer); DataBufferPointer = 0; } catch (Exception e) { e.printStackTrace(); setError(e.toString()); } } public void close() { try { if (is != null) is.close(); is = null; DataBufferReadCount = -1; DataBufferPointer = -1; } catch (Exception e) { e.printStackTrace(); setError(e.toString()); }; } } /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // End of post.
Richard Wheeldon - 11 Feb 2006 11:17 GMT > C++: 94 seconds. > my java: 227 seconds. > Time including all (CPU, IO, etc). A few points:
1. Java is slow. Get used to it. You can save so much in development time compared to C that the performance difference pales by comparison.
2. Get a profiler. http://mindprod.com/jgloss/profiler.html This will answer your specific issues here.
3. By convention Java class names start with Capital letters - LikeThis methods start with small ones - likeThis().
4. Class and methods comments are best done as Javadoc comments.
Richard
Jeffrey Schwab - 11 Feb 2006 12:43 GMT >>C++: 94 seconds. >>my java: 227 seconds. [quoted text clipped - 3 lines] > > 1. Java is slow. Get used to it. That depends on the application. For something as small as the OP's example (which you snipped), Java's startup time is significant. For a long-running, I/O bound, multi-tiered application, I think the performance difference would dwindle.
> You can save so much in > development time compared to C that the performance > difference pales by comparison. Who said anything about C? The OP is using C++. These are closely related but very different languages.
One of my pet peeves is the way people still compare languages to C to show how much better they are for high-level development. C is not the standard for developer productivity, and has not been for quite some time. On the other hand, try integrating Java into a 1-MB PC BIOS written mostly in assembly language.
> 2. Get a profiler. http://mindprod.com/jgloss/profiler.html > This will answer your specific issues here. [quoted text clipped - 3 lines] > > 4. Class and methods comments are best done as Javadoc comments. Agree with all three of these points. :)
Roedy Green - 11 Feb 2006 18:20 GMT On Sat, 11 Feb 2006 11:17:50 +0000, Richard Wheeldon <richard@rswheeldon.com> wrote, quoted or indirectly quoted someone who said :
>1. Java is slow. Get used to it. You can save so much in >development time compared to C that the performance >difference pales by comparison. People who code in Java as if it were C write slow code. People who insist on using and talking about 10 year old JVMs experience slow code. People who write code with their eyes glued shut as to what is quick and what is not will write slow code, at least until they take time to optimise it. Try the latest JVMs and AOTs. Learn to do I/O in big chunks or a file at a time efficiently. You will be surprised how snappy Java can be.
See http://mindprod.comjgloss/jdk.html http://mindprod.com/jgloss/aot.html
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
tom fredriksen - 11 Feb 2006 13:24 GMT Hi, here are some comments I have:
You can not time an operation by just commenting out a step and then rerun the program. A lot of the time taken is from other/external effects which are not relevant to the step. What you have to do is read the processors current millisecond counter at the start of the step you want to test, System.getTimeMillis() should do it. Then you read the counter again at the end of the step. Calculate the time difference and print it out, then you have the real time of the step.
Also be ware of using operating system tools for a quick reading on the memory usage etc. of a program, the interpretation of the numbers are usually wrong. In linux for example, the memory number given in the ps command represents the total of the program including all shared libraries. That is not the number you are looking for, but rather the size of the *your* programs allocated memory, not necessarily all the memory used by all libraries aswell.
As suggested by others, use a profiling tool before you start doing any optimisations. Time and again, it has been shown that people spend time optimising the wrong part of the program.
Further on, if you think the hashmap is the major problem, use a different map library than what is provided in the jdk. I can suggest several: fastutil, trove, javolution (its up to 5 times faster than the jdk version)
here is a link to a list of different 3rd party library implementations (open source) which you might want to use.
http://www.manageability.org/blog/stuff/open-source-java-data-structures-and-alg orithms/view
PS. I would also recommend trying to make your code a little more readable for others. It does not contain comments (javadoc style), the java style is not used and I found the code a bit difficult in it self to read. Perhaps because there is no empty lines in between methods and blocks, I generally use two empty lines for that.
yours sincerely
tom
> C++: 94 seconds. > my java: 227 seconds. [quoted text clipped - 295 lines] > /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// > // End of post. Chris Uppal - 11 Feb 2006 15:27 GMT > You can not time an operation by just commenting out a step and then > rerun the program. A lot of the time taken is from other/external > effects which are not relevant to the step. And Jeffrey Schwab wrote:
> [...] For something as small as the OP's > example (which you snipped), Java's startup time is significant. For a > long-running, I/O bound, multi-tiered application, I think the > performance difference would dwindle. Your points are valid in general, but you both seem to have missed the /numbers/ from Kevin's post:
Kevin wrote:
> C++: 94 seconds. > my java: 227 seconds. When we are talking about numbers like that -- specifically the /very large/ numbers (for both versions of the program) -- then the JVM startup time is essentially irrelevant, and the "noise" in the figures from OS effects will be negligible.
And given that the C++ version took 94 seconds, we must assume that it had to process a great deal of data (like Kevin's other examples). In which case it would be almost completely disk-bound -- I/O bandwidth would dominate the execution time. Under the circumstances, there is no excuse for a Java program not to be similarly limited by I/O time. Fortunately we don't need to scrabble around for an explanation for the embarrassing observation that it is /not/ so limited, since Kevin has already (partially) explained it ;-)
-- chris
Kevin - 11 Feb 2006 17:11 GMT ####################### problem solved ##################### ####################### problem solved ##################### ####################### problem solved ##################### ####################### problem solved ##################### Thanks all for the reply. I posted another topic and explained it. I brief it again here.
It is not my java code's prolem. I found the detailed guide for the c++ program. It turned out that the data file is in a format that favor that c++ code: it encodes the hash value of each word into the file already (which I previously thought was just some part of the data). So the c code does not need to do the hash step, only the count step. After I took that special data format into my code, my java code runs as fast as the c++ code. Thank you all for the reply and advice!~ I really learned a lot from this group. Thank you.
Below, I briefly state how I do it in case someone will find useful. For fast file (plain ascii in my case) IO: 1) read in the file using InputStream, each time ready in 32K data into a byte[] buffer. 2) write your own "readLine()" method, which scan in the byte[] buffer, and return a new byte[] as a line. 3) write your own "split(char c)" method, which break one byte[] into many byte[].
####################### problem solved ##################### ####################### problem solved ##################### ####################### problem solved #####################
Thomas Hawtin - 11 Feb 2006 20:13 GMT > It is not my java code's prolem. I disagree.
> I found the detailed guide for the c++ program. > It turned out that the data file is in a format that favor that c++ > code: it encodes the hash value of each word into the file already > (which I previously thought was just some part of the data). My initial reaction to reading that was: "Oh my god, you don't want to do that."
This really should be a I/O bound problem. Adding extra data is not the way to may it faster.
> So the c code does not need to do the hash step, only the count step. That is not necessarily why the C code is faster.
> After I took that special data format into my code, my java code runs > as fast as the c++ code. You are changing more than you think you are.
I didn't look through your code first time around because it had been made unnecessarily difficult to read.
Looking at it again my first thought was that you were repeatedly calculating the hash code every time. When you changed the program to read the hash code from the file, you presumably stopped doing that. Even so, the method should not be executed frequently.
Looking at the hash algorithm, it doesn't look conventional. Effectively what you have is:
Sum s[i] * ((2^5 - 1) ^ i+1)
The binary representation of powers of one less than a power of two are not as "random" as you might think, particularly in the last few binary digits. For a hash map/table this may result in poor distribution of entries and hence very poor performance.
The exact impact will depend upon the implementation of the hash map. Using HashMap on Sun's 1.4.1 may well turn out to be much slower than in 1.4.0 or 1.4.2.
The importance of reading the hash code from the file was that it is calculated with a better algorithm. That appears to have transformed the I/O bound problem into a CPU bound solutions.
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Roedy Green - 12 Feb 2006 05:03 GMT On Sat, 11 Feb 2006 20:27:26 +0000, Thomas Hawtin <usenet@tackline.plus.com> wrote, quoted or indirectly quoted someone who said :
>Looking at the hash algorithm, it doesn't look conventional. Effectively >what you have is: > > Sum s[i] * ((2^5 - 1) ^ i+1) see some fast hashCode algorithms at http://mindprod.com/jgloss/hashcode.html
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Luc The Perverse - 12 Feb 2006 06:16 GMT > That is not necessarily why the C code is faster. My initial gut reaction, and his solution both seem to hint at the idea that he was using an inefficient algorithm.
I have coded a word frequency calculator in C++ before, and used N^2 efficiency sorting algorithms and they were VERY slow.
So I disagree - he changed one thing and now they run in about the same amount of time as opposed to the java running more than double. To me that seems pretty conclusive.
 Signature LTP
:) Thomas Hawtin - 12 Feb 2006 10:59 GMT >> That is not necessarily why the C code is faster. > > My initial gut reaction, and his solution both seem to hint at the idea that > he was using an inefficient algorithm. It is vastly more important that the hash code produce a good answer than that it be ultra quick.
With two multiplies, the hash code computation wasn't as cheap as it might have been. Sun's JVM seems poor at replacing constant multiplies with shifts and adds/subtracts, although that is less important with faster multipliers. If you replace v = v*31; with v = (v<<5) - v; and perhaps look at two together (v = (v<<10) - 2*(v<<5) + v;) you might understand that the low order bits are not going to be fairly distributed.
> I have coded a word frequency calculator in C++ before, and used N^2 > efficiency sorting algorithms and they were VERY slow. Doing n operation on a hash map take O(n) time, apparently. Screw your hash code up, and it becomes O(n^2) and your caches burn for an extra constant time slow down.
> So I disagree - he changed one thing and now they run in about the same > amount of time as opposed to the java running more than double. To me that > seems pretty conclusive. How the hash code was calculation was performed changed (almost certainly to a slower technique). Checking the implementation of HashMap, it will only calculate the hash map once when inserting (or retrieving) an entry. There is no way that the time taken to calculate the hash code is going to make any impact on performance.
The original poster didn't know about the embedded hash code at the time of writing the original code. It seems unlikely that the file format happened to get the same hash value. So that was a separate change. Given the likelihood of collisions with the first algorithm, that is a very good candidate for causing the increased performance.
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Luc The Perverse - 12 Feb 2006 11:18 GMT >>> That is not necessarily why the C code is faster. >> [quoted text clipped - 33 lines] > the likelihood of collisions with the first algorithm, that is a very good > candidate for causing the increased performance. I see your point - I must have misunderstood the first time round.
And I agree - I don't think there is any reason to hash at all. Just be careful so you don't make redundant copies of strings and you should be fine.
I used a red black tree, but any balanced tree should be ok. I searched for entries, if I found them I incremented their count - if I didn't find them then I incremented nothing.
-- LTP
:) Chris Uppal - 12 Feb 2006 11:51 GMT > The original poster didn't know about the embedded hash code at the time > of writing the original code. It seems unlikely that the file format > happened to get the same hash value. So that was a separate change. > Given the likelihood of collisions with the first algorithm, that is a > very good candidate for causing the increased performance. There's also the marginal effect that the OP was probably interpreting the hash code strings as words and adding them to the table too. Not only would that double the number of words in the table, the words themselves would have a much reduced distribution (all decimal digits, or some such) which might further exacerbate any problem caused by weak hashing.
-- chris
tom fredriksen - 11 Feb 2006 17:28 GMT > When we are talking about numbers like that -- specifically the /very large/ > numbers (for both versions of the program) -- then the JVM startup time is [quoted text clipped - 5 lines] > would be almost completely disk-bound -- I/O bandwidth would dominate the > execution time. This is exactly my point, plus, that if the test is performed several times and the numbers are simply added together the then IO time is multiplied by as many times as the numbers are added, which suddenly makes the result exponentially larger and therefore horrendously wrong.
Measuring the time to read the entire file is one thing, measuring that plus other steps as an accumulation of each other is a completely different matter and would produce the wrong result. What I said earlier still stands if one wants correct results of performance measurements.
/tom
Roedy Green - 11 Feb 2006 18:23 GMT On Sat, 11 Feb 2006 15:27:56 -0000, "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly quoted someone who said :
>And given that the C++ version took 94 seconds, most of these times you are comparing an single person writing both programs. Usually they are expert in one language and inept in the other. You must compare implementations by competent people, not a hack port of one language to another.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Roedy Green - 11 Feb 2006 18:21 GMT On Sat, 11 Feb 2006 14:24:05 +0100, tom fredriksen <tom@spam-me-not.net> wrote, quoted or indirectly quoted someone who said :
>. What you have to do is read >the processors current millisecond counter With 1.5 there is an ever more accurate nanosecond timer. See http://mindprod.com/jgloss/time.html
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
tom fredriksen - 11 Feb 2006 22:33 GMT > On Sat, 11 Feb 2006 14:24:05 +0100, tom fredriksen > <tom@spam-me-not.net> wrote, quoted or indirectly quoted someone who [quoted text clipped - 5 lines] > With 1.5 there is an ever more accurate nanosecond timer. See > http://mindprod.com/jgloss/time.html I know. But not really the point, as you are dealing with an operation that is measured in minutes and seconds.
/tom
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|