Java Forum / General / December 2006
which collection to use
gaurav v bagga - 22 Dec 2006 19:59 GMT hi all,
I have key,value pair of structure and I want to store them in a collection. Which collection will be good with respect to retrieval speed.
I found out from net that HashMap is better then TreeMap in retrieval speed.
HashMap<String, String> hm = new HashMap<String, String>(); hm.put("1", "r"); hm.put("2", "r"); hm.put("3", "r"); hm.put("4", "r"); hm.put("5", "r"); hm.put("6", "r"); hm.put("7", "rd"); hm.put("8", "r"); hm.put("9", "r"); hm.put("10", "r"); hm.put("11", "r"); hm.put("12", "r"); hm.put("13", "r"); hm.put("14", "rd");
long time=System.nanoTime(); System.out.println(hm.get("10")); System.out.println(System.nanoTime()-time);
TreeMap<String, String> tm = new TreeMap<String, String>(); tm.put("1", "r"); tm.put("2", "r"); tm.put("3", "r"); tm.put("4", "r"); tm.put("5", "r"); tm.put("6", "r"); tm.put("7", "rd"); tm.put("8", "r"); tm.put("9", "r"); tm.put("10", "r"); tm.put("11", "r"); tm.put("12", "r"); tm.put("13", "r"); tm.put("14", "rd");
time=System.nanoTime(); System.out.println(tm.get("10")); System.out.println(System.nanoTime()-time);
but this code of mine gives result as
r 2405892 r 832229
it seems tree is better or is it that I am measuring the speed in wrong way?
regards gaurav
Steve W. Jackson - 22 Dec 2006 20:31 GMT > hi all, > [quoted text clipped - 57 lines] > regards > gaurav I doubt that your test is really meaningful, having so few actual values and using String for your keys as you do.
TreeMap is a sorted map. HashMap is not. Based on that, it's logical to expect some overhead on the TreeMap that could impact performance somewhat. Whether and when it becomes meaningful probably depends largely on how much data you've got and your retrieval patterns.
But your choice of one or the other should not be driven by your perception of speed in retrieving their contents, but by your program's needs. If you never need to iterate over your collection in order of its keys, why use a TreeMap? Understand the difference between the two, determine what your need is, and choose whichever best fits.
= Steve =
 Signature Steve W. Jackson Montgomery, Alabama
Flo 'Irian' Schaetz - 22 Dec 2006 21:13 GMT And thus spoke gaurav v bagga...
> I have key,value pair of structure and I want to store them in a > collection. Which collection will be good with respect to retrieval > speed. ...
> I found out from net that HashMap is better then TreeMap in retrieval > speed. NEVER EVER decide something like this based by throwing 14 values at it and look how much "Date" it takes... Sorry, but it doesn't say much. I wouldn't even trust such an experiment if you threw thousands of random numbers at it :-)
A better way to do it...
1. Read the JavaDoc of HashMap and TreeMap. There you'll find which algorithms are used by these classes. In some cases you'll find the complexity of these algorithms there.
2. Use the web or a good book to find out, which Algorithem has which complexity for which method, if you didn't find it in the JavaDoc.
THEN you can decide, which algorithm to use. If you do not understand an algorithm, you will never be able to decide if it's the best one for your job. If you don't know what "Complexity" or "O(n log n)" means, I would advise you to read _at least_ what wikipedia has to offer about these topics. You don't have to know how to prove a complexity, but you should understand, what it means :-)
Flo
Patricia Shanahan - 22 Dec 2006 21:20 GMT > And thus spoke gaurav v bagga... > [quoted text clipped - 27 lines] > > Flo I disagree somewhat with this advice. Big-O notation only tells you about the limiting behavior for large problems.
If there is a very frequent need to retrieve from small collections, the performance could be dominated by overheads and constant factor differences that disappear in the limiting case. There is then no substitute for measurement.
However, the measurement needs to be realistic. In particular, timing one retrieval makes no sense at all.
Patricia
Flo 'Irian' Schaetz - 22 Dec 2006 21:38 GMT And thus spoke Patricia Shanahan...
> I disagree somewhat with this advice. Big-O notation only tells you > about the limiting behavior for large problems. I just said "better" way, not "best" way :-) But you're right, I simply asumed, that he has a somewhat "bigger" problem to solve and not just 14 pairs, my fault. x*n can of course be bigger than n*n for small n, my fault, really.
Anyway, understanding the algorithm seems like the best way to help making a decision. Taking two classes which seem to do aprox. what one wants to have done and than compare them by using "Date" isn't in my Top 10, at least :-)
Flo
Patricia Shanahan - 22 Dec 2006 21:48 GMT > hi all, ....
> r > 2405892 [quoted text clipped - 3 lines] > it seems tree is better or is it that I am measuring the speed in wrong > way? You are measuring speed in a very wrong way. Just taking your test, but putting the work in separate methods and calling them multiple times, I got:
r Hash 1437744 r Tree 333823 r Hash 151559 r Tree 172166 r Hash 2030422 r Tree 182390 r Hash 165632 r Tree 181563 r Hash 173731 r Tree 177306 r Hash 185852 r Tree 177680 r Hash 154439 r Tree 183960
Obviously, the times are very variable, probably due to issues such as cache misses, other programs running...
To measure this properly, go back to the program in which this is a critical issue. Pick a typical list of items, and distribution of requests. Make sure your test program reproduces those. Measure many requests, and do the tests several times, alternating methods, so that you are not charging one method with initial overheads.
The measured piece of work should take order seconds or tens of seconds, so that the odd cache miss or page fault cannot affect the results.
Patricia
Martin Gregorie - 23 Dec 2006 01:10 GMT >> it seems tree is better or is it that I am measuring the speed in wrong >> way? Yes, you are measuring it the wrong way.
In any situation where the system is multi-tasking (and Java is always multitasking to a first approximation because there are at least two threads running) its meaningless to measure elapsed time. The only valid measure of efficiency for an task that doesn't involve i/o is the amount of CPU time it consumes because that, at least, is not affected by other processes unless the machine is so overloaded that every task switch involves paging or task swapping.
Patricia's runs show interference by other processes very nicely.
If you are using Linux or a UNIX you should at least use the "time" utility to measure the CPU time used. As you don't say what OS you're using I can't say more than that.
> To measure this properly, go back to the program in which this is a > critical issue. Pick a typical list of items, and distribution of > requests. Make sure your test program reproduces those. Measure many > requests, and do the tests several times, alternating methods, so that > you are not charging one method with initial overheads. Exactly. We can't advise you because we haven't any idea of: - the number of items you want to search - the relative size of the key and value in each item - the degree of key duplication in your data - the degree of key value clustering in your data - the complexity of the key comparisons [1]
As Patricia said, the number of items and the key value distribution are both vital information which you didn't provide.
The following assumes that all keys are not unique. It that's not true then you need a more complex approach that can handle duplicates.
If the data is static the following is generally true: if you only do infrequent searches of under 10 items a simple sequential search is probably best because its overheads are minimal. With a few hundred items and many searches of static data its better to sort the data once its loaded and use a binary chop search: SortedMap will work this way provided that you build an unsorted map first and then construct the SortedMap from it. Adding unsorted data item by item to an empty SortedMap is likely to be very slow.
If the data isn't static and is initially unordered then either a hash table (HashMap) or a red-black binary tree (TreeMap) is the way to go. If the HashMap hashing algorithm generates synonyms because of your key value distribution or because you've set the initial capacity too low its performance will degrade fast and the TreeMap would be better. If you need to read out the data in order then TreeMap is the only game in town.
[1] This alone can invalidate a theoretical analysis. For instance, most people will swear blind that Quicksort will always blow a Replacement sort into the weeds. No less an authority than Niklaus Wirth says so without any qualification of this opinion so it *must* be so! But his analysis assumes that swapping items is much more expensive than comparing them. If your item swap method is very fast (e.g. swapping references to large objects rather than copying the objects about in an array) and the key comparison is slow (e.g. a lexicographic string comparison over multiple keys rather than comparing single integer keys) then you'll find in practice that a Replacement sort is several times faster than Quicksort. I know this because I've been there.
> The measured piece of work should take order seconds or tens of seconds, > so that the odd cache miss or page fault cannot affect the results. Yes. You need a sample of real data to test against so you get a representative key distribution and the sample has to be large enough to give a realistic performance estimation for live data. This means it *must* be within 2 - 3 orders of magnitude of the design data volume and preferably within one order of magnitude. By way of illustration, I've seen several databases that ran just fine on typical programmer test volumes (under 10-50 items in the biggest table) but that had unusably bad performance with a few tens of thousands of rows loaded.
If all this is new to you, go and find a copy of Sedgewick: Algorithms. Its code examples are in Pascal, but any competent programmer should be able to translate that to Java in their head. There are also Java versions of this book - at a price.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Patricia Shanahan - 23 Dec 2006 10:31 GMT >>> it seems tree is better or is it that I am measuring the speed in wrong >>> way? [quoted text clipped - 8 lines] > processes unless the machine is so overloaded that every task switch > involves paging or task swapping. Yes and no. If the real objective is to measure CPU time it is fine to measure that. However, measuring CPU time does not just eliminate interference from other tasks, it also eliminates the job's own disk wait time.
Also, CPU time can show strange effects when measuring multiple threads on a hyperthreaded processor.
Ultimately, elapsed time is what matters.
Here's what I've done in the past when I wanted an accurate measure of elapsed time for one job:
1. Disconnect all networking.
2. Only one user logged in.
3. Kill all non-essential daemons. I used to know exactly what was needed to keep a Solaris system operational.
4. Kill all non-essential user jobs. Typically, I had one shell and the job being measured, no window manager.
4. Hands off keyboard and mouse for the duration of the run.
However, for many cases that is overkill, and I just go with the time to run a reasonably long piece of work.
Patricia
John Ersatznom - 23 Dec 2006 12:37 GMT > However, for many cases that is overkill, and I just go with the time to > run a reasonably long piece of work. I'd agree here. I'd tend to measure average elapsed time over a large number of jobs representative of the expected usage. The result should be a time measurement representative of what the users can expect -- and that number's more important than cycles or whatever.
I'd even have the system in the kind of state, multitasking-wise, that's likely to be representative of the typical deployment environment. Something than runs fast when it has sole use of the CPU but becomes a pig in molasses due to cache misses or whatever when it runs in a real-world situation will get caught this way, for starters. So will something that makes the rest of the system unusable while it's "thinking", where that may be a showstopper for the users. (Such a something at minimum needs to use a separate, lower-priority thread for the "thinking" and maybe explicitly yield now and again. I've also seen systems slowed to a crawl by software that grovels over the whole filesystem for hours, because of contention for disk access rather than CPU use. The antivirus on my development box runs daily at 8 am, generally taking until 10 to finish, and the system's a pain in the a.s to use during those two hours. Of course, a virus-riddled system would be an even bigger pain, so...)
Patricia Shanahan - 23 Dec 2006 14:40 GMT >> However, for many cases that is overkill, and I just go with the time to >> run a reasonably long piece of work. [quoted text clipped - 3 lines] > be a time measurement representative of what the users can expect -- and > that number's more important than cycles or whatever. I agree when the objective is to find out if a job, as a whole, has acceptable performance. If the job is too slow, the next stage is hunt-the-bottlenecks.
After the bottlenecks have been found, I often want to test alternative solutions with minimal interference from other jobs, so that relatively short runs will tell me which solution is faster.
Patricia
Martin Gregorie - 23 Dec 2006 19:53 GMT >>>> it seems tree is better or is it that I am measuring the speed in wrong >>>> way? [quoted text clipped - 13 lines] > interference from other tasks, it also eliminates the job's own disk > wait time. I believe I qualified my statement by saying that this was the better measurement for tasks that do not involve i/o.
If i/o is involved I pick a reasonably sized data set - as you do - but tempered to be (hopefully) representative of the real life work and key mix and run it several times, discarding inconsistent run times.
> Also, CPU time can show strange effects when measuring multiple threads > on a hyperthreaded processor. I'll take you word for that. The oddest platform I've used for performance prediction was what used to be Digital UNIX (based on the Mach kernel) running on Alpha chips which had odd and non-obvious chunks of serialized code in its guts, like the lstat() SVC.
> Ultimately, elapsed time is what matters. > [quoted text clipped - 15 lines] > However, for many cases that is overkill, and I just go with the time to > run a reasonably long piece of work. Looks good to me if, as you say, slight overkill. Of course the real fun comes from tracking down and eliminating the bottlenecks....
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
gaurav v bagga - 24 Dec 2006 05:42 GMT hi all, thanks for the advice but many things being discussed here are new to me so took time to reply.
as asked ------> Exactly. We can't advise you because we haven't any idea of: - the number of items you want to search =not more than 15 for 70% of time 30 for 25% and 40 for 5% - the relative size of the key and value in each item(if guessed it right you mean string length?) =always less than 20-25 - the degree of key duplication in your data =no duplicates - the degree of key value clustering in your data = dint get this - the complexity of the key comparisons [1] = went through[1] but things were not clear,will need time to assimilate [1] things new to me..... :) - OS = presently windows XP pro - static data? = data will be static most of the time (changes rarely)
Assuming you care about 'hot spots', then your microbenchmarks should loop many times and alternate between algorithms. Better, use real code. WHAT DO YOU MEAN BY "hot spots"
Obviously, the times are very variable, probably due to issues such as cache misses, other programs running... WHAT DO YOU MEAN BY "cache misses"
well this discussion has tought me many diffrent perspectives for performance measurement once again thanks all, good learning for me :)
regards gaurav
Patricia Shanahan - 25 Dec 2006 14:21 GMT ...
> Assuming you care about 'hot > spots', then your microbenchmarks should loop many times and alternate > between algorithms. Better, use real code. WHAT DO YOU MEAN BY "hot > spots" A "hot spot" is a small piece of code that contributes significantly to the execution time.
> Obviously, the times are very variable, probably due to issues such as > cache misses, other programs running... WHAT DO YOU MEAN BY "cache > misses" See http://en.wikipedia.org/wiki/Cache for a general introduction to caches. A "cache miss" is any load, store, or instruction fetch for which the processor has to go to a slower cache, or even memory, because the data is not in a level 1 cache.
Patricia
gaurav v bagga - 26 Dec 2006 05:00 GMT hi Patricia ,
thanks for reply
regards gaurav
Thomas Hawtin - 23 Dec 2006 14:40 GMT > r > 2405892 > r > 832229 Yeah, I get:
r 617475 r 187102
Oh, but I switched and did TreeMap first...
> it seems tree is better or is it that I am measuring the speed in wrong > way? Most modern JREs use adaptive compilers. They will start interpreting code and gradually compile bits and pieces, perhaps the same piece a number of times.
So you need to be extremely careful. Assuming you care about 'hot spots', then your microbenchmarks should loop many times and alternate between algorithms. Better, use real code.
One things microbenchmarks don't cover is seas of not very warm code. Apparently Swing is like this. For good performance there you need to write less code ("there is no code faster than no code").
Either way you tend to get better performance, surprisingly, by keeping methods small (Swing is really bad at this). In particular keep low-level hot spots away from less well used code (Swing is really bad at this). And factor out common code as best you can (Swing is really bad at this).
Tom Hawtin
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|