Java Forum / General / February 2006
performance and memory usage.
arnold - 12 Feb 2006 00:02 GMT Hi
I created a test program to try the speed and size management of the JDK HashMap. With regards to this I was wondering if anybody has any comments on whether this is an appropriate way of testing this performance, i.e. how realistic is the results with respect to caching influence of data, heap management etc. The code is simple, but would it produce a correct result. (the program finishes in 11 seconds on my machine). the code is attached.
Secondly, I was wondering about the memory (heap) usage in java for this program. It needs at least 256 MB of heap to run to finish. cmd:
java -Xms32m -Xmx256m -cp target/classes/ App
Using -Xmx128m causes OutOfMemoryError.
When calculating on the size of the datastructure i find approx that
2M Dto's = 16MB 2M Integer's for map key = 8MB 2M Integer's for ArrayList = 8MB
Thats a total of 32MB plus some megabytes for the objects, datastructures, JVM and so on. So probably about 48-64MB in total. Thats a completely different number from 128 MB or 256 MB.
Does anybody have any comments on why the program requires that much memory?
arnie
*****************
/* Create 2 million Dto objects with random data and insert into HashMap also add key to an iterable list */ public void test1() {
int size = 2000000; Map<Integer, Dto> hmap = new HashMap<Integer, Dto> (size); Collection<Integer> alist = new ArrayList<Integer> (size); Random r = new Random();
Dto data; for (int c=size; c>0; c--) {
data = new Dto(); data.ip= r.nextInt(2000000000);
hmap.put(data.ip, data); alist.add(data.ip); }
int c = 1; for (Integer d : alist) { hmap.get(d); } }
/* Simple data placeholder */ public class Dto { public int ip; public int serialnum; }
Chris Smith - 12 Feb 2006 04:21 GMT > java -Xms32m -Xmx256m -cp target/classes/ App > [quoted text clipped - 5 lines] > 2M Integer's for map key = 8MB > 2M Integer's for ArrayList = 8MB First of all, your Dto class requires 8 bytes just for fields alone. Add to that at LEAST 8 bytes for an object header. That's 32MB at a minimum. The ArrayList needs four bytes per entry just for a pointer to the Integer object... that's another 8 MB. The Integer object itself has 4 bytes of data, and again at least 8 bytes of object header, so that's 24 MB. The HashMap data storage is complicated to estimate, but even completely ignoring the bucket pointers and structure, you still have 4 bytes for a pointer to the Dto and 4 bytes for a pointer to the integer... plus, since the field of Dto is of type int, there's a separate boxing conversion for the HashMap and a new set of Integer objects, so that's another 8 MB + 8 MB + 24 MB = 40 MB. Total so far: 104 MB, and that's assuming minimal object header size and a magical hashmap that requires no overhead for internal data structures.
Of course, you need some spare memory for any dynamic memory allocation algorithm. For a free list (C++-structured), that would be about 8 bytes per object, or at least an extra 48 MB. For a more realistic garbage collected system, this varies a lot more... but a good rule of thumb is to multiply the memory requirement by two if you want decent performance, meaning at least an extra 104 MB.
So, this easily accounts for running out of memory with 128 MB. In fact, it's quite likely that you're memory-starving the garbage collector with 256 MB, and would do better time-wise to provide a little more.
 Signature www.designacourse.com The Easiest Way To Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer MindIQ Corporation
arnold - 12 Feb 2006 10:03 GMT > So, this easily accounts for running out of memory with 128 MB. In > fact, it's quite likely that you're memory-starving the garbage > collector with 256 MB, and would do better time-wise to provide a little > more. Thats actually quite terrible memory utilisation. In that case I have some questions:
1- Is there a more efficient way of storing this in memory, so you could store up to, say, 100 million objects? 2- Is there a method of creating simple data placeholders which are not objects, similar to structs in c/c++. 3- Are there any methods of making the data structures themselves more memory efficient.
In C I could write this so the program would not require more than approx 48MB. (where approx 32MB is the real data and 16 MB is the arrays with the pointers to the structures containing the data).
To me its seems that you should not really use java for in-memory storage, except for smaller data sets.
arnie
Chris Smith - 12 Feb 2006 17:19 GMT > Thats actually quite terrible memory utilisation. It's not all that bad; you're just storing a lot of data. In any case, it's a realtively small constant multiple of what you'd get from any other language.
> In that case I have some questions: > > 1- Is there a more efficient way of storing this in memory, so you could > store up to, say, 100 million objects?
> 3- Are there any methods of making the data structures themselves more > memory efficient. There are a few things you could do, if you're really desperate. For example, you can save a lot of space by re-implementing HashMap and ArrayList to use primitive int data instead of pointers to objects of class Integer.
> 2- Is there a method of creating simple data placeholders which are not > objects, similar to structs in c/c++. No.
> In C I could write this so the program would not require more than > approx 48MB. (where approx 32MB is the real data and 16 MB is the arrays > with the pointers to the structures containing the data). No, you couldn't. You're still ignoring the overhead for memory allocation data structures. That overhead exists in C as well as in Java.
The only difference is that in Java, objects carry an additional 8 bytes or so of object header, and that the HashMap and ArrayList classes work with Object and not int (solvable, as I mentioned, if you are willing to re-implement those classes).
> To me its seems that you should not really use java for in-memory > storage, except for smaller data sets. You should, ideally, use a database for storing large data sets. Ideally, you wouldn't implement your own database anyway, in any language.
It's definitely possible to implement a database in Java if you so desire, but you wouldn't store everything in object-oriented data structures; instead, you'd use memory mapped files and java.nio.Buffer to access the data. In-memory data structures are designed to be convenient and capable for manipulating working data, not to be an efficient means of data storage.
 Signature www.designacourse.com The Easiest Way To Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer MindIQ Corporation
tom fredriksen - 12 Feb 2006 20:21 GMT > There are a few things you could do, if you're really desperate. For > example, you can save a lot of space by re-implementing HashMap and > ArrayList to use primitive int data instead of pointers to objects of > class Integer. I suppose there are some libraries which has alredy done these implementations.
>>Thats actually quite terrible memory utilisation. > [quoted text clipped - 9 lines] > allocation data structures. That overhead exists in C as well as in > Java. how so? Yes, its a lot of data, but the point is that the data is only about 32MB of pure data, the rest of the approx 150MB are internal data for the structures and objects etc (I tested the limits). That yields a utilisation percentage of 15%, which is quite terrible.
In C what you need is a hash array of 4 bytes per entry, and a second array for iteratiing the keys of 4 bytes per entry. then there is the hash data which is 8 bytes per entry. Thats 4 + 4 + 8 * 2million which is 32MB, and that includes all intrinsic data the strucures need. If I am missing something please let me know. (the hash structure might need some more space (i dont know the actual implementation of the function), but its most likely not more than an additional 4-8 bytes per entry, and that would take us up to 48MB.
>>To me its seems that you should not really use java for in-memory >>storage, except for smaller data sets. > > You should, ideally, use a database for storing large data sets. > Ideally, you wouldn't implement your own database anyway, in any > language. Databases and network operations are slow and costly (in money for customer solutions), plus you then have an additional 2-3 programming tiers, in addition the customer gets unnecessary system maintenance.
I am a bit sick of the notion in todays programming world that everything can only be solved by another heavyweight system which is excruciately complex to use and costs an arm and a leg in money and maintenance. Thats just a waste of money and time. Most problems can be solved by much simpler solutions (which are just as adaptable, scalable and possibly contains inherent maintenance), than the tomcat/struts/IBM/BEA/ORACLE/SAP type solutions. Considered BerkleyDB? JavaSpaces? Hashmap with ReiserFS? The problem is that most people dont realise this, so there are'nt many good or well known solutions for such systems.
arnie
Chris Smith - 12 Feb 2006 22:13 GMT > I suppose there are some libraries which has alredy done these > implementations. Well, maybe. For integers, yes there probably are. You can save much more space by inlining the custom data structure into the hash table entries as well, and obviously that would not pre-exist.
Assuming you want to just inline ints... for ArrayList, the complexity of an extra dependency on some third-party code is probably not worth saving the couple dozen lines of code to do this on your own. The HashMap will be more complex, and may be worth the new dependency.
> >>In C I could write this so the program would not require more than > >>approx 48MB. (where approx 32MB is the real data and 16 MB is the arrays [quoted text clipped - 5 lines] > > how so? Dynamic memory management is not magical in C. It uses data structures to work, just like anything else. Those data structures are typically stored in the bytes directly before the memory address that's returned to you from a call to malloc. You are assuming that when you write "malloc(8)", that requires 8 bytes of memory. In reality, it requires about twice that amount, depending on the malloc implementation. Ignoring this makes your comparison unreliable.
Incidentally, I'm still ignoring malloc overhead from memory fragmentation. This overhead won't show up in a simple test program, but will become larger over time in a long-running application, generally asymptotically approaching a constant proportion of the total memory requirement. Modern malloc implementations are pretty good, so if you wanted to include this, you'd generally assume this memory is about 20% or so of the actual memory in use. This is worth mentioning, because this fragmentation would not be present in Java, in which defragmentation is generally a side-effect of garbage collection.
> (the hash structure might need some more space (i dont know the actual > implementation of the function), but its most likely not more than an > additional 4-8 bytes per entry, and that would take us up to 48MB. Okay, let's get more specific.
I'll assume you use a straight-forward implementation of a general- purpose hash table written in C. A hash table will require a set of buckets, and a linked list of data for each bucket. On a 32-bit machine, that's about 8 MB divided by the load factor, for the bucket heads. The linked list for the data will contain a pointer to the data itself, a pointer to the key, and a pointer to the next piece... plus possibly a cached hash value, but we'll ignore that by assuming you've chosen an implementation that opts for mem0ory compactness instead of performance. That is an extra twelve bytes per entry of visible overhead, plus about eight bytes per entry of memory management overhead. You'll need to dynamically allocate space for the keys, so that's an extra 4 bytes of visible memory, plus 8 bytes of allocation overhead. Assuming a load factor of 2, that brings us to 4 MB + (n * 20) + (n * 12), which is a total of 68 MB of overhead for the hash table.
You can eliminate about 32 MB of that 68 MB by writing your own hash table, in which case you could inline the data and keys into your hash entry structure and save the pointers and dynamically allocated keys. Similar savings can be had by using C++ templates, and would be expected in TR1's STL hash structure for example.
So we have an additional overhead of either 68 MB or 36 MB, depending on your implementation choices. The total for C is either 100 MB or 68 MB, not counting some negligible other sources of memory requirement.
So let's take the answer to be 68 MB. In Java, if you similarly write your own data structures to inline the data, you'd use an additional 16 MB on top of that, for a total of 84 MB. Leaving space for the garbage collector to work, you might expect 128 MB to be sufficient (remember that we already accounted for about eight bytes per object that aren't needed in a Java garbage collector since there's generally no explicit free list, so we don't need to double it).
Or let's take the answer to be 100 MB if you don't want to write your own hash table. In Java, you'll get an additional 48 MB, for a total of 148 MB. Make that about 200 MB for reasonable garbage collector performance (same note about not needing the overhead to keep a free list).
Or, if you want to use an ArrayList of Integer in Java instead of managing your own int[], then the memory requirement goes up to 188 MB, or maybe 256 MB for reasonable gc performance. There has been no C equivalent proposed for such a design, but it would probably require about 124 MB of memory (similar methods).
So, overall, it appears that Java requires about twice the memory of C for all possible design alternatives. C++ shows a considerable advantage, though, in that you can use STL data structures vector and TR1's unordered_map, and get the 68 MB result without having to re- implement anything.
 Signature www.designacourse.com The Easiest Way To Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer MindIQ Corporation
arnold - 13 Feb 2006 13:34 GMT > Okay, let's get more specific. Please see attached code sample.
First of, I've implemented this using the C language idioms, which one should do instead of trying to copy the java or OO idioms.
Secondly, you should be aware of one thing about this implementation. That is that the hash table implementation uses character keys, not integer keys as my java implementation. In order to accomodate this I needed to create a string represenation of the key, So this adds a modest size to the program. Also I dont know how much memory the table allocates for a 2mill entry table, show that should add some extra compared to HashMap, All in all its still smaller than your optimal C++ impl. and about 3-4 times smaller than my java impl. (With 4 Gigs of memory this program would be able to store 125 million entries, while the java version would store 48 million entries)
FYI, this program took 3.18 seconds to complete the full implementation, where as my java version needed 11 seconds.
Here are the memory stats for this program. The VIRT column is the most representative, allthough it includes code and libraries sizes. You can use the DATA column if you like, though.
Test 1: No datastructures allocated,
PID VIRT RES SHR SWAP CODE DATA COMMAND 24125 1432 280 1400 1152 8 1424 hash
Test 2: Allocating an iteration array of 2M entries
PID VIRT RES SHR SWAP CODE DATA COMMAND 24134 9248 316 1400 8932 8 9240 hash
Test 3: Allocating the hashtable with no data
PID VIRT RES SHR SWAP CODE DATA COMMAND 24573 32688 30m 1400 1100 8 31m hash
Test 4: Full implementation
PID VIRT RES SHR SWAP CODE DATA COMMAND 24753 63972 61m 1400 1096 8 62m hash
************ code **************
#include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <search.h>
/* Placeholder for data */ typedef struct dto { int ip; int serialnum; } dto_t;
int main(int argc, char **argv) { int *key = NULL; /* The key array for iterating the hash */ int size = 2000000; /* The number of entries*/ if (!hcreate((size_t) size)) { printf("Error creating hashtable\n"); exit(0); }
int r = 0; dto_t *data = NULL; ENTRY e, *ef; char r_str[20];
/* Create data and insert into hash and key array */ key = malloc(sizeof(int) * size); for(int c=0; c<size; c++ ) {
/* A random number between 1 and 2billions */ r = 1+(int) (2000000000.0*rand()/(RAND_MAX+1.0));
key[c] = r; sprintf(r_str, "%i", r); e.key = r_str; data = malloc(sizeof(data)); data->ip = r; e.data = (dto_t *)data;
hsearch(e, ENTER); }
/* Iterate the key array and retrieve hash data */ for(int c=0; c<size; c++ ) {
sprintf(r_str, "%i", key[c]); e.key = r_str; ef = hsearch(e, FIND); if(ef) data = (dto_t *) ef->data; } return(0); }
megagurka - 13 Feb 2006 14:47 GMT arnold skrev:
> > Okay, let's get more specific. > > Please see attached code sample. > > First of, I've implemented this using the C language idioms, which one > should do instead of trying to copy the java or OO idioms. AFAICT there is nothing that stops you from implementing the same data structure in Java as you do in C.
/JN
arnold - 13 Feb 2006 15:00 GMT > arnold skrev: > [quoted text clipped - 7 lines] > AFAICT there is nothing that stops you from implementing the same data > structure in Java as you do in C. Thats true, but in a real scenario it would be relevant for the hashmap only. I dont know how much more efficient it would be though. (the arraylist is of no interrest, since it only used for testing. In any case it can be implemented as a pure int array.)
/arnie
arnold - 12 Feb 2006 20:25 GMT > There are a few things you could do, if you're really desperate. For > example, you can save a lot of space by re-implementing HashMap and > ArrayList to use primitive int data instead of pointers to objects of > class Integer. I suppose there are some libraries which has alredy done these implementations.
>> Thats actually quite terrible memory utilisation. > > It's not all that bad; you're just storing a lot of data. In any case, > it's a realtively small constant multiple of what you'd get from any other language.
>> In C I could write this so the program would not require more than approx 48MB. (where approx 32MB is the real data and 16 MB is the arrays with the pointers to the structures containing the data).
> No, you couldn't. You're still ignoring the overhead for memory allocation data structures. That overhead exists in C as well as in Java.
how so? Yes, its a lot of data, but the point is that the data is only about 32MB of pure data, the rest of the approx 150MB are internal data for the structures and objects etc (I tested the limits). That yields a utilisation percentage of 15%, which is quite terrible.
In C what you need is a hash array of 4 bytes per entry, and a second array for iteratiing the keys of 4 bytes per entry. then there is the hash data which is 8 bytes per entry. Thats 4 + 4 + 8 * 2million which is 32MB, and that includes all intrinsic data the strucures need. If I am missing something please let me know. (the hash structure might need some more space (i dont know the actual implementation of the function), but its most likely not more than an additional 4-8 bytes per entry, and that would take us up to 48MB.
>> To me its seems that you should not really use java for in-memory storage, except for smaller data sets.
> You should, ideally, use a database for storing large data sets. Ideally, you wouldn't implement your own database anyway, in any language.
Databases and network operations are slow and costly (in money for customer solutions), plus you then have an additional 2-3 programming tiers, in addition the customer gets unnecessary system maintenance.
I am a bit sick of the notion in todays programming world that everything can only be solved by another heavyweight system which is excruciately complex to use and costs an arm and a leg in money and maintenance. Thats just a waste of money and time. Most problems can be solved by much simpler solutions (which are just as adaptable, scalable and possibly contains inherent maintenance), than the tomcat/struts/IBM/BEA/ORACLE/SAP type solutions. Considered BerkleyDB? JavaSpaces? Hashmap with ReiserFS? The problem is that most people dont realise this, so there are'nt many good or well known solutions for such systems.
arnie
Roedy Green - 13 Feb 2006 09:42 GMT >There are a few things you could do, if you're really desperate. For >example, you can save a lot of space by re-implementing HashMap and >ArrayList to use primitive int data instead of pointers to objects of >class Integer. One goofy thing about HashMap vs the way it used to be done is there are three objects for each entry, a key, an object and a piece of glue to tied them together with overflow chains.
In the olden days you, might have left a slot in the object itself for the chains, and used a key embedded in the object. You then have only one object per item.
The problem with that sort of old-fashioned implementation is you must think ahead and leave a slot in your objects for HashSet to use, and you can't put the object in more than one HashSet.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
megagurka - 13 Feb 2006 10:13 GMT arnold skrev:
> > So, this easily accounts for running out of memory with 128 MB. In > > fact, it's quite likely that you're memory-starving the garbage [quoted text clipped - 10 lines] > 3- Are there any methods of making the data structures themselves more > memory efficient. Try these primitive type collection libraries:
http://pcj.sourceforge.net/ http://trove4j.sourceforge.net/
/JN
Roedy Green - 13 Feb 2006 10:31 GMT >Try these primitive type collection libraries: > >http://pcj.sourceforge.net/ >http://trove4j.sourceforge.net/ for even more see http://mindprod.com/jgloss/collection.html
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Roedy Green - 12 Feb 2006 12:00 GMT >Does anybody have any comments on why the program requires that much memory? One problem is your HashMap size is wrong. You don't specify the number of elements you want to hold, but the capacity. You must allow some slop at least enough to account for the load factor.. It ran out of room and had to double the size temporarily holding both old and new.
See http://mindprod.com/jgloss/hashmap.html
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
arnold - 12 Feb 2006 20:26 GMT >>Does anybody have any comments on why the program requires that much memory? > [quoted text clipped - 5 lines] > > See http://mindprod.com/jgloss/hashmap.html Theoretically it might be true, but it made no practical difference. I tried increasing the size of both the Map and the Collection by a multiple of 2 and 4 and decreased the heap size, it made no difference. Both this change and the original code needed approx 180MB of heap to work.
arnie
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|