Java Forum / General / April 2006
Best data structure to use??
Ben - 15 Apr 2006 19:01 GMT Hi,
I am trying to find the best way to store an number of objects.
> The number of objects it may have to hold is 10, 000 however it is more likely only ever to need 5,000. > It will constantly having objects added and removed from it. > The objects don't have to be retrieved individually. Any removals will be done by searching through the list and removing them if they meet a specific condition. I was originally using an arraylist but have been told this isn't very good because if one element is removed all the objects have to be moved along one. Is this true? if so what is a better solution?
Thanks.
Ben.
Martin Gregorie - 15 Apr 2006 19:36 GMT > Hi, > [quoted text clipped - 3 lines] >> The objects don't have to be retrieved individually. Any removals will be done by >> searching through the list and removing them if they meet a specific condition.
> I was originally using an arraylist but have been told this isn't very > good because if one element is removed all the objects have to be moved > along one. Is this true? if so what is a better solution? I think you're in red-black binary tree territory here: searching a 4,000 node tree takes at most 12 comparisons and searching a 16,000 node tree at most 14 comparisons: the bigger the tree the relatively better the performance becomes. Insertions and removals are fast because they adjust links without moving any nodes. A red-black tree is a type of ordered binary tree that rebalances itself as nodes are added or removed, so it doesn't degenerate into a list if you build it by adding pre-ordered nodes.
The TreeMap class implements it. A Collection defined on a TreeMap will let you iterate through the whole tree.
HTH
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Roedy Green - 16 Apr 2006 05:37 GMT On Sat, 15 Apr 2006 19:36:30 +0100, Martin Gregorie <martin@see.sig.for.address> wrote, quoted or indirectly quoted someone who said :
> A red-black tree is a type of >ordered binary tree that rebalances itself as nodes are added or >removed, so it doesn't degenerate into a list if you build it by adding >pre-ordered nodes. Do red-black trees appreciate a presort by key before building?
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Larry Barowski - 16 Apr 2006 08:06 GMT > On Sat, 15 Apr 2006 19:36:30 +0100, Martin Gregorie > <martin@see.sig.for.address> wrote, quoted or indirectly quoted [quoted text clipped - 6 lines] > > Do red-black trees appreciate a presort by key before building? Quite the opposite. Insertion time will be increased (by a small constant factor) over randomly ordered input. The result will be well balanced though.
Martin Gregorie - 16 Apr 2006 13:25 GMT > On Sat, 15 Apr 2006 19:36:30 +0100, Martin Gregorie > <martin@see.sig.for.address> wrote, quoted or indirectly quoted [quoted text clipped - 6 lines] > > Do red-black trees appreciate a presort by key before building? Red-black trees *avoid* the problem that unbalanced trees have with presorted data. If the data is inserted in key sequence an unbalanced tree algorithm will insert each new node as the right brother of the rightmost node, so you end up with a tree that has degenerated into a list and, as you'd expect, search performance is horrid. Reverse sorted data is just as bad: you still end up with a fully degenerate tree but this time it grows on the left.
The red-black tree avoids this problem because it re-balances as needed after each insert: this adds a small overhead but all branches are the same length within +/- one node. The overhead is small because rebalancing just involves swapping pointers, not moving or reallocating nodes.
The bottom line is that unbalanced trees should be fed unsorted data and will still have unequal branch lengths, but red-black trees are unaffected by data ordering.
If you want all the gory details for different types of tree, find a copy of "Algorithms" by Robert Sedgewick (pub. Addison-Wesley).
==== One caveat: the performance of red-black binary trees is critically dependent on the data structures used to build the tree and on the performance of the memory allocator. Here's an illustration:
The standard C implementation uses a hidden structure made of small nodes. Each holds three pointers (left & right brothers plus one to the user-defined structure holding the key and data). Hence, adding a node requires two separate memory allocations for small bits of memory. This can be brain-hurtingly slow compared with an algorithm that's optimized to use one structure for key, data and pointers and that grabs large chunks of memory so it can store lots of nodes in each chunk.
I did such an optimization on a 150 MHz Alphaserver running Digital Unix (yeah, that long ago: 1997). I had enough test data to build a tree with 200,000 nodes (6.5 MB for the tree) so the measurement statistics were pretty good. My initial implementation using the standard C library loaded the tree at 700 inserts/sec. Each insert involved a tree search to find the insertion point, two memory allocations and a pointer adjustment to rebalance the tree. My first rewrite used optimized code swiped from Sedgewick and combined the pointers and data into a single structure. This got me up to 1500 inserts/sec during loading, but I needed better than 3000 inserts/sec to hit the system performance target. My final rewrite introduced a custom memory allocator that was optimized by making a few requests on the OS for large chunks of memory and fit lots of nodes into each chunk. The first chunk requested was 1MB and this doubled for each successive request. The tree loading speed went up to 17,000 inserts/sec. Obviously the Digital Unix memory allocator was rather inefficient!
The TreeMap class may well have the same performance problem because it too uses a hidden structure to define the tree but adds TWO memory allocations because the key and data are two separate object instances. This has to give a performance hit but the size of the hit depends entirely on the efficiency of the JVM memory allocator.
HTH
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
antonyf - 15 Apr 2006 19:46 GMT Hi Ben,
Have a look at the Collection framwork and you'll may find what's better suited for you.
http://java.sun.com/j2se/1.4.2/docs/guide/collections/overview.html
>From what I understand maybe a Set collection may suit you (HashSet, TreeSet or LinkedHashSet)?
Missaka Wijekoon - 15 Apr 2006 20:25 GMT > Hi, > [quoted text clipped - 11 lines] > > Ben. Ben,
The other responders had good suggestions. Also, try to create a test case that you think will simulate the workload in a production environment. What performs better may depend on the number of adds to removes ratio. Also, multi-threading and stored object sizes may make a difference as you attempt to iterate through the collection. How often do you iterate?
Cheers, Missaka
Larry Barowski - 16 Apr 2006 04:49 GMT > Hi, > [quoted text clipped - 9 lines] > good because if one element is removed all the objects have to be moved > along one. Is this true? if so what is a better solution? Is the "specific condition" something you need to test against every item, or can it be hashed (something like size = 6 or name = "Fred")? If it requires testing every item, how many deletes will there be on average for each test-sweep?
How often, if ever, do you need the items to be in a particular order, relative to how often you will add and remove items?
If the "specific condition" requires testing every item and you will never need the items in order, then a LinkedList may be good (and you would use ListIterator.remove() to remove items). But if you know there will never be more than 10,000 items, an array with 10000 entries should be faster. If there will be many deletes per test-sweep, just use a sweep-and-copy approach. Like this: int deletes = 0; for(int i = 0; i < num_items; i++) { if(should_delete(items[i])) deletes++; else if(deletes > 0) items[i - deletes] = items[i]; } num_items -= deletes; // If holding on to old items (not letting them be garbage // collected) is cheap or free, then you don't need this. for(int i = 0; i < deletes; i++) items[num_items + i] = null;
If the "specific condition" can be hashed and you never need the items in order, then you definitely want to use a HashSet.
If you frequently need the items in order (and that order is not the insertion order), which it doesn't sound like you do, then you may want to use a TreeSet.
Ben - 17 Apr 2006 20:47 GMT The list will be iterated through every second. The test case for removal is got from a method of the stored objects. If value returned is true then it should be removed. Items will be added and removed every second. In this situation I can use a key to identify the indivual objects. Does this mean i can't use a treemap?
Martin Gregorie - 18 Apr 2006 12:24 GMT > The list will be iterated through every second. The test case for > removal is got from a method of the stored objects. If value returned > is true then it should be removed. Items will be added and removed > every second. In this situation I can use a key to identify the > indivual objects. Does this mean i can't use a treemap? You say that the list will be scanned and weeded once a second. The optimum storage structure depends on the proportion if scans, searches and alterations per second, i.e., if several hundred items are retrieved per second, then the efficiency of scanning the complete set of items is relatively unimportant unless it interferes with item retrieval.
Do you have any statistics or estimates on the number of: - items read per second - items added per second - items removed per second
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Ben - 18 Apr 2006 13:41 GMT Actually the list will be fully scanned repeatedly with a delay between each scan. The delay may be as small as 100 milliseconds.
I don't forsee the list really going beyond 5,000, however it needs to be able to handle 10,000. The items added/removed will vary. In one scan the at least 80% of the objects may be removed. Other times maybe 0%. I can't give a item per second estimate.
Martin Gregorie - 18 Apr 2006 22:37 GMT > Actually the list will be fully scanned repeatedly with a delay between > each scan. The delay may be as small as 100 milliseconds. [quoted text clipped - 3 lines] > scan the at least 80% of the objects may be removed. Other times maybe > 0%. I can't give a item per second estimate. Are you saying that:
(a) the only operation on the list is to scan it, removing items that meet a removal criterion and adding any items that are queued for addition?
(b) nothing ever searches the list for an item with a matching key value
I appreciate that the number of items added and removed during (a) may be very variable, but you really need an estimates for:
- the average number of items removed during each scan. - the average numbers of items added during each scan - the number if times (b) happens for each occurrence of (a)
If you really can't estimate these statistics, have you got a real-life data feed from which you can capture a decent sized sample? A few thousand events should be plenty. By 'event' I mean: - a scan being triggered - an item's state being altered so it will be deleted on the next scan - the arrival of an item to be inserted. - a search being triggered - ....
Ideally you need a time stamp or wait interval after the previous event so you can run reproduceable real-time tests against your captured data, but even without timestamp you can learn a *lot* by running tests against a captured event list like this as you develop the code.
If you have a suitable data feed, then I suggest you build a first cut class with the simplest possible storage structure and access methods. Don't make any attempt to optimize it: simplicity is everything at this stage. A good starting point would to choose a fixed size array that can hold the entire set of events in the captured data. Write methods that insert at the end, delete by shifting elements down, and search with a serial scan. Instrument it and get some statistics off it. You'll learn a lot from doing this and, most importantly, the statistics will show you what needs optimization and what doesn't.
Remember that most programs spend 90% of their time executing under 5% of their code: optimize that 5% and you can usually ignore the rest.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Larry Barowski - 19 Apr 2006 01:22 GMT > The list will be iterated through every second. The test case for > removal is got from a method of the stored objects. If value returned > is true then it should be removed. Items will be added and removed > every second. In this situation I can use a key to identify the > indivual objects. Does this mean i can't use a treemap? If you don't ever need the items in a sorted order (an order other than insertion order) then a TreeMap will be of no benefit over a hashed data structure. If you never need to look up items by some key, which is starting to sound like the case, then a hashed data structure will be of no benefit. In other words, if you're just adding new items to the end of the list, scanning through the list to remove items, and using the items in some way as an unordered set, then use a LinkedList or an array or ArrayList (with the efficient scan/remove procedure I posted earlier). Using a fixed- size array that is big enough or doing your own array growing/shrinking if necessary will be the fastest. Using a LinkedList will result in the simplest code.
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|