Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / April 2006

Tip: Looking for answers? Try searching our database.

Best data structure to use??

Thread view: 
Ben - 15 Apr 2006 19:01 GMT
Hi,

I am trying to find the best way to store an number of objects.
> The number of objects it may have to hold is 10, 000 however it is more likely only ever to need 5,000.
> It will constantly having objects added and removed from it.
> The objects don't have to be retrieved individually.  Any removals will be done by searching through the list and removing them if they meet a specific condition.

I was originally using an arraylist but have been told this isn't very
good because if one element is removed all the objects have to be moved
along one.  Is this true? if so what is a better solution?

Thanks.

Ben.
Martin Gregorie - 15 Apr 2006 19:36 GMT
> Hi,
>
[quoted text clipped - 3 lines]
>> The objects don't have to be retrieved individually.  Any removals will be done by
>> searching through the list and removing them if they meet a specific
condition.

> I was originally using an arraylist but have been told this isn't very
> good because if one element is removed all the objects have to be moved
> along one.  Is this true? if so what is a better solution?

I think you're in red-black binary tree territory here: searching a
4,000 node tree takes at most 12 comparisons and searching a 16,000 node
tree at most 14 comparisons: the bigger the tree the relatively better
the performance becomes. Insertions and removals are fast because they
adjust links without moving any nodes. A red-black tree is a type of
ordered binary tree that rebalances itself as nodes are added or
removed, so it doesn't degenerate into a list if you build it by adding
pre-ordered nodes.

The TreeMap class implements it. A Collection defined on a TreeMap will
let you iterate through the whole tree.

HTH

Signature

martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

Roedy Green - 16 Apr 2006 05:37 GMT
On Sat, 15 Apr 2006 19:36:30 +0100, Martin Gregorie
<martin@see.sig.for.address> wrote, quoted or indirectly quoted
someone who said :

> A red-black tree is a type of
>ordered binary tree that rebalances itself as nodes are added or
>removed, so it doesn't degenerate into a list if you build it by adding
>pre-ordered nodes.

Do red-black trees appreciate a presort by key before building?
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Larry Barowski - 16 Apr 2006 08:06 GMT
> On Sat, 15 Apr 2006 19:36:30 +0100, Martin Gregorie
> <martin@see.sig.for.address> wrote, quoted or indirectly quoted
[quoted text clipped - 6 lines]
>
> Do red-black trees appreciate a presort by key before building?

Quite the opposite. Insertion time will be increased (by a
small constant factor) over randomly ordered input. The
result will be well balanced though.
Martin Gregorie - 16 Apr 2006 13:25 GMT
> On Sat, 15 Apr 2006 19:36:30 +0100, Martin Gregorie
> <martin@see.sig.for.address> wrote, quoted or indirectly quoted
[quoted text clipped - 6 lines]
>
> Do red-black trees appreciate a presort by key before building?

Red-black trees *avoid* the problem that unbalanced trees have with
presorted data. If the data is inserted in key sequence an unbalanced
tree algorithm will insert each new node as the right brother of the
rightmost node, so you end up with a tree that has degenerated into a
list and, as you'd expect, search performance is horrid. Reverse sorted
data is just as bad: you still end up with a fully degenerate tree but
this time it grows on the left.

The red-black tree avoids this problem because it re-balances as needed
after each insert: this adds a small overhead but all branches are the
same length within +/- one node. The overhead is small because
rebalancing just involves swapping pointers, not moving or reallocating
nodes.

The bottom line is that unbalanced trees should be fed unsorted data and
will still have unequal branch lengths, but red-black trees are
unaffected by data ordering.

If you want all the gory details for different types of tree, find a
copy of "Algorithms" by Robert Sedgewick (pub. Addison-Wesley).

====
One caveat: the performance of red-black binary trees is critically
dependent on the data structures used to build the tree and on the
performance of the memory allocator. Here's an illustration:

The standard C implementation uses a hidden structure made of small
nodes. Each holds three pointers (left & right brothers plus one to the
user-defined structure holding the key and data). Hence, adding a node
requires two separate memory allocations for small bits of memory. This
can be brain-hurtingly slow compared with an algorithm that's optimized
to use one structure for key, data and pointers and that grabs large
chunks of memory so it can store lots of nodes in each chunk.

I did such an optimization on a 150 MHz Alphaserver running Digital Unix
(yeah, that long ago: 1997). I had enough test data to build a tree with
200,000 nodes (6.5 MB for the tree) so the measurement statistics were
pretty good. My initial implementation using the standard C library
loaded the tree at 700 inserts/sec. Each insert involved a tree search
to find the insertion point, two memory allocations and a pointer
adjustment to rebalance the tree. My first rewrite used  optimized code
swiped from Sedgewick and combined the pointers and data into a single
structure. This got me up to 1500 inserts/sec during loading, but I
needed better than 3000 inserts/sec to hit the system performance
target. My final rewrite introduced a custom memory allocator that was
optimized by making a few requests on the OS for large chunks of memory
and fit lots of nodes into each chunk. The first chunk requested was 1MB
and this doubled for each successive request. The tree loading speed
went up to 17,000 inserts/sec. Obviously the Digital Unix memory
allocator was rather inefficient!

The TreeMap class may well have the same performance problem because it
too uses a hidden structure to define the tree but adds TWO memory
allocations because the key and data are two separate object instances.
This has to give a performance hit but the size of the hit depends
entirely on the efficiency of the JVM memory allocator.

HTH

Signature

martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

antonyf - 15 Apr 2006 19:46 GMT
Hi Ben,

Have a look at the Collection framwork and you'll may find what's
better suited for you.

http://java.sun.com/j2se/1.4.2/docs/guide/collections/overview.html

>From what I understand maybe a Set collection may suit you (HashSet,
TreeSet or LinkedHashSet)?
Missaka Wijekoon - 15 Apr 2006 20:25 GMT
> Hi,
>
[quoted text clipped - 11 lines]
>
> Ben.

Ben,

The other responders had good suggestions.  Also, try to create a test
case that you think will simulate the workload in a production
environment.  What performs better may depend on the number of adds to
removes ratio.  Also, multi-threading and stored object sizes may make a
difference as you attempt to iterate through the collection. How often
do you iterate?

Cheers,
Missaka
Larry Barowski - 16 Apr 2006 04:49 GMT
> Hi,
>
[quoted text clipped - 9 lines]
> good because if one element is removed all the objects have to be moved
> along one.  Is this true? if so what is a better solution?

Is the "specific condition" something you need to
test against every item, or can it be hashed (something
like  size = 6  or  name = "Fred")? If it requires testing
every item, how many deletes will there be on average
for each test-sweep?

How often, if ever, do you need the items to be in a
particular order, relative to how often you will add
and remove items?

If the "specific condition" requires testing every item
and you will never need the items in order, then a
LinkedList may be good (and you would use
ListIterator.remove() to remove items). But if you
know there will never be more than 10,000 items,
an array with 10000 entries should be faster. If
there will be many deletes per test-sweep, just use a
sweep-and-copy approach. Like this:
int deletes = 0;
for(int i = 0; i < num_items; i++) {
   if(should_delete(items[i]))
       deletes++;
   else if(deletes > 0)
       items[i - deletes] = items[i];
}
num_items -= deletes;
// If holding on to old items (not letting them be garbage
// collected) is cheap or free, then you don't need this.
for(int i = 0; i < deletes; i++)
   items[num_items + i] = null;

If the "specific condition" can be hashed and you
never need the items in order, then you definitely
want to use a HashSet.

If you frequently need the items in order (and that
order is not the insertion order), which it doesn't
sound like you do, then you may want to use a
TreeSet.
Ben - 17 Apr 2006 20:47 GMT
The list will be iterated through every second.  The test case for
removal is got from a method of the stored objects.  If value returned
is true then it should be removed.  Items will be added and removed
every second.  In this situation I can use a key to identify the
indivual objects.  Does this mean i can't use a treemap?
Martin Gregorie - 18 Apr 2006 12:24 GMT
> The list will be iterated through every second.  The test case for
> removal is got from a method of the stored objects.  If value returned
> is true then it should be removed.  Items will be added and removed
> every second.  In this situation I can use a key to identify the
> indivual objects.  Does this mean i can't use a treemap?

You say that the list will be scanned and weeded once a second. The
optimum storage structure depends on the proportion if scans, searches
and alterations per second, i.e., if several hundred items are retrieved
 per second, then the efficiency of scanning the complete set of items
is relatively unimportant unless it interferes with item retrieval.

Do you have any statistics or estimates on the number of:
- items read per second
- items added per second
- items removed per second

Signature

martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

Ben - 18 Apr 2006 13:41 GMT
Actually the list will be fully scanned repeatedly with a delay between
each scan.  The delay may be as small as 100 milliseconds.

I don't forsee the list really going beyond 5,000, however it needs to
be able to handle 10,000.  The items added/removed will vary.  In one
scan the at least 80% of the objects may be removed.  Other times maybe
0%.  I can't give a item per second estimate.
Martin Gregorie - 18 Apr 2006 22:37 GMT
> Actually the list will be fully scanned repeatedly with a delay between
> each scan.  The delay may be as small as 100 milliseconds.
[quoted text clipped - 3 lines]
> scan the at least 80% of the objects may be removed.  Other times maybe
> 0%.  I can't give a item per second estimate.

Are you saying that:

(a) the only operation on the list is to scan it, removing items
    that meet a removal criterion and adding any items that are
    queued for addition?

(b) nothing ever searches the list for an item with a matching key value

I appreciate that the number of items added and removed during (a) may
be very variable, but you really need an estimates for:

- the average number of items removed during each scan.
- the average numbers of items added during each scan
- the number if times (b) happens for each occurrence of (a)

If you really can't estimate these statistics, have you got a real-life
data feed from which you can capture a decent sized sample? A few
thousand events should be plenty. By 'event' I mean:
- a scan being triggered
- an item's state being altered so it will be deleted on the next scan
- the arrival of an item to be inserted.
- a search being triggered
- ....

Ideally you need a time stamp or wait interval after the previous event
so you can run reproduceable real-time tests against your captured data,
but even without timestamp you can learn a *lot* by running tests
against a captured event list like this as you develop the code.

If you have a suitable data feed, then I suggest you build a first cut
class with the simplest possible storage structure and access methods.
Don't make any attempt to optimize it: simplicity is everything at this
stage. A good starting point would to choose a fixed size array that can
hold the entire set of events in the captured data. Write methods that
insert at the end, delete by shifting elements down, and search with a
serial scan. Instrument it and get some statistics off it. You'll learn
a lot from doing this and, most importantly, the statistics will show
you what needs optimization and what doesn't.

Remember that most programs spend 90% of their time executing under 5%
of their code: optimize that 5% and you can usually ignore the rest.

Signature

martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

Larry Barowski - 19 Apr 2006 01:22 GMT
> The list will be iterated through every second.  The test case for
> removal is got from a method of the stored objects.  If value returned
> is true then it should be removed.  Items will be added and removed
> every second.  In this situation I can use a key to identify the
> indivual objects.  Does this mean i can't use a treemap?

If you don't ever need the items in a sorted order (an order
other than insertion order) then a TreeMap will be of no
benefit over a hashed data structure. If you never need to
look up items by some key, which is starting to sound like
the case, then a hashed data structure will be of no benefit.
In other words, if you're just adding new items to the end
of the list, scanning through the list to remove items, and
using the items in some way as an unordered set, then use
a LinkedList or an array or ArrayList (with the efficient
scan/remove procedure I posted earlier). Using a fixed-
size array that is big enough or doing your own array
growing/shrinking if necessary will be the fastest. Using
a LinkedList will result in the simplest code.


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.