Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / May 2005

Tip: Looking for answers? Try searching our database.

Extreamly large Hashtable

Thread view: 
anthony@station5.net - 06 May 2005 16:52 GMT
Has anyone created an extreamly large Hashtable? I need to create a
simple look up table key/value of some information.

I'm assuming that if it is in memory it will be faster then looking
this value up in a database.

The problem is I have about 4,000,000 rows of data. Assuming I have
enough memory to hold it, will a hashtable break being that large? Will
it be fast? Each row will only be about 10k in size.

Is there a better way?

Thanks,
AJS
David McDivitt - 06 May 2005 17:20 GMT
A database?

>Subject: Extreamly large Hashtable
>Date: 6 May 2005 08:52:28 -0700
[quoted text clipped - 13 lines]
>Thanks,
>AJS
Betty - 06 May 2005 17:22 GMT
> Has anyone created an extreamly large Hashtable? I need to create a
> simple look up table key/value of some information.
[quoted text clipped - 10 lines]
> Thanks,
> AJS

Surely you jest: 4000000 * 10000 = 40000000000;
Chris Uppal - 06 May 2005 17:30 GMT
> The problem is I have about 4,000,000 rows of data. Assuming I have
> enough memory to hold it, will a hashtable break being that large?

There's no reason why a hashtable with that many entries should not function
perfectly well.  The critical question is whether the hash function is good
enough to give a reasonable spread among so many entries.  I'd guess that the
chances are that it is.

But...

> Each row will only be about 10k in size.

So you are talking about 40 GBytes of memory in total ?  Unless you have very
specialised requirements -- and have the cash to buy specialised machinery to
implement those requirements -- that sounds a /little/ on the large side to
me...

I'm almost certain that you'd be better off using a proper database.  They are
optimised for handling large amounts of data without using unreasonable (large,
yes, but not unreasonable) amounts of memory.

   -- chris
Eric Sosman - 06 May 2005 18:35 GMT
> Has anyone created an extreamly large Hashtable? I need to create a
> simple look up table key/value of some information.
[quoted text clipped - 5 lines]
> enough memory to hold it, will a hashtable break being that large? Will
> it be fast? Each row will only be about 10k in size.

   Four megarows times ten kilobytes per row equals forty
gigabytes.  Are you sure you have that much RAM available?
If you don't, you'll just trade database I/O for swap I/O.

   Where does this 40GB come from?  If you can read it at
100 MB/s (from a well-tuned RAID stripe, say), it'll take
about seven minutes to pull in all the data and load the
table.  Divide this seven minutes by the expected savings
per query to get the number of queries you must process in
order to recoup the startup time.

   Note that the 10 KB row size is irrelevant to the table's
performance (unless it means that the keys' equals() and
hashCode() methods are slow).  The table contains only the
references to the objects (that's another 7 GB, at a guess),
not the objects' data fields.

   The above are just a few miscellaneous thoughts -- maybe
if you described your problem in more detail someone would be
able to comment more particularly.

Signature

Eric.Sosman@sun.com

anthony@station5.net - 06 May 2005 19:47 GMT
Would a in-memory database product be any better, or would it be just
about the same.

Basically you can think of it like a phone book system, keyed on a
name. So I have a  name and an address. I would use the system to
verify that the name is a known name. I need it to run as fast as
humanly possible.  Would a simple array be faster?(assuming I could
replace the key to be numeric).

AJS
Betty - 06 May 2005 20:13 GMT
> Would a in-memory database product be any better, or would it be just
> about the same.
[quoted text clipped - 6 lines]
>
> AJS

An in memory database would need at least as much memory unless
you can find one that pages to disk. Of course you would want to
create an index so you don't have to sequentially search the key
attribute ;-)
Eric Sosman - 06 May 2005 20:32 GMT
> Would a in-memory database product be any better, or would it be just
> about the same.
[quoted text clipped - 4 lines]
> humanly possible.  Would a simple array be faster?(assuming I could
> replace the key to be numeric).

   "As fast as humanly possible ..."  Give me "all the
money in the world" and I'll build it for you.

   In other words, you've got to get more specific about
your requirements.  How fast do you need the response to
be?  To put it another way, if one millisecond is too slow
how much money are you prepared to spend to reduce it to
a microsecond?  To a nanosecond?  To an attosecond (which
will require much more than "all the money in the world")?

Signature

Eric.Sosman@sun.com

anthony@station5.net - 06 May 2005 21:36 GMT
I hear you, but I mean within the confines of the language what would
be the fastest way of accessing in-memory data of such a large list of
values.

AJS
Eric Sosman - 06 May 2005 22:55 GMT
> I hear you, but I mean within the confines of the language what would
> be the fastest way of accessing in-memory data of such a large list of
> values.

   STILL no specifics ...  Well, I'll do what I can.

   "Within the confines of the language" -- well, you've
already said you want to use a HashTable, so that's that.
HashMap might be better -- or not; depends on what you're
doing, and I still don't know enough.

   40 GB of "raw data" plus a few more GB of object
references and other such "metadata" -- this implies a
JVM with 64-bit addressing.

   You'll also need about 64 GB of RAM to hold all that
data, metadata, your classes, the JVM, and the O/S.  The
machine isn't going to do much of anything besides serving
up the data for you.

   You'll probably need a 64-bit O/S to manage that much
memory -- it's possible for an O/S to manage more memory
than it can address, but that style of thing seems to have
fallen out of favor.  Solaris, AIX, some Linux distros,
maybe that brand-new Windows (if you can find a 64-bit
JVM for it).  Dunno about Mac; dunno about *BSD; not sure
about zLinux.  OS/2 fans need not apply.

   Alternatively, you could break up the data into, say,
sixteen chunks of a quarter-million rows each (2.5 GB)
and spread the load across sixteen machines with 4 GB of
RAM running 32-bit JVMs and the O/S of your choice.  A
seventeenth machine could field the query and route it,
or you could submit each query to all sixteen servers and
"batch" the results.  This could be pretty fast if you
tie the machines together in something like an Infiniband
fabric.

   Still not fast enough?  Okay: Deploy a hundred such
machines each handling 40,000 rows, and architect the
routing for high parallelism.  Or use a hundred machines
each handling 400,000 rows (so each row appears on ten
different machines) so you can route each query to the
least-busy machine that can satisfy it.

   Still not enough?  No problem: Break open the piggy
bank, and Sun or IBM or SGI or Cray or somebody will build
you a great big computing grid populated with the biggest,
baddest iron they make.  It'll help if you live near a
hydroelectric or nuclear power plant; this solution may
require a little more electricity than ordinary office
wiring can supply.

   Do you see yet, andrew, why "What's the fastest" is the
wrong question?  You've removed all other constraints, so
you get answers that may be in some sense "right" but are
in no sense "useful" -- I think I probably exceeded your
likely budget several paragraphs ago.  The right question
is something like "How do I keep the average (or worst case,
or 90th percentile) response below X milliseconds, and can
it be done with fewer than Y systems on a budget of Z?"
That's something people can deal sensibly with -- but these
"The sky's the limit" questions get nowhere, and rapidly.

Signature

Eric.Sosman@sun.com

John C. Bollinger - 06 May 2005 23:00 GMT
> I hear you, but I mean within the confines of the language what would
> be the fastest way of accessing in-memory data of such a large list of
> values.

If you have keys with good hash codes then a HashMap is probably among
the best choices for lookup performance on in-memory data.  The problem
is that you have too much data to keep it all in memory at once on any
conceivable 32-bit computer, so the question is probably moot.  If you
have available a 64-bit computer with >~ 50GB of RAM, a 64-bit OS, and a
64-bit JVM then the situation might be different, but in that case you
should also have adequate expertise in-house that you wouldn't need to
pose your question here.  (And I note that the time to load the whole
behemoth into memory from disk would be very substantial, too -- I hope
an application startup time measured in tens of minutes is acceptable.)

What you are postulating is silly.  Management of such large amounts of
data is what traditional databases are _for_.

Signature

John Bollinger
jobollin@indiana.edu

Chris Uppal - 07 May 2005 08:06 GMT
> (And I note that the time to load the whole
> behemoth into memory from disk would be very substantial, too -- I hope
> an application startup time measured in tens of minutes is acceptable.)

You could store file-offsets in the main hashtable, and only load the row data
from the backing file on-demand.  If you also arrange some sort of LRU to evict
row data at need then you could even manage to run the thing on a 32-bit box.

Needless to say, for most purposes, that would just be a slower, and more
difficult, way of getting only half the benefits of using a real database.
However, there /are/ circumstances where roll-your-own can provide substantial
benefits, but they are pretty rare (although, oddly enough, I've had to do it
twice).

   -- chris
David McDivitt - 06 May 2005 23:04 GMT
>Subject: Re: Extreamly large Hashtable
>Date: 6 May 2005 13:36:01 -0700
>
>I hear you, but I mean within the confines of the language what would
>be the fastest way of accessing in-memory data of such a large list of
>values.

That would be a binary search of already sorted keys. Probably no more than
seven or eight tests would be required to find any key. An alternative would
be to use a flat file having presorted fixed length records. Instead of
accessing memory you would read whatever offset into the file, which would
be essentially the same thing, and it would only take seven or eight reads
to get the desired record. To update you simply place a new flat file on the
server, created by another process. If I really needed speed and infinite
size I would do it this way.
Murat Tasan - 07 May 2005 00:33 GMT
>>Subject: Re: Extreamly large Hashtable
>>Date: 6 May 2005 13:36:01 -0700
[quoted text clipped - 5 lines]
> That would be a binary search of already sorted keys. Probably no more than
> seven or eight tests would be required to find any key.

in the original post, it was said that this will be used to search for
existence in the database, indicating that probably a good number of
searches will fail.  this means that the average case for search will
trickle down near the leaves of this binary search.

log(4e6) / log(2) = 20.  (approximately)

thus generally at least 20 comparisons will be needed.  now, to decide
which is a better solution, one needs to examine the hashtable properties.
how long does it take to compute the hash function?  then, how are
collisions resolved?  assuming chaining, you need to find the average
length of the lists.

if the function takes x operations (Where each operation is about the
cost of a comparison), and the average length of each chain will be y,
then the average number of comparisons for a successful search will be x +
(y / 2).

although, in this case since search will probably involve many
non-existent entries in the database, the average hashtable cost will be
closer to x operations.

if x << 20, then use a hashtable.

store the actual records on disk and keep references only in the table.

if each disk reference is 4B, and each key is 4B, then the size of the
table is approximately 4e6 * (4B + 4B) = 32e6B, or 32MB.  actually 32MB
is certainly a lower bound, and in experience the overhead of the
structures is often larger than the size of the stored data.  but assuming
512MB or even 1GB of main memory (a reasonable assumption these days),
this is certainly a feasible project.

now, if you are getting alot of positive hits (i.e. searches that match),
you will probably be bottlenecked by disk IO.  but that's a whole 'nother
story...

murat
Chris Uppal - 07 May 2005 07:58 GMT
> > I hear you, but I mean within the confines of the language what would
> > be the fastest way of accessing in-memory data of such a large list of
> > values.
>
> That would be a binary search of already sorted keys. Probably no more
> than seven or eight tests would be required to find any key.

If a sufficiently good hash function is available, and space is not a problem,
then a big hash table can /always/ beat a big binary lookup.  Hashtables can be
made O(1) (with more-or-less any desired constant factor, trading space for
speed), whereas binary lookup is O(log n).

   -- chris
David McDivitt - 09 May 2005 16:01 GMT
>From: "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
>Subject: Re: Extreamly large Hashtable
[quoted text clipped - 11 lines]
>made O(1) (with more-or-less any desired constant factor, trading space for
>speed), whereas binary lookup is O(log n).

A binary search is the fastest and most easily implemented scheme, when it
can be used. This is because binary searches are at the core of any other
lookup scheme, including databases going through an index. By laying out a
sorted flat file and doing a low level binary search directly on that flat
file, much overhead is eliminated. If records are large, then a key and
offset value can be obtained from the flat file, and the offset used to
fetch actual data from another file. For variable length records obtain
offset and length from the first file and use to fetch from the second file.
Binary searches in this manner are greatly enhanced by the base operation
system which caches disk info, removing that load from the application
completely.

The original question was for alternatives to a ridiculously large hash
table. A small or reasonably sized hash table would certainly be better than
a binary search on disk.
John C. Bollinger - 09 May 2005 20:03 GMT
[Chris Uppal wrote:]
["anthony" wrote:]
>>>>I hear you, but I mean within the confines of the language what would
>>>>be the fastest way of accessing in-memory data of such a large list of
[quoted text clipped - 11 lines]
> can be used. This is because binary searches are at the core of any other
> lookup scheme, including databases going through an index.

Your claim is much too strong.  Binary searching is a widely used
technique, but it is manifestly false that a binary search is used in
any way for lookups against unsorted data.  It is also not generally
true that binary searching is used as part of looking up data via a hash
table.  Furthermore, I'm uncertain whether you did not understand Chris
comments about asymptotic complexity or whether you simply chose to
ignore them, but they are irrefutable: in the limit of many data and
given a reasonable hashing function, a hash table will *always*
outperform a binary search.

As for ease of implementation, generic binary searching and hash lookups
are both available in the Java platform library, so I don't see much
difference there.  If anything, it's a bit trickier to ensure that the
underlying data is and remains sorted (for a binary search) than it is
to just add objects to a HashMap.

Signature

John Bollinger
jobollin@indiana.edu

Lee Fesperman - 09 May 2005 21:59 GMT
> A binary search is the fastest and most easily implemented scheme, when it
> can be used. This is because binary searches are at the core of any other
[quoted text clipped - 7 lines]
> system which caches disk info, removing that load from the application
> completely.

Databases on disk going through an index (rather than a hash) will generally use a
b-tree scheme these days. B-tree is not really a binary search and is much faster for
searching disk, even more with a good cache. For example, maintaining the top of the
tree in memory yields significant improvements because they tend to be balanced (a big
issue with binary tree). They also don't normally sort the data records thus allowing
multiple indexes for the same data. Generally, b-tree is preferred because it allows
range searching and ordered sub-sets which hashing doesn't. All of this is more
appropriate for database purposes.

Signature

Lee Fesperman, FFE Software, Inc. (http://www.firstsql.com)
==============================================================
* The Ultimate DBMS is here!
* FirstSQL/J Object/Relational DBMS  (http://www.firstsql.com)

Chris Uppal - 10 May 2005 11:21 GMT
> A binary search is the fastest and most easily implemented scheme, when it
> can be used.

I don't know if you really mean this as it sounds, but taken literally it is
quite simply false.

> This is because binary searches are at the core of any other
> lookup scheme, including databases going through an index.

The same goes for this.

It /is/ true that tree-based indexing is the norm in high-performance DB
indexing, but note that:

a) the "binary" search in question isn't a binary chop over a sorted array --
it'll be a complex implementation of some sophisticated balanced tree
algorithm.

b) the reason for using a balanced tree for the index rather than hashing has
more to do with the efficiency with which elements can be added and removed,
than lookup efficiency.  In fact a hashtable (with a good hash) will beat a
balanced tree implementation for lookup.

> By laying out a
> sorted flat file and doing a low level binary search directly on that flat
> file, much overhead is eliminated.

Very inefficient.  Comparing just the case of a hastable laid out as a large
flat file, vs the same data laid out in sort order and accessed via a binary
chop, the hashtable will be /far/ faster.  The problem is that the sorted
version will require O(log n) disk seeks to find the wanted element, whereas
the hashtable will require 1 disk seek (typically) to do the same.  Disk seeks
are /slow/, doing just 10 of them consumes a substantial fraction of a second.

OTOH, the hashtable would have to be larger, in order to keep hash collisions
to an acceptable minimum.  But it's surprising how little extra space is needed
in this particular case.  A hashtable with a load as high as 90% works fine for
this, which is not what you'd normally expect.  It's because the access time is
dominated by disk-seek time, and one seek followed by a read of a block of data
from the disk will normally pull in quite a few keys (depending on key-size, of
course) and provided that the hash collision can be resolved without needing to
read another block, the cost of the extra comparisons is essentially zero.

Note that both implementations would suffer from the problem that inserting or
deleting records could be /very/ expensive.  The sorted table would suffer
rather worse from this problem than the hashtable, and the jiggery-pokery
required to reduce the problems would be harder to implement for the sorted
table, but neither is really suitable for anything except very special
purposes.

   -- chris
Eric Sosman - 09 May 2005 15:50 GMT
>>Subject: Re: Extreamly large Hashtable
>>Date: 6 May 2005 13:36:01 -0700
[quoted text clipped - 5 lines]
> That would be a binary search of already sorted keys. Probably no more than
> seven or eight tests would be required to find any key.

   lg(4000000) ~= 22, hence an average of ~21 probes for
a successful search, ~22 for worst-case success or for an
unsuccessful search.

> An alternative would
> be to use a flat file having presorted fixed length records. Instead of
> accessing memory you would read whatever offset into the file, which would
> be essentially the same thing, and it would only take seven or eight reads
> to get the desired record.

   To choose one of 4000000 items in "seven or eight" probes,
each probe needs to produce 6.7-8.8 outcomes.  An eight-way
tree would do the job in ~6.3 probes average, ~7.3 worst case,
which more or less fits your description -- except that this
is the first time I've seen an eight-way tree described as a
"flat file!"

> To update you simply place a new flat file on the
> server, created by another process. If I really needed speed and infinite
> size I would do it this way.

   Seven or eight I/O's per probe -- or twenty-one if you use
binary search -- is not "speed" in my book.  It'll surely take
at least as long (probably longer) than the database the O.P.
already rejected as too slow.  (Well, not exactly: He rejected
it as "not the fastest;" we still don't know how much speed he
actually needs.)

Signature

Eric.Sosman@sun.com

David McDivitt - 06 May 2005 22:58 GMT
>Subject: Re: Extreamly large Hashtable
>Date: Fri, 06 May 2005 15:32:47 -0400
[quoted text clipped - 17 lines]
>a microsecond?  To a nanosecond?  To an attosecond (which
>will require much more than "all the money in the world")?

Use a database and keep the connection hot, possibly instantiating the
driver and connection object within the lookup class. This is not a good
practice but it would allow for minimum overhead. If a database error
occurred, destroy the objects, recreate, and try again. This would be the
fastest possible solution. Beyond that you may need to optimize the database
for fastest access, but it is doubtful that would be an issue. You would
need a close method to destroy everything to be called at application
shutdown.
William Brogden - 07 May 2005 18:17 GMT
> Would a in-memory database product be any better, or would it be just
> about the same.
[quoted text clipped - 6 lines]
>
> AJS

If you just need to verify that the name is known, look into the
extremely compact way that spell checkers store words. Essentially
you would have a spell checker full of names.

Bill
Bjorn Borud - 09 May 2005 21:11 GMT
["William Brogden" <wbrogden@bga.com>]

| If you just need to verify that the name is known, look into the
| extremely compact way that spell checkers store words. Essentially
| you would have a spell checker full of names.

yes, two things one may want to look up are Bloom Filters and Finite
State Automata.

-Bjørn
Boudewijn Dijkstra - 06 May 2005 22:54 GMT
>> Has anyone created an extreamly large Hashtable? I need to create a
>> simple look up table key/value of some information.
[quoted text clipped - 9 lines]
> gigabytes.  Are you sure you have that much RAM available?
> If you don't, you'll just trade database I/O for swap I/O.

I believe the OP was talking about keeping the hashtable in memory, not the
entire database.

>    Where does this 40GB come from?  If you can read it at
> 100 MB/s (from a well-tuned RAID stripe, say), it'll take
> about seven minutes to pull in all the data and load the
> table.  Divide this seven minutes by the expected savings
> per query to get the number of queries you must process in
> order to recoup the startup time.

Right.  Mind that, to binary search such a record in such a database with a
load of 75%, it takes about 39 times a disk access and a compare of two 10 kB
blocks.  Subtract from this, about 4/3 disk accesses and a hashcode
computation, to get the expected savings.

>    Note that the 10 KB row size is irrelevant to the table's
> performance (unless it means that the keys' equals() and
> hashCode() methods are slow).  The table contains only the
> references to the objects (that's another 7 GB, at a guess),
> not the objects' data fields.

Four million times 8 bytes (4-byte hashcode, 4-byte index) at a load of 75% is
still less than 40 MB.  Where is your guess based on?
Wibble - 07 May 2005 03:25 GMT
If your key is actually names, consider using SOUNDEX.  It
makes for a very tight key compression and its tolerant of
spelling errors.

>>>Has anyone created an extreamly large Hashtable? I need to create a
>>>simple look up table key/value of some information.
[quoted text clipped - 33 lines]
> Four million times 8 bytes (4-byte hashcode, 4-byte index) at a load of 75% is
> still less than 40 MB.  Where is your guess based on?
Eric Sosman - 09 May 2005 16:48 GMT
>>>Has anyone created an extreamly large Hashtable? I need to create a
>>>simple look up table key/value of some information.
[quoted text clipped - 12 lines]
> I believe the OP was talking about keeping the hashtable in memory, not the
> entire database.

   That's possible; the O.P. has not divulged much about
what he's trying to do.  I thought he intended to keep the
whole thing in memory because he specifically mentioned the
row size; if he were intending just to store the keys and
a disk pointer he'd have talked about the key size.  Also,
he seemed to have a speed fetish, and speed is incompatible
with I/O ...

>>   Note that the 10 KB row size is irrelevant to the table's
>>performance (unless it means that the keys' equals() and
[quoted text clipped - 4 lines]
> Four million times 8 bytes (4-byte hashcode, 4-byte index) at a load of 75% is
> still less than 40 MB.  Where is your guess based on?

   I may have guessed wrong; let me try again.  The Hashtable
itself contains four million references to Map.Entry objects;
in a 64-bit JVM (on the load-all-the-data assumption) they'll
take eight bytes apiece for 32 MB.

   Each Map.Entry object holds references to a key and a value
(sixteen bytes) plus some amount of overhead to express its own
"objectivity" -- I don't know how much that is, but let's assume
it's something like sixteen bytes.  That's thirty-two bytes for
each of the four million Map.Entry objects, or 128 MB.

   Grand total for metadata: about 160 MB, "just a trifle" less
than my original guess.  Even if the Map.Entry overhead is more
like thirty-two bytes and even if the Hashtable itself stores
more than just the Map.Entry reference, the total won't come
close to my 7 GB blunder ...  (Memo to self: Stop trying to do
back-of-the-envelope calculations on envelope-less E-mail ;-)

Signature

Eric.Sosman@sun.com



Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.