Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / First Aid / August 2006

Tip: Looking for answers? Try searching our database.

Memory Allocation in Java

Thread view: 
Christopher Smith - 26 Aug 2006 15:48 GMT
Hi All -

Problem:  I have a large array of floating point numbers I need to look
for. These results come from a brut-force grid search, where the
coordinates (x,y) are non-parametric test results.

The problem is that the length of x and y, and thus the size of the grid is
quite large.  The length is a minimum of 120,000 both directions on the
grid, for a total of 14,400,000,000 possible combinations.  Which obviously
consumes a lot of memory -- somewhere on the order of 500 MB, if 32-bit
floating point.

I'm trying to think through the best way to handle this.  Currently, I'm
setting up an array Array[][] results = new Array[120000][120000].

Questions
When intitailizing the array, the ide usually hangs.  Is the compiler
trying to find contingous memory?
Is there a better way to handle this the results?  If I initialized a
vector of length 14.4 million, would that be faster?  Any ideas are greatly
appreciated.

Thanks, chris
Andrew Thompson - 26 Aug 2006 16:03 GMT
...
> Problem:  I have a large array of floating point numbers
...
> .....  Which obviously
> consumes a lot of memory -- somewhere on the order of 500 MB, if 32-bit
> floating point.

Note that you can increase the memory available to
a Java process.  See the -Xms/-Xmx flags.
<http://java.sun.com/j2se/1.5.0/docs/tooldocs/windows/java.html#Xms>

Andrew T.
Christopher Smith - 26 Aug 2006 18:11 GMT
> ...
>> Problem:  I have a large array of floating point numbers
[quoted text clipped - 8 lines]
>
> Andrew T.

Andrew -

Thanks for your response. I did try that. I use the vm option:  -Xmx768m.  
After looking  at the link, I realized that the x option is the cap, not
the initial size.  Let me rerun and see if that helps.
Andrew Thompson - 27 Aug 2006 02:55 GMT
> > ...
> >> Problem:  I have a large array of floating point numbers
[quoted text clipped - 6 lines]
> > a Java process.  See the -Xms/-Xmx flags.
> > <http://java.sun.com/j2se/1.5.0/docs/tooldocs/windows/java.html#Xms>

(Follow-Up relevant to comments made on other
messages in this thread)

AFAIU - the Xmx argument should do it.

To be honest, I couldn't find an 'anchor' to jump directly to
the Xmx description (damn Sun HTML!), but since it was
right beneath the #Xms anchor -  so I decided to mention
it at the same time.

OTOH - Eric made some interesting numbers fall out
of your specs..

Andrew T.
Patricia Shanahan - 26 Aug 2006 16:09 GMT
> Hi All -
>
[quoted text clipped - 19 lines]
>
> Thanks, chris

If it were a contiguous memory issue, that would go away if you
initialized each array separately - remembering you really have a
120,000 element array whose elements are references to 120,000 element
arrays of float.

Do you have enough swap space in the machine?

Do you have problems if you run it outside the IDE?

Patricia
Christopher Smith - 26 Aug 2006 18:13 GMT
>> Hi All -
>>
[quoted text clipped - 31 lines]
>
> Patricia

Gotcha.  Let me try it first with the right compiler flag (please see my
response to Andrew).

If that doesn't help, I think what you're saying is that I should
initialize the array along the X axis and then intiatialize an array
within each "row" -- for lack of a better term.  Am I getting this right?

Should have plent of mem.  Over 1GB of ram and healthy swap space
(windows xp prof is the current platform but the app will run regulary on
redhat.)

Also, I'll try outside the IDE.
Patricia Shanahan - 26 Aug 2006 20:14 GMT
...
> Gotcha.  Let me try it first with the right compiler flag (please see my
> response to Andrew).
>
> If that doesn't help, I think what you're saying is that I should
> initialize the array along the X axis and then intiatialize an array
> within each "row" -- for lack of a better term.  Am I getting this right?

I don't know that this will do any good, but it seems worth a try. Note
that it gives you the option of reporting progress. You could output a
message including memory statistics every 1000 rows. Even if it does not
work, that may give some more clues.

Patricia
Eric Sosman - 26 Aug 2006 20:52 GMT
> Hi All -
>
[quoted text clipped - 7 lines]
> consumes a lot of memory -- somewhere on the order of 500 MB, if 32-bit
> floating point.

    ... for suitable values of "somewhere on the order of."
120000 * 120000 * 4 = 57600000000 ~= 55000 MB ~= 54 GB.  Are
you sure the dimensions you've given are correct?

    If the dimensions are correct, I hope you have a 64-bit
JVM and a pretty substantial machine to work with.

Signature

Eric Sosman
esosman@acm-dot-org.invalid

Patricia Shanahan - 27 Aug 2006 03:10 GMT
>> Hi All -
>>
[quoted text clipped - 14 lines]
>     If the dimensions are correct, I hope you have a 64-bit
> JVM and a pretty substantial machine to work with.

Good point. I didn't check the arithmetic on the memory size.

This raises a whole different set of issues. Maybe brute force is not
the way to go.

How many of the elements of the matrix are non-zero? Maybe this is a
case for sparse matrix techniques?

If dense, it might be better to keep it on disk. That raises a whole set
of issues of whether it is possible to batch and sort updates to the
matrix to reduce the amount of I/O.

Patricia
Christopher Smith - 27 Aug 2006 15:09 GMT
>>> Hi All -
>>>
[quoted text clipped - 28 lines]
>
> Patricia

Thanks for straightnening out my math.  

I do have horsepower, but don't want to tie up that many resources.

No, sparse matrix math won't work. Every field has a value.  

I guess divide and conquer is the right way to go.  What I can do is
splice the grid-search into quadrants, process each quadrant, report and
record the quadrant results (i.e., I'm searching for the Max within the
grid).  From there, it's just a matter of rolling through the quadrants.

Thanks for your help.

Chris
Eric Sosman - 27 Aug 2006 16:07 GMT
>>>>Hi All -
>>>>
[quoted text clipped - 34 lines]
>
> No, sparse matrix math won't work. Every field has a value.  

    It's a startlingly large number of values; may I ask where
they all came from?  Just curious, really.

    Amusing factoid: There are about 3.4 times as many fields
as there are distinct `float' values.

> I guess divide and conquer is the right way to go.  What I can do is
> splice the grid-search into quadrants, process each quadrant, report and
> record the quadrant results (i.e., I'm searching for the Max within the
> grid).  From there, it's just a matter of rolling through the quadrants.

    Could you explain the nature of this search a little more?
Simply "searching for the Max" in a big collection of numbers
requires very little memory; there's no need to retain a number
that's known to be non-maximal.

Signature

Eric Sosman
esosman@acm-dot-org.invalid

Patricia Shanahan - 27 Aug 2006 16:12 GMT
>>>>> Hi All -
>>>>>
[quoted text clipped - 49 lines]
> requires very little memory; there's no need to retain a number
> that's known to be non-maximal.

Indeed. If that is the typical operation, I would just stick the numbers
in any convenient order in a disk file, and scan it sequentially,
keeping the max so far and its location.

For many other operations, there are known out-of-core algorithms that
are designed to do as much work as possible on a chunk while it is in
memory. Don't assume that the right algorithm if the problem fits in
memory is necessarily the right algorithm if it doesn't.

Patricia
Christopher Smith - 27 Aug 2006 17:57 GMT
>>>>>Hi All -
>>>>>
[quoted text clipped - 37 lines]
>      It's a startlingly large number of values; may I ask where
> they all came from?  Just curious, really.

Sure.  The numbers represent the (mean value/stdev value) of a longtudinal
time series, when tested against two conditions, x & y. In other words, it
looks to find the best result on daily basis historically when a sample is
tested against these two conditions (i.e., "in-sample").  Think of it as
looking back in time, testing against two conditions with perfect
knowledge, looking at the results for those two conditions, x & y, and find
those two conditional tests that give the best results over the time series
ex-post.  I actually want to test against six conditions when I'm done, but
am working which are the the two which explain return variance most
significantly.

>      Amusing factoid: There are about 3.4 times as many fields
> as there are distinct `float' values.

What is a "field" mean in this case?

>> I guess divide and conquer is the right way to go.  What I can do is
>> splice the grid-search into quadrants, process each quadrant, and hen
[quoted text clipped - 6 lines]
> requires very little memory; there's no need to retain a number
> that's known to be non-maximal.

I run through these test conditions over a time series, starting with the
lowest test parameter for x and for y. I then iterate over the time series,
looking at each time-record-entry for a set of conditions, including x&y,
if those conditions are true, and the values are greater than x or  y, I
then record the entry.  At the conclusion of the time series, I then
iterate the result to test again, with x += x + x_step and y += y + y_step.  
After retesting 120k^2 times, I then look across the grid to determine
which x and y test provide the maximum value.   This concludes the in-
sample testing. I then test out-of-sample to check projection error, but
that's a whole other discussion!  HTH.

I guess what you're saying is that I can discard all previous numbers each
time I find a new maximum?  If so, what would be the computational tradeoff
between storing the results and testing for the max result, as opposed to
testing for the max result as I go?

Chris
Patricia Shanahan - 27 Aug 2006 18:47 GMT
...
> I guess what you're saying is that I can discard all previous numbers each
> time I find a new maximum?  If so, what would be the computational tradeoff
> between storing the results and testing for the max result, as opposed to
> testing for the max result as I go?
...

In your situation, computation is the very least of your worries.
Getting the data in and, if necessary, out in an efficient fashion will
make far more difference to performance. Count the disk I/O. Look at
whether the I/O is sequential or random.

If saving some data produces a net reduction in data transfer, do it. If
recalculating does less I/O, do that.

Here are some tricks that may help - I don't understand your application
well enough to say which are best:

1. Make sure the input data is dense. If the file contains data that you
are not using, build an extracted file that contains only the data that
matters.

2. Consider working on rectangular chunks of matrix, finding the best
value for each chunk.

3. Consider keeping the input file in memory in a compact form, if it is
small enough.

However, the big message is to evaluate each strategy in terms of its
effect on data transfer.

Patricia
Eric Sosman - 27 Aug 2006 19:40 GMT
> Eric Sosman <esosman@acm-dot-org.invalid> wrote:
>>>[...]
[quoted text clipped - 13 lines]
> am working which are the the two which explain return variance most
> significantly.

    I'm still having trouble following your description.  You
started out with 120000 by 120000 32-bit floating-point values,
referred to the assemblage as a "grid," and spoke of the data
as being "non-parametric."  Okay, thus far I'm imagining a big
blob of numbers representing fourteen billion measurements on
a two-dimensional surface (not necessarily spatial, but 2D).
But now you're speaking of "mean value/stdev value," which implies
either that the measurements are grouped into sets by some rule
that hasn't been specified, or that there are in fact two sets of
fourteen billion numbers (fourteen billion means, fourteen billion
variances) that summarize a still larger set of measurements.  And
as if that weren't enough, you've thrown a time axis into the mix;
the data set is now to be understood as three-dimensional?  Or
perhaps more?

>>     Amusing factoid: There are about 3.4 times as many fields
>>as there are distinct `float' values.
>
> What is a "field" mean in this case?

    It was your term: "Every field has a value."  I assumed you
meant it to refer to one of the foutreen billion points of the
2D grid, but I'm no longer sure what you mean.

>>>I guess divide and conquer is the right way to go.  What I can do is
>>>splice the grid-search into quadrants, process each quadrant, and hen
>>>and record the quadrant results (i.e., I'm searching for the Max
>>>within the grid).  From there, it's just a matter of rolling through
>>>the quadrants.

    Note that each quadrant -- if by "quadrant" you mean "one
fourth" -- is still about 13.5 GB of data.  (Assuming "only"
fourteen billion 4-byte numbers.)

>>     Could you explain the nature of this search a little more?
>>Simply "searching for the Max" in a big collection of numbers
[quoted text clipped - 11 lines]
> sample testing. I then test out-of-sample to check projection error, but
> that's a whole other discussion!  HTH.

    You examine "over a time series," so that's one dimension of
the search.  But your search over x and y is apparently one-D also,
because you step both coordinates together to trace a "diagonal"
across the search area.  And then you "re"-test fourteen billion
times; what do you mean by "re"-test?  If this is the repetition,
what was the original (and why didn't it suffice)?

> I guess what you're saying is that I can discard all previous numbers each
> time I find a new maximum?  If so, what would be the computational tradeoff
> between storing the results and testing for the max result, as opposed to
> testing for the max result as I go?

    I'm withdrawing all advice on the grounds that the more you
explain what you're doing, the less I understand.  Good luck!

Signature

Eric Sosman
esosman@acm-dot-org.invalid

Christopher Smith - 27 Aug 2006 23:24 GMT
>> Eric Sosman <esosman@acm-dot-org.invalid> wrote:
>>>>[...]
[quoted text clipped - 29 lines]
> the data set is now to be understood as three-dimensional?  Or
> perhaps more?

I guess I misunderstood your question. I thought you wanted to know how
those numbers in the 120k x 120k grid were calculated.  What I described
was the process to get those results.

>>>     Amusing factoid: There are about 3.4 times as many fields
>>>as there are distinct `float' values.
[quoted text clipped - 14 lines]
> fourth" -- is still about 13.5 GB of data.  (Assuming "only"
> fourteen billion 4-byte numbers.)

Right.  I would need to break the quadrants up into quite a few pieces.

>>>     Could you explain the nature of this search a little more?
>>>Simply "searching for the Max" in a big collection of numbers
[quoted text clipped - 19 lines]
> times; what do you mean by "re"-test?  If this is the repetition,
> what was the original (and why didn't it suffice)?

Let me see if a picture can help:

this is for one iteration.  X ranges from -5 to 5 and Y ranges from -10 to
10.
             
Time:                   1        2        3        4        5        6        7
Sample:                      10    12    -10   -2    5     2     6    

Pos(x test)               T     T                      T        T        T
else(y test)                             T     F   

Cond X (samp > 6):      T        T                      T        F        T       
Cond Y (samp > -4):                      F        F       
---------------------------------------------------------------
Time Series Result:        10    12           -2   5           6

The time series mean would be (10,12,-2,5,6)/5 = 6.2
and the standard deviation would be 4.83.
The ratio would be 6.2/4.83, or 1.28.

This is what is saved in the grid, in this case at coordinates 6,-4.

Does this make sense?  I've left out quite a few details to make it
simpler.

>> I guess what you're saying is that I can discard all previous numbers
>> each time I find a new maximum?  If so, what would be the
[quoted text clipped - 3 lines]
>      I'm withdrawing all advice on the grounds that the more you
> explain what you're doing, the less I understand.  Good luck!

Quick followup question about mathematics in java.  What's faster:

x += 12 or x = x + 12.   Just wondered if there was any difference?

Thanks, Chris
Oliver Wong - 28 Aug 2006 21:11 GMT
> Let me see if a picture can help:
>
[quoted text clipped - 11 lines]
> ---------------------------------------------------------------
> Time Series Result:    10    12           -2   5           6

   You should probably avoid tabs for fixed-width ASCII tables.

> The time series mean would be (10,12,-2,5,6)/5 = 6.2
> and the standard deviation would be 4.83.
[quoted text clipped - 4 lines]
> Does this make sense?  I've left out quite a few details to make it
> simpler.

   It's not clear what "This" refers to in "This is what is saved in the
grid".

>>> I guess what you're saying is that I can discard all previous numbers
>>> each time I find a new maximum?  If so, what would be the
[quoted text clipped - 7 lines]
>
> x += 12 or x = x + 12.   Just wondered if there was any difference?

   A good JVM should treat the two equivalently. If you're looking for
optimizations, it's not at the right level.

   - Oliver


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.