Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / June 2006

Tip: Looking for answers? Try searching our database.

2D - bivariate random number generatin

Thread view: 
John - 06 Jun 2006 07:09 GMT
Hi All, I have created the following class to uniformly generate
numbers according to a zipf distribution. How can I extend this program
so that I can generate two data points with the requirement that the
first data point is correlated with the first? ie. the secondpoint will
vary with the first? any ideas? thanks in advance.

import java.util.Random;

/*
    Basically a simple program to generate random
    variables which approximate a Zipf distribution
*/

public class basic_zipf
{
    /* Calculate the generalized harmonic number of order N.
    Basically, the sum of the reciprocals. */

    public final static void harmonic(double exponent, double[]
sum_reciprocals)
    {
        double sum = 0.0;

        for (int i = 1; i <= sum_reciprocals.length; i++)
        {
            sum += 1.0 / Math.pow(i,exponent);
           sum_reciprocals[i-1]=sum;
        }
    }

    public static void main(String[] args)
   {
       int numer_of_files=1000;

        /* random number generator */
        Random random=new Random();
        double exponent=1.0;

        /* get the harmonic number of order 30 */
        //double sum = harmonic(numer_of_files,exponent);

        /* lookup table for each reciprocal value 1 ... 30 */
        double reciprocals[]=new double[numer_of_files];

        /* calculate harmonic numbers */
        harmonic(exponent,reciprocals);

        double sum=reciprocals[numer_of_files-1];

        int[] variates=new int[numer_of_files];

        /* generate 10,000 random variables */
        for(int i=0;i<500000;i++)
        {
            // scale the random variable to the range [0,sum]
            double tmp=random.nextDouble()*sum;
            double range=0;

            // bin the scaled random number and output the bin number
            for(int j=0;j<numer_of_files;j++)
            {
                // find out which range the value is in
                double upper=reciprocals[j];

                if((tmp>=range) && (tmp<=upper))
                {
                    //System.out.println(j+1);
                    variates[j]++;
                    break;
                }
                else
                {
                    range=upper;
                }
            }
        }

        for(int i=0;i<variates.length;i++)
        {
            System.out.println((i+1)+","+variates[i]);
           
        }
    }
}
Patricia Shanahan - 06 Jun 2006 14:41 GMT
> Hi All, I have created the following class to uniformly generate
> numbers according to a zipf distribution. How can I extend this program
> so that I can generate two data points with the requirement that the
> first data point is correlated with the first? ie. the secondpoint will
> vary with the first? any ideas? thanks in advance.

Can you describe the conditional distribution of the second variable,
given the first?

Patricia
John - 06 Jun 2006 15:25 GMT
> > Hi All, I have created the following class to uniformly generate
> > numbers according to a zipf distribution. How can I extend this program
[quoted text clipped - 4 lines]
> Can you describe the conditional distribution of the second variable,
> given the first?

Yes I can (somewhat informaly), what I want to do is model web browsing
behaviour. So given a variable X say the popularity of a particular
file there will be a covariant random number specifying the size of the
file. The number will also vary according to a zipf distribution. I was
thinking of using a 2D array with the required densities but am unsure
as to how to proceed. The second variable the file size will be smaller
for a popular file and larger for an unpopular file. Any ideas ?
Oliver Wong - 06 Jun 2006 18:23 GMT
>> > Hi All, I have created the following class to uniformly generate
>> > numbers according to a zipf distribution. How can I extend this program
[quoted text clipped - 12 lines]
> as to how to proceed. The second variable the file size will be smaller
> for a popular file and larger for an unpopular file. Any ideas ?

   You're saying that if a file is popular, it's likely that it's small,
and vice versa? Is it exactly the case that the nth most popular file is the
nth smallest file?

   - Oliver
Patricia Shanahan - 06 Jun 2006 21:15 GMT
>>> Hi All, I have created the following class to uniformly generate
>>> numbers according to a zipf distribution. How can I extend this program
[quoted text clipped - 11 lines]
> as to how to proceed. The second variable the file size will be smaller
> for a popular file and larger for an unpopular file. Any ideas ?

I think you need to move beyond "somewhat informally" to nail down a
conditional distribution.

The 2D array idea amounts to storing the joint distribution for
popularity and size. It can be calculated from:

P(popularity==p && size==s)
  = P(size==s | popularity==p)*P(popularity==p).

You have a good choice of distribution for popularity. Whether it is a
good idea to pre-evaluate the joint distribution and store the
probabilities in an array depends on the nature of the conditional
distribution. If it is easy to evaluate on the fly, there is no need for
what could turn out to be an inconveniently large array. There are a lot
of web pages.

Patricia
John - 14 Jun 2006 06:39 GMT
> >>> Hi All, I have created the following class to uniformly generate
> >>> numbers according to a zipf distribution. How can I extend this program
[quoted text clipped - 27 lines]
> what could turn out to be an inconveniently large array. There are a lot
> of web pages.

I've been reading up on non-uniform random variate generation
techniques in the book "Non-Uniform Random Variate Generation" by Luc
Devroye. Anyway it seems that there is quite a steep learning curve.
Are you aware of any code examples that do something similar to what I
want ? I have found a number of examples that deal with bivariate
generation for normal distributed variables using a covariance matrix.
I just wonder if these are also applicable to what I want. Ideally I
want something that can be tuned in terms of correlation. thanks for
all your help.
Chris Uppal - 14 Jun 2006 09:38 GMT
> I've been reading up on non-uniform random variate generation
> techniques in the book "Non-Uniform Random Variate Generation" [...]

Wow!

(Sorry, couln't resist.  I'll leave it to someone else to provide a sensible
answer...)

   -- chris
Oliver Wong - 14 Jun 2006 14:57 GMT
>> I think you need to move beyond "somewhat informally" to nail down a
>> conditional distribution.
[quoted text clipped - 17 lines]
> Are you aware of any code examples that do something similar to what I
> want ?

   The problem is, it's not clear what you want. What does your application
do?

   - Oliver
John - 14 Jun 2006 23:51 GMT
> >> I think you need to move beyond "somewhat informally" to nail down a
> >> conditional distribution.
[quoted text clipped - 20 lines]
>     The problem is, it's not clear what you want. What does your application
> do?

OK. My application is a P2P caching web proxy and what I want to do is
load test the cache and get statistics such as file hit count, byte hit
count and the amount of data that has to be sent/recieved to/from an
origin server. I think the zipf distribution is a good choice of
distribution for web page popularity. But I would also like to
correlate web page popularity with file size. This correlation should
be adjustable between negative correlation, independence and positive
correlation. As such I can generate web page file requests with a given
popularity and file size which may or may not be dependent on the
popularity. I am still not 100 % convinced that there is a dependency
between file size and popularity but if there is I would like to cover
that base as well.
Patricia Shanahan - 15 Jun 2006 05:05 GMT
>>>> I think you need to move beyond "somewhat informally" to nail down a
>>>> conditional distribution.
[quoted text clipped - 32 lines]
> between file size and popularity but if there is I would like to cover
> that base as well.

I do have a truncated Zipf generator in Java, but it may not be suitable
for your purposes. I have a relatively small number of distinct items,
so I just calculate out the probability for each item, and then use a
discrete distribution generator.

For the page size, you still need to pick a distribution, and then
relate the distribution parameters to the page popularity.

Have you read
http://www.nslij-genetics.org/wli/zipf/breslau99.pdf? It looks relevant,
and although it is a few years old it may be a useful starting point for
finding papers that reference it. [Do you know how to use Citeseer?]

Patricia
Chris Uppal - 15 Jun 2006 11:20 GMT
> OK. My application is a P2P caching web proxy and what I want to do is
> load test the cache and get statistics such as file hit count, byte hit
> count and the amount of data that has to be sent/recieved to/from an
> origin server. I think the zipf distribution is a good choice of
> distribution for web page popularity.

OK, a fair hypothesis.  I think you've already said that you know how to do
that bit.

> But I would also like to
> correlate web page popularity with file size. This correlation should
[quoted text clipped - 4 lines]
> between file size and popularity but if there is I would like to cover
> that base as well.

But this seems iffy to me -- the statistical equivalent of over-engineering.
You don't have a good reason to suspect a correlation.  You don't know what the
correlation is, nor how it is parameterised.  But you still want to model it
"just in case"...

So start simple.  Have two file sizes, and either (a) choose randomly between
them, (b) choose the larger one for popular requests (all and only), (c) choose
the smaller one for popular requests (all and only).  If you want to elaborate
a little more, choose a file size with a randomly varying value around one or
other mean.  If you want to get more elaborate still;;, choose a file size with
a mean varying randomly around a mean derived from the popularity by some
simple formula.

No that's not statistically sound (or so I assume), but you are not in a
position to /be/ statistically sound -- since you have no data to model, nor a
theory to yield an analytic model.  So all you are doing -- all you /can/ do --
is a simple test to see if performance is obviously sensitive to such
correlations.  So, to borrow a phrase from XP, use the Simplest Correlation
That Could Possibly Work.

   -- chris


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.