> >> I think you need to move beyond "somewhat informally" to nail down a
> >> conditional distribution.
[quoted text clipped - 20 lines]
> The problem is, it's not clear what you want. What does your application
> do?
OK. My application is a P2P caching web proxy and what I want to do is
load test the cache and get statistics such as file hit count, byte hit
count and the amount of data that has to be sent/received to/from an
origin server. I think the Zipf distribution is a good choice of
distribution for web page popularity. But I would also like to
correlate web page popularity with file size. This correlation should
be adjustable between negative correlation, independence and positive
correlation. As such I can generate web page file requests with a given
popularity and file size which may or may not be dependent on the
popularity. I am still not 100% convinced that there is a dependency
between file size and popularity but if there is I would like to cover
that base as well.
Patricia Shanahan - 15 Jun 2006 05:05 GMT
>>>> I think you need to move beyond "somewhat informally" to nail down a
>>>> conditional distribution.
[quoted text clipped - 32 lines]
> between file size and popularity but if there is I would like to cover
> that base as well.
I do have a truncated Zipf generator in Java, but it may not be suitable
for your purposes. I have a relatively small number of distinct items,
so I just calculate out the probability for each item, and then use a
discrete distribution generator.
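A minimal sketch of that approach follows, for readers who want something concrete. It is not the generator described above: the class name TruncatedZipf, the exponent parameter s and the binary-search lookup are all assumptions made for illustration. It computes the probability of each of n ranks (proportional to 1/k^s), stores the cumulative sums, and samples by inverting the table.

```java
import java.util.Random;

/** Hypothetical truncated Zipf sampler over ranks 1..n (rank 1 is the most popular). */
final class TruncatedZipf {
    private final double[] cdf; // cdf[i] = P(rank <= i + 1)
    private final Random rng;

    TruncatedZipf(int n, double s, Random rng) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        this.rng = rng;
        this.cdf = new double[n];
        double sum = 0.0;
        for (int k = 1; k <= n; k++) {
            sum += 1.0 / Math.pow(k, s); // unnormalised weight of rank k
            cdf[k - 1] = sum;
        }
        for (int i = 0; i < n; i++) {
            cdf[i] /= sum;               // normalise so the weights sum to 1
        }
        cdf[n - 1] = 1.0;                // guard against rounding error
    }

    /** Draws a rank in 1..n by finding the first table entry greater than u. */
    int nextRank() {
        double u = rng.nextDouble();     // uniform in [0, 1)
        int lo = 0;
        int hi = cdf.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] > u) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo + 1;
    }
}
```

The table costs O(n) memory and each draw costs O(log n), which is only reasonable when the number of distinct items is modest, as in the case described above.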
For the page size, you still need to pick a distribution, and then
relate the distribution parameters to the page popularity.
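One possible way to do this, purely as an illustration: assume page sizes are lognormal and let the mean of ln(size) move linearly with ln(rank). The lognormal choice, the class name RankDependentSize and the parameters baseMu, sigma and beta are all assumptions, not something established in this thread. With beta = 0 size is independent of popularity, beta > 0 makes less popular pages larger, and beta < 0 makes them smaller.

```java
import java.util.Random;

/** Hypothetical size model: ln(size) ~ Normal(baseMu + beta * ln(rank), sigma). */
final class RankDependentSize {
    private final double baseMu; // mean of ln(size in bytes) for rank 1
    private final double sigma;  // standard deviation of ln(size)
    private final double beta;   // strength and sign of the dependence on rank
    private final Random rng;

    RankDependentSize(double baseMu, double sigma, double beta, Random rng) {
        this.baseMu = baseMu;
        this.sigma = sigma;
        this.beta = beta;
        this.rng = rng;
    }

    /** Draws a size in bytes for a page of the given popularity rank (1 = most popular). */
    long sizeFor(int rank) {
        if (rank < 1) {
            throw new IllegalArgumentException("rank must be >= 1");
        }
        double mu = baseMu + beta * Math.log(rank);
        double lnSize = mu + sigma * rng.nextGaussian();
        return Math.max(1L, Math.round(Math.exp(lnSize)));
    }
}
```

Each call draws a fresh value, so for a cache test you would probably call sizeFor once per page and remember the result, so that a page keeps the same size across requests.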
Have you read
http://www.nslij-genetics.org/wli/zipf/breslau99.pdf? It looks relevant,
and although it is a few years old it may be a useful starting point for
finding papers that reference it. [Do you know how to use Citeseer?]
Patricia
Chris Uppal - 15 Jun 2006 11:20 GMT
> OK. My application is a P2P caching web proxy and what I want to do is
> load test the cache and get statistics such as file hit count, byte hit
> count and the amount of data that has to be sent/received to/from an
> origin server. I think the Zipf distribution is a good choice of
> distribution for web page popularity.
OK, a fair hypothesis. I think you've already said that you know how to do
that bit.
> But I would also like to
> correlate web page popularity with file size. This correlation should
[quoted text clipped - 4 lines]
> between file size and popularity but if there is I would like to cover
> that base as well.
But this seems iffy to me -- the statistical equivalent of over-engineering.
You don't have a good reason to suspect a correlation. You don't know what the
correlation is, nor how it is parameterised. But you still want to model it
"just in case"...
So start simple. Have two file sizes, and either (a) choose randomly between
them, (b) choose the larger one for popular requests (all and only), (c) choose
the smaller one for popular requests (all and only). If you want to elaborate
a little more, choose a file size with a randomly varying value around one or
other mean. If you want to get more elaborate still, choose a file size with
a mean varying randomly around a mean derived from the popularity by some
simple formula.
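The simplest version, with two file sizes and three modes, might look like the sketch below. The names TwoSizeModel, Mode and popularCutoff are hypothetical, and treating "popular" as "rank at or below a cutoff" is an assumption made only for this example. Sizes are fixed per page when the model is built, so a page keeps the same size across requests.

```java
import java.util.Random;

/** Hypothetical two-size model: every page is either small or large. */
final class TwoSizeModel {
    enum Mode { INDEPENDENT, POPULAR_LARGE, POPULAR_SMALL }

    private final long[] sizeByRank; // index 0 holds the size of rank 1

    TwoSizeModel(int pages, int popularCutoff, long smallBytes, long largeBytes,
                 Mode mode, Random rng) {
        sizeByRank = new long[pages];
        for (int rank = 1; rank <= pages; rank++) {
            boolean popular = rank <= popularCutoff; // assumed definition of "popular"
            long size;
            switch (mode) {
                case POPULAR_LARGE: // (b) larger size for popular pages, all and only
                    size = popular ? largeBytes : smallBytes;
                    break;
                case POPULAR_SMALL: // (c) smaller size for popular pages, all and only
                    size = popular ? smallBytes : largeBytes;
                    break;
                default:            // (a) choose randomly, ignoring popularity
                    size = rng.nextBoolean() ? largeBytes : smallBytes;
                    break;
            }
            sizeByRank[rank - 1] = size;
        }
    }

    /** Returns the size in bytes of the page with the given rank (1 = most popular). */
    long sizeOf(int rank) {
        return sizeByRank[rank - 1];
    }
}
```

Running the same request stream through all three modes and comparing file hit count and byte hit count should be enough to show whether the cache is obviously sensitive to the correlation.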
No, that's not statistically sound (or so I assume), but you are not in a
position to /be/ statistically sound -- since you have no data to model, nor a
theory to yield an analytic model. So all you are doing -- all you /can/ do --
is a simple test to see if performance is obviously sensitive to such
correlations. So, to borrow a phrase from XP, use the Simplest Correlation
That Could Possibly Work.
-- chris