
Java Forum / General / November 2005

Riddle me this

Sharp Tool - 06 Nov 2005 08:46 GMT
Hi

Consider this list of numbers:

12.0
5.0
1.0
-0.1
-2.1
-124.0

what algorithm to use to remove large negative values such as -124.0?
how to determine a cutoff value that is statistically meaningful?

So far i have:

cutoff = smallest positive - smallest difference in negative pairs
       = 1.0 - (2.1 - 0.1)
       = 1.0 - 2.0
       = -1.0

The problem is that this would eliminate -2.1!
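
The heuristic above can be checked mechanically. The following is only a sketch of it in Java; the class name `CutoffHeuristic` and the reading of "negative pairs" as adjacent negative values after sorting are assumptions, not something stated in the post.

```java
import java.util.Arrays;

// Hypothetical helper: reproduces the poster's heuristic, nothing more.
final class CutoffHeuristic {

    // cutoff = smallest positive value - smallest gap between adjacent negative values
    static double cutoff(double[] values) {
        double smallestPositive = Arrays.stream(values)
                .filter(v -> v > 0)
                .min()
                .orElseThrow(() -> new IllegalArgumentException("need a positive value"));
        double[] negatives = Arrays.stream(values).filter(v -> v < 0).sorted().toArray();
        if (negatives.length < 2) {
            throw new IllegalArgumentException("need at least two negative values");
        }
        double smallestGap = Double.POSITIVE_INFINITY;
        for (int i = 1; i < negatives.length; i++) {
            // negatives is sorted ascending, so each gap is non-negative
            smallestGap = Math.min(smallestGap, negatives[i] - negatives[i - 1]);
        }
        return smallestPositive - smallestGap;
    }

    public static void main(String[] args) {
        double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        double cutoff = cutoff(data);
        System.out.printf("cutoff = %.1f%n", cutoff);
        for (double v : data) {
            if (v < cutoff) {
                System.out.println("would be removed: " + v);
            }
        }
    }
}
```

Running it on the six numbers shows the weakness described: everything below the computed cutoff goes, not just the extreme value.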

Help appreciated.
Sharp Tool
Roedy Green - 06 Nov 2005 09:26 GMT
On Sun, 06 Nov 2005 08:46:17 GMT, "Sharp Tool"
<sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
who said :

>what algorithm to use to remove large negative values such as -124.0?
>how to determine a cutoff value that is statistically meaningful?

That is not usually a statistical question but a plausibility
question.  If you are scanning data for temperatures in Honolulu, you
would look at history, give yourself a safety factor, and chop below
and above a given range.

Readings for human temperatures would have a narrower range unless you
included corpses.
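
A minimal sketch of that kind of plausibility chop in Java. The class name, the method name and the bounds of -50 and 50 are illustrative assumptions; real bounds would come from knowledge of what is being measured, not from the data itself.

```java
import java.util.Arrays;

// Hypothetical range filter: keeps readings inside bounds chosen from domain knowledge.
final class PlausibilityFilter {

    static double[] keepPlausible(double[] readings, double min, double max) {
        return Arrays.stream(readings)
                .filter(r -> r >= min && r <= max)
                .toArray();
    }

    public static void main(String[] args) {
        double[] readings = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        // Assumed bounds, e.g. historical extremes widened by a safety factor
        double[] kept = keepPlausible(readings, -50.0, 50.0);
        System.out.println(Arrays.toString(kept));
    }
}
```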

If your numbers fit a normal bell shaped curve, you can compute the
mean and standard deviation. Then you could throw out numbers more
than n deviations from the mean.
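
As a rough sketch of that rule, and only on the assumption that the data really are close to normal, the filter might look like this in Java. `SigmaFilter` and its methods are hypothetical names, and the sample standard deviation (n - 1 denominator) is one choice among several.

```java
import java.util.Arrays;

// Hypothetical filter: keeps values within n sample standard deviations of the mean.
final class SigmaFilter {

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation; needs at least two values.
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (xs.length - 1));
    }

    static double[] keepWithin(double[] xs, double n) {
        double m = mean(xs);
        double s = sampleStdDev(xs);
        return Arrays.stream(xs)
                .filter(x -> Math.abs(x - m) <= n * s)
                .toArray();
    }

    public static void main(String[] args) {
        double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        System.out.println("n = 2: " + Arrays.toString(keepWithin(data, 2.0)));
        System.out.println("n = 3: " + Arrays.toString(keepWithin(data, 3.0)));
    }
}
```

With so few points, one extreme value drags both the mean and the deviation towards itself, so the result is sensitive to the choice of n.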


Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Thomas Hawtin - 06 Nov 2005 09:29 GMT
>  
> what algorithm to use to remove large negative values such as -124.0?
> how to determine a cutoff value that is statistically meaningful?

This newsgroup probably isn't the best place to find statisticians
(although I guess there are a few).

You could google for "outliers" or similar. "Grubbs' Test for Outliers"
seems like a step in the right direction.
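
For reference, the statistic behind Grubbs' test is G = max |x_i - mean| / s, where s is the sample standard deviation. The Java sketch below computes only G. The critical value to compare it against depends on the sample size and significance level and is assumed to be supplied by the caller from a published table; the class and method names are hypothetical. The test itself assumes approximately normal data.

```java
// Hypothetical helper: computes the Grubbs statistic G for a sample.
final class GrubbsStatistic {

    // G = max |x_i - mean| / s, with s the sample standard deviation (n - 1 denominator)
    static double g(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        double mean = sum / xs.length;

        double ss = 0;
        double maxDeviation = 0;
        for (double x : xs) {
            double d = x - mean;
            ss += d * d;
            maxDeviation = Math.max(maxDeviation, Math.abs(d));
        }
        double s = Math.sqrt(ss / (xs.length - 1));
        return maxDeviation / s;
    }

    // criticalValue is an assumption: the caller looks it up for their n and alpha.
    static boolean mostExtremeIsOutlier(double[] xs, double criticalValue) {
        return g(xs) > criticalValue;
    }

    public static void main(String[] args) {
        double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        System.out.printf("G = %.3f%n", g(data));
    }
}
```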

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

Sharp Tool - 07 Nov 2005 08:37 GMT
>> Sharp Tool wrote:
> >
[quoted text clipped - 8 lines]
>
> Tom Hawtin

Grubbs Test is only suitable for data that has a normal distribution - mine
does not.

Cheers
Sharp
Thomas G. Marshall - 09 Nov 2005 04:46 GMT
Thomas Hawtin coughed up:

>> what algorithm to use to remove large negative values such as -124.0?
>> how to determine a cutoff value that is statistically meaningful?
>
> This newsgroup probably isn't the best place to find statisticians
> (although I guess there are a few).

No, but comp.programming often has quite a few folks from many mathematics
related fields, statistics being one of them.

...[rip]...

Signature

I've seen this a few times--Don't make this mistake:

Dwight: "This thing is wildly available."
Smedly: "Did you mean wildly, or /widely/ ?"
Dwight: "Both!", said while nodding emphatically.

Dwight was exposed to have made a grammatical
error and tries to cover it up by thinking
fast.  This is so painfully obvious that he
only succeeds in looking worse.

Thomas Hawtin - 09 Nov 2005 10:27 GMT
> Thomas Hawtin coughed up:
>>
[quoted text clipped - 3 lines]
> No, but comp.programming often has quite a few folks from many mathematics
> related fields, statistics being one of them.

There's probably more people who could help with my PC problems. I
insert a KVR333X64C25/512 in the second slot of my eSys ePC Celeron-D
315 and it doesn't boot. Please advice.

Tom Hawtin, BSc (Hons) Mathematics
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

Chris Uppal - 09 Nov 2005 10:41 GMT
> There's probably more people who could help with my PC problems. I
> insert a KVR333X64C25/512 in the second slot of my eSys ePC Celeron-D
> 315 and it doesn't boot. Please advice.

I think there's a Jakarta commons project for diagnosing PC boot problems.  Or
maybe it's one of the incubator projects.  I can't remember the name off-hand,
but Google'l find it for you.

   -- chris

(Just joking, of course, but you take my point ?)
Roedy Green - 09 Nov 2005 11:29 GMT
On Wed, 09 Nov 2005 10:28:41 +0000, Thomas Hawtin
<usenet@tackline.plus.com> wrote, quoted or indirectly quoted someone
who said :

>There's probably more people who could help with my PC problems. I
>insert a KVR333X64C25/512 in the second slot of my eSys ePC Celeron-D
>315 and it doesn't boot. Please advice.

see http://mindprod.com/bgloss/cables.html#TREATING
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Sharp Tool - 10 Nov 2005 09:45 GMT
> > Thomas Hawtin coughed up:
> >>
[quoted text clipped - 9 lines]
>
> Tom Hawtin, BSc (Hons) Mathematics

Buy a new computer.

Sharp Tool
John C. Bollinger - 10 Nov 2005 12:20 GMT
> There's probably more people who could help with my PC problems. I
> insert a KVR333X64C25/512 in the second slot of my eSys ePC Celeron-D
> 315 and it doesn't boot. Please advice.

Um, take it back out?

Does the computer boot with only the new module (in either slot)?

Does the computer POST?  If so then you should be able to get into the
BIOS setup, where you should check whether the system recognizes the RAM
at all (if not, see first suggestion).  Choosing the option to reset the
BIOS to default settings may help, but sometimes all it takes is to get
into BIOS setup once in the first place, and then the computer sorts it out.

Does the first slot also have a 512MB module?  Some motherboards have
odd restrictions about the combinations of module sizes that are
allowed, or the order that the slots must be filled if the modules are
dissimilar.

Some RAM is simply incompatible with some system boards, despite having
the correct packaging for plugging it in.  If you can determine the
motherboard make and model (it may be stenciled somewhere on the board)
then you may be able to find a copy of its manual on line.  Referring to
that would be much better than diddling around trying random things.

Signature

John Bollinger
jobollin@indiana.edu

SDB - 06 Nov 2005 22:15 GMT
: Consider this list of numbers:
:
[quoted text clipped - 4 lines]
: -2.1
: -124.0

: what algorithm to use to remove large negative values such as -124.0?
: how to determine a cutoff value that is statistically meaningful?

: So far i have:

: cutoff = smallest positive - smallest difference in negative pairs
:        = 1.0 - (2.1 - 0.1)
:        = 1.0 - 2.0
:        = -1.0

How sophisticated do you need to be?  Consider using the absolute value so
you don't need to worry about positive or negative numbers.

If the numbers you gave are just an example and the problem you are trying
to solve is more generic, look at a statistical value called the 'Z-Score', also
sometimes called the 'Z-Value'.  It is computed by subtracting the mean from
the number, then dividing by the standard deviation of the set.  You can
throw out values outside a range of Z-scores.

From your set, the mean is -18.03 and the (sample) standard deviation is 52.15.

The z-Score of the second one, 5.0, is about 0.44
The z-Score of the last one, -124, is about -2.03

In stats, the z-Score is your friend.
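
A small Java sketch of the z-score calculation, for anyone who wants to run it on their own numbers. The names `ZScores`, `zScores` and `keepInRange` are made up for illustration, and the sample standard deviation (n - 1 denominator) is assumed because it gives the 52.15 figure for this set. The accepted range does not have to be symmetric, so a lower bound alone can be used to drop only large negative values.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical helper: z-scores and a filter on a (possibly one-sided) z-score range.
final class ZScores {

    // z_i = (x_i - mean) / sd; needs at least two values that are not all equal
    static double[] zScores(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        double mean = sum / xs.length;

        double ss = 0;
        for (double x : xs) ss += (x - mean) * (x - mean);
        double sd = Math.sqrt(ss / (xs.length - 1));

        double[] z = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            z[i] = (xs[i] - mean) / sd;
        }
        return z;
    }

    // Keep the values whose z-score lies in [zMin, zMax].
    static double[] keepInRange(double[] xs, double zMin, double zMax) {
        double[] z = zScores(xs);
        return IntStream.range(0, xs.length)
                .filter(i -> z[i] >= zMin && z[i] <= zMax)
                .mapToDouble(i -> xs[i])
                .toArray();
    }

    public static void main(String[] args) {
        double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        System.out.println(Arrays.toString(zScores(data)));
        // One-sided: only reject values with a z-score below -2
        System.out.println(Arrays.toString(
                keepInRange(data, -2.0, Double.POSITIVE_INFINITY)));
    }
}
```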
Sharp Tool - 07 Nov 2005 08:42 GMT
> : Consider this list of numbers:
> :
[quoted text clipped - 30 lines]
>
> In stats, the z-Score is your friend.

My data does not fit a normal distribution.
I do not want to eliminate any positive values.
I only want to eliminate large negative values.
Z-scores only work with absolute values.
So what's the best way to go now? I'm not a statistician.

Cheers
Sharp Tool
Roedy Green - 07 Nov 2005 08:59 GMT
On Mon, 07 Nov 2005 08:42:24 GMT, "Sharp Tool"
<sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
who said :

>My data does not fit a normal distribution.
>I do not want to eliminate any positive values.
>I only want to eliminate large negative values.
>Z-scores only work with absolute values.
>So what's the best way to go now? I'm not a statistician.

What distribution do they conform to?  
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Sharp Tool - 07 Nov 2005 09:19 GMT
> <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
> who said :
[quoted text clipped - 6 lines]
>
> What distribution do they conform to?

Random I believe.

Sharp Tool
Roedy Green - 07 Nov 2005 10:38 GMT
On Mon, 07 Nov 2005 09:19:19 GMT, "Sharp Tool"
<sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
who said :

>> What distribution do they conform to?
>
>Random I believe.

In that case you can't make a case for tossing any of them.   Keep in
mind even normal distributions are still random.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Sharp Tool - 07 Nov 2005 11:01 GMT
> <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
> who said :
[quoted text clipped - 5 lines]
> In that case you can't make a case for tossing any of them.   Keep in
> mind even normal distributions are still random.

You're right.
The distribution looks like a bell-shaped curve skewed to the left with an
initial plateau, then it slides to the right and then suddenly makes a sharp
dip down.
So I guess that's not really a normal distribution.

Sharp Tool
Roedy Green - 07 Nov 2005 11:32 GMT
On Mon, 07 Nov 2005 11:01:13 GMT, "Sharp Tool"
<sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
who said :

>The distribution looks like a bell-shaped curve skewed to the left with an
>initial plateau, then it slides to the right and then suddenly makes a sharp
>dip down.
>So I guess that's not really a normal distribution.

You may be able to analyse the physics of your readings to calculate
the expected distribution.

The classic shapes are not really clear until you have a lot of data.
You won't see the pattern with just 5 points.

This reminds me of something that happened when I was studying physics at
UBC circa 1968. We were doing a lab with an experiment that was
supposed to produce a normal distribution.  But it obviously wasn't.
The machine was broken.  Student after student complained, but were
dismissed as incompetents.  I keypunched the data and did a histogram
and produced it on the pen plotter -- a great novelty in that day.  

It clearly showed a camel hump.  The COMPUTER graph clinched it and
off the machine went for repair.  You can't do that as easily today.
Back then anything that came from a computer was treated as divine
revelation.

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Andrew Thompson - 07 Nov 2005 09:01 GMT
> My data does not fit a normal distribution.

What distribution/pattern/logic does it fit, because..

> I only want to eliminate large negative values.

..knowing that will lead a lot closer to defining
(pinning down, and putting a value to) 'large'.

Beyond the hypothetical though, does this describe
an actual problem, or is it purely a mental exercise?
Sharp Tool - 07 Nov 2005 09:17 GMT
> > My data does not fit a normal distribution.
>
[quoted text clipped - 4 lines]
> ..knowing that will lead a lot closer to defining
> (pinning down, and putting a value to) 'large'.

A large value is one that is an obvious outlier.
I only want to eliminate large negative values.
By eye-balling the list of numbers, you can see that -124.0
doesn't 'fit in'. I'm wondering if there is a statistical method for this.

> Beyond the hypothetical though, does this describe
> an actual problem, or is it purely a mental exercise?

Mental exercise, but i think it could be useful for removing
negative outliers.

Sharp Tool
Andrew Thompson - 07 Nov 2005 10:53 GMT
>>>My data does not fit a normal distribution.
>>
[quoted text clipped - 6 lines]
>
> A large value is one that is an obvious

Obvious to who?  What is the cut-off limit for 'obvious'?

You quote '-124.0', but what about '-74.2', or '-24.0'.

To me, even '-24' could be an 'obvious' outlier.
But without some form of 'confidence level' and a
mathematically definable group, we cannot even
determine exactly what constitutes a cut-off limit.

With such vague descriptions of what the group represents,
there is really no way to progress the problem.

>..outlier.
> I only want to eliminate large negative values.
[quoted text clipped - 5 lines]
>
> Mental exercise, ..

I'll leave you with it.
Sharp Tool - 07 Nov 2005 11:25 GMT
> >>>My data does not fit a normal distribution.
> >>
[quoted text clipped - 8 lines]
>
> Obvious to who?  What is the cut-off limit for 'obvious'?

Obvious to me when i look (eye balling) at the list of numbers i presented
in my first posting.

> You quote '-124.0', but what about '-74.2', or '-24.0'.

There is no -74.2 or -24 in my original list.
If there were, I would say -74.2 is possibly another negative outlier.
Again, there is no statistical backing for this.

> To me, even '-24' could be an 'obvious' outlier.
> But without some form of 'confidence level' and a
> mathematically definable group, we cannot even
> determine exactly what constitutes a cut-off limit.

'-24' does not seem like an obvious outlier to me.
Again, without some sort of statistics it's all subjective.

> With such vague descriptions of what the group represents,
> there is really no way to progress the problem.

Not sure what you mean by a mathematically definable group.
But I assume you mean the distribution of the data.
The confidence level would be the standard 95% in the statistical world.
The question is how to get a cutoff that will give me that confidence level.
Should one look at Z-scores (this was suggested) or some other statistical
parameter to establish a cutoff, or just look at the raw numbers to establish
a confidence level (this was also suggested)?

It's vague to you, Andrew, because it's not your area of expertise - not
because of my 'vague description'.

Sharp Tool
Andrew Thompson - 07 Nov 2005 11:52 GMT
> It's vague to you, Andrew, because it's not your area of expertise - not
> because of my 'vague description'.

Very sound assessment, coming from someone who first stated
the numbers had no 'normal distribution' and is now saying
it does, and that a confidence level of 95% 'sounds good'.

> Sharp Tool

[  Seems a little 'blunt' at the moment..   ;-) ]
Sharp Tool - 08 Nov 2005 08:28 GMT
> > It's vague to you, Andrew, because it's not your area of expertise - not
> > because of my 'vague description'.
>
> Very sound assessment, coming from someone who first stated
> the numbers had no 'normal distribution' and is now saying
> it does, and that a confidence level of 95% 'sounds good'.

I said I believed it had a normal distribution, and later clarified that it
didn't.
Andrew you have a real attitude problem.

> > Sharp Tool
>
> [  Seems a little 'blunt' at the moment..   ;-) ]

As blunt as your sense of humour.

Sharp Tool
Scott Ellsworth - 07 Nov 2005 20:17 GMT
Andrew Thompson wrote:
> > With such vague descriptions of what the group represents,
> > there is really no way to progress the problem.
>
> Not sure what you mean by mathematical definable group.
> But I assume you mean the distribution of the data.

Or, alternatively, the source of the data, and why you feel that a
cutoff of negative values should exist.

> It's vague to you, Andrew, because it's not your area of expertise - not
> because of my 'vague description'.

No, you were vague.  Every decent statistician I know, and I do know a
few, makes fairly precise statements about the data source, and why,
therefore, certain data can be assumed an outlier.

Bayesians seem to talk _only_ about their prior.

Yours could just as easily be a U[-124,12] as a normal, Poisson, or
exponential.  Without some reason to declare -124 an outlier, I would be
very wary of dropping a sixth of my data points.

Now if there are actually more lurking in there, then you might be able
to perform a reasonable test to determine the negative outlier cutoff.

Scott

Signature

Scott Ellsworth
scott@alodar.nospam.com
Java and database consulting for the life sciences

Chris Uppal - 07 Nov 2005 10:16 GMT
> > > what algorithm to use to remove large negative values such as -124.0?
> > > how to determine a cutoff value that is statistically meaningful?
[...]
> My data does not fit a normal distribution.
> I do not want to eliminate any positive values.
> I only want to eliminate large negative values.
[...]
> So whats the best way to go now? I'm not a statistician.

If you really mean that you want it to be "statistically meaningful" then you'll
have to talk to a statistician.  In order for that talk to be worthwhile you'll
need to know what distribution the numbers do follow (either as an analytic
description -- possibly an approximation -- or as empirical data).  You will
also need to know whether the distribution is identical on each run, or whether
it is parameterised in some way.  In the latter case the first part of the task
will be to estimate the parameters of the distribution based on the data from
that run (presumably including the positive values), then the second part of
the task will be eliminating data points that are "implausible" (in some fixed
sense) given the estimated distribution.

If the distribution is fixed across runs, then there is no need for the
curve-fitting step, and the question reduces to finding a single, fixed,
threshold beyond which data-points are unlikely to occur by natural chance, and
which can therefore be dismissed (with a certain confidence) as outliers.  In
this case you can run some experiments to find what value 95% (say) of negative
values lie above.  On subsequent runs, values lower than that can be rejected
as "implausible" (on the assumption that they are drawn from the same
underlying distribution as your test runs).  I'm not a statistician, so I don't
know whether you would be able to claim 95% confidence in this case, nor how to
quantify how much test data you would need (nor, indeed, how the two
interrelate).
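
The fixed-distribution case can be sketched in Java as an empirical lower quantile of the negative values seen in earlier runs. This is only an illustration under the same assumption made above, that later runs come from the same distribution. `EmpiricalCutoff`, its method names and the nearest-rank quantile rule are choices made for the sketch, and it makes no claim about the confidence level that results.

```java
import java.util.Arrays;

// Hypothetical helper: derives a fixed lower cutoff from calibration runs.
final class EmpiricalCutoff {

    // Nearest-rank quantile of the negative calibration values.
    // With tailFraction = 0.05, at least 95% of the negatives lie at or above the result.
    static double lowerCutoff(double[] calibration, double tailFraction) {
        if (tailFraction <= 0 || tailFraction > 1) {
            throw new IllegalArgumentException("tailFraction must be in (0, 1]");
        }
        double[] negatives = Arrays.stream(calibration)
                .filter(v -> v < 0)
                .sorted()
                .toArray();
        if (negatives.length == 0) {
            throw new IllegalArgumentException("no negative calibration values");
        }
        int rank = Math.max(1, (int) Math.ceil(tailFraction * negatives.length));
        return negatives[rank - 1];
    }

    // On later runs, reject anything below the fixed cutoff; positives are never touched.
    static double[] rejectBelow(double[] run, double cutoff) {
        return Arrays.stream(run).filter(v -> v >= cutoff).toArray();
    }

    public static void main(String[] args) {
        // Made-up calibration data: -1, -2, ..., -100
        double[] calibration = new double[100];
        for (int i = 0; i < calibration.length; i++) {
            calibration[i] = -(i + 1);
        }
        double cutoff = lowerCutoff(calibration, 0.05);
        System.out.println("cutoff = " + cutoff);
        System.out.println(Arrays.toString(
                rejectBelow(new double[]{12.0, -2.1, -124.0}, cutoff)));
    }
}
```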

Googling for
   outlier removal
shows up lots of promising looking hints.

OTOH, it might be simplest to punt the question to the user, and have a
configurable parameter.  If you do that then you should follow hallowed
practice and:

a) Bury the parameter in an XML file somewhere.  Read and write out the data on
each run so that no human-readable formatting is preserved.

b) Give the parameter as vague and ambiguous a name as possible.  In this case
you should ensure that neither the parameter name nor its documentation give
any hint as to whether the value is intended to be an absolute cut-off value,
the negation of an absolute cut-off, a high percentile threshold, a low
percentile threshold, or the absolute number of datapoints to reject.

c) Attempt to ensure that the default value is unsuitable for use in any
real-world application.

If you want to "go the extra mile" and work to the very highest professional
standards, then you should also:

d) Ensure that this behaviour is controlled by several parameters.  They should
be confusingly named (a reliable technique here is to give them names that are
the opposite of what they actually mean), and should interact in ways that are
neither obvious nor documented.  You should further ensure that sensible
results can only be achieved by setting one of the parameters explicitly (no
combination of the other parameters has the same effect), and mark that as
"deprecated" in /some/ of the documentation, whilst also making heavy use of it
in any examples.

   -- chris
Sharp Tool - 07 Nov 2005 10:47 GMT
> > > > what algorithm to use to remove large negative values such as -124.0?
> > > > how to determine a cutoff value that is statistically meaningful?
[quoted text clipped - 27 lines]
> quantify how much test data you would need (nor, indeed, how the two
> interrelate).

You sure seem to know a fair bit about statistics.
My question now is: how does one determine the distribution of data?
I haven't done much analysis, but I'd say it looks random.
A cutoff based on the value that 95% of negative values lie above
sounds good.
Looking at Google searches now.

Sharp Tool
Roedy Green - 07 Nov 2005 11:17 GMT
On Mon, 07 Nov 2005 10:47:02 GMT, "Sharp Tool"
<sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
who said :

>My question now is: how does one determine the distribution of data?

One way is to do a histogram.

If you see a bell shaped curve coming out, you likely have a normal
distribution.

Various other distributions have a characteristic shape.

One that comes up often  is called Poisson. It looks like a skewed
bell shaped curve with the right hand side stretched out.
see http://www.math.csusb.edu/faculty/stanton/probstat/poisson.html
The number of buses that arrive in a given hour might follow a Poisson
distribution; the wait between them would then be exponential.

Geometric is a falling off. see
http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#geomdistn
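
A histogram does not need a plotting library. Here is a bare-bones Java sketch that bins the values and prints a row of asterisks per bin; the class name, the equal-width binning and the range of -130 to 20 are assumptions chosen to fit the six example numbers.

```java
// Hypothetical text histogram with equal-width bins over [min, max].
final class TextHistogram {

    static int[] binCounts(double[] xs, double min, double max, int bins) {
        int[] counts = new int[bins];
        double width = (max - min) / bins;
        for (double x : xs) {
            if (x < min || x > max) {
                continue; // values outside the chosen range are ignored
            }
            // Clamp so that x == max lands in the last bin
            int i = Math.min(bins - 1, (int) ((x - min) / width));
            counts[i]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        double min = -130.0;
        double max = 20.0;
        int bins = 15;
        int[] counts = binCounts(data, min, max, bins);
        double width = (max - min) / bins;
        for (int i = 0; i < bins; i++) {
            StringBuilder bar = new StringBuilder();
            for (int j = 0; j < counts[i]; j++) {
                bar.append('*');
            }
            System.out.printf("%8.1f .. %8.1f | %s%n",
                    min + i * width, min + (i + 1) * width, bar);
        }
    }
}
```

With only a handful of points the bars say very little about shape; the picture becomes useful once there are enough values to fill most bins.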

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Sharp Tool - 07 Nov 2005 11:35 GMT
> On Mon, 07 Nov 2005 10:47:02 GMT, "Sharp Tool"
> <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
[quoted text clipped - 15 lines]
>
> Geometric is a falling off. see

> http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#geomdistn

That's a great link.
I plotted my data and it does look like a Poisson distribution.
But the large negative number makes it fall off a cliff.
Do none of these distributions include negative numbers?

Sharp Tool
Roedy Green - 07 Nov 2005 12:04 GMT
On Mon, 07 Nov 2005 11:35:27 GMT, "Sharp Tool"
<sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
who said :

>Do none of these distributions include negative numbers?

A normal is clustered about a mean, nominally 0, with symmetric tails
left and right.

Poisson is a distribution over non-negative integers (counts).

Just what do these numbers measure?
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

John C. Bollinger - 08 Nov 2005 02:17 GMT
> OTOH, it might be simplest to punt the question to the user, and have a
> configurable parameter.  If you do that then you should follow hallowed
> practice and:

[ROFL]

Shhhh!  You forgot to make him promise to use his knowledge only for
good!  :^)

Signature

John Bollinger
jobollin@indiana.edu




©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.