Java Forum / General / November 2005
Riddle me this
Sharp Tool - 06 Nov 2005 08:46 GMT Hi
Consider this list of numbers:
12.0 5.0 1.0 -0.1 -2.1 -124.0
what algorithm to use to remove large negative values such as -124.0? how to determine a cutoff value that is statistically meaningful?
So far i have:
cutoff = smallest positive - smallest difference in negative pairs = 1.0 - (2.1 - 0.1) = 1.0 - 2.0 = -1.0
The problem is that this would also eliminate -2.1!
Help appreciated. Sharp Tool
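For reference, the cutoff arithmetic in the question can be reproduced in a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and "smallest difference in negative pairs" is read here as the smallest gap between neighbouring negative values after sorting, which is an assumption about what was meant.

```java
import java.util.Arrays;

class GapCutoff {
    // cutoff = smallest positive value - smallest gap between neighbouring negatives
    // (hypothetical reading of the heuristic in the question)
    static double cutoff(double[] xs) {
        double smallestPositive = Arrays.stream(xs).filter(x -> x > 0).min().getAsDouble();
        double[] neg = Arrays.stream(xs).filter(x -> x < 0).sorted().toArray();
        double smallestGap = Double.POSITIVE_INFINITY;
        for (int i = 1; i < neg.length; i++) {
            smallestGap = Math.min(smallestGap, neg[i] - neg[i - 1]);
        }
        return smallestPositive - smallestGap;
    }

    // Keeps every value that is not below the cutoff.
    static double[] keepAtOrAbove(double[] xs, double cutoff) {
        return Arrays.stream(xs).filter(x -> x >= cutoff).toArray();
    }

    public static void main(String[] args) {
        double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        double c = cutoff(data);
        System.out.println("cutoff = " + c);
        System.out.println("kept   = " + Arrays.toString(keepAtOrAbove(data, c)));
    }
}
```

Running it on the six numbers shows the problem described: both -2.1 and -124.0 fall below the cutoff, so the heuristic is too aggressive for this list.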
Roedy Green - 06 Nov 2005 09:26 GMT On Sun, 06 Nov 2005 08:46:17 GMT, "Sharp Tool" <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone who said :
>what algorithm to use to remove large negative values such as -124.0?
>how to determine a cutoff value that is statistically meaningful?
That is not usually a statistical question but a plausibility question. If you are scanning data for temperatures of Honolulu you would look at history, give yourself a safety factor, and chop below and above a given range.
Readings for human temperatures would have a narrower range unless you included corpses.
If your numbers fit a normal bell shaped curve, you can compute the mean and standard deviation. Then you could throw out numbers more than n deviations from the mean.
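A minimal Java sketch of the "n deviations from the mean" rule follows. It assumes the data really are roughly normal, uses the sample standard deviation (dividing by n - 1), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SigmaFilter {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1).
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (xs.length - 1));
    }

    // Keeps only the values within n standard deviations of the mean.
    static List<Double> keepWithin(double[] xs, double n) {
        double m = mean(xs);
        double s = sampleStdDev(xs);
        List<Double> kept = new ArrayList<>();
        for (double x : xs) {
            if (Math.abs(x - m) <= n * s) kept.add(x);
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        System.out.println(keepWithin(data, 2.0));
    }
}
```

On the six numbers from the question, n = 2 drops only -124.0, while n = 3 drops nothing, so the choice of n matters a great deal with so few points.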
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Thomas Hawtin - 06 Nov 2005 09:29 GMT
> what algorithm to use to remove large negative values such as -124.0?
> how to determine a cutoff value that is statistically meaningful?
This newsgroup probably isn't the best place to find statisticians (although I guess there are a few).
You could google for "outliers" or similar. "Grubbs' Test for Outliers" seems like a step in the right direction.
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Sharp Tool - 07 Nov 2005 08:37 GMT
>> Sharp Tool wrote:
> [quoted text clipped - 8 lines]
> Tom Hawtin
Grubbs' test is only suitable for data that has a normal distribution - mine does not.
Cheers Sharp
Thomas G. Marshall - 09 Nov 2005 04:46 GMT Thomas Hawtin coughed up:
>> what algorithm to use to remove large negative values such as -124.0?
>> how to determine a cutoff value that is statistically meaningful?
> This newsgroup probably isn't the best place to find statisticians
> (although I guess there are a few).
No, but comp.programming often has quite a few folks from many mathematics-related fields, statistics being one of them.
...[rip]...
 Signature I've seen this a few times--Don't make this mistake:
Dwight: "This thing is wildly available." Smedly: "Did you mean wildly, or /widely/ ?" Dwight: "Both!", said while nodding emphatically.
Dwight was exposed to have made a grammatical error and tries to cover it up by thinking fast. This is so painfully obvious that he only succeeds in looking worse.
Thomas Hawtin - 09 Nov 2005 10:27 GMT
> Thomas Hawtin coughed up:
>> [quoted text clipped - 3 lines]
> No, but comp.programming often has quite a few folks from many mathematics
> related fields, statistics being one of them.
There's probably more people who could help with my PC problems. I insert a KVR333X64C25/512 in the second slot of my eSys ePC Celeron-D 315 and it doesn't boot. Please advice.
Tom Hawtin, BSc (Hons) Mathematics
Chris Uppal - 09 Nov 2005 10:41 GMT
> There's probably more people who could help with my PC problems. I
> insert a KVR333X64C25/512 in the second slot of my eSys ePC Celeron-D
> 315 and it doesn't boot. Please advice.
I think there's a Jakarta commons project for diagnosing PC boot problems. Or maybe it's one of the incubator projects. I can't remember the name off-hand, but Google'l find it for you.
-- chris
(Just joking, of course, but you take my point ?)
Roedy Green - 09 Nov 2005 11:29 GMT On Wed, 09 Nov 2005 10:28:41 +0000, Thomas Hawtin <usenet@tackline.plus.com> wrote, quoted or indirectly quoted someone who said :
>There's probably more people who could help with my PC problems. I
>insert a KVR333X64C25/512 in the second slot of my eSys ePC Celeron-D
>315 and it doesn't boot. Please advice.
see http://mindprod.com/bgloss/cables.html#TREATING j
Sharp Tool - 10 Nov 2005 09:45 GMT
> > Thomas Hawtin coughed up:
> >> [quoted text clipped - 9 lines]
> > Tom Hawtin, BSc (Hons) Mathematics
Buy a new computer.
Sharp Tool
John C. Bollinger - 10 Nov 2005 12:20 GMT
> There's probably more people who could help with my PC problems. I
> insert a KVR333X64C25/512 in the second slot of my eSys ePC Celeron-D
> 315 and it doesn't boot. Please advice.
Um, take it back out?
Does the computer boot with only the new module (in either slot)?
Does the computer POST? If so then you should be able to get into the BIOS setup, where you should check whether the system recognizes the RAM at all (if not, see first suggestion). Choosing the option to reset the BIOS to default settings may help, but sometimes all it takes is to get into BIOS setup once in the first place, and then the computer sorts it out.
Does the first slot also have a 512MB module? Some motherboards have odd restrictions about the combinations of module sizes that are allowed, or the order that the slots must be filled if the modules are dissimilar.
Some RAM is simply incompatible with some system boards, despite having the correct packaging for plugging it in. If you can determine the motherboard make and model (it may be stenciled somewhere on the board) then you may be able to find a copy of its manual online. Referring to that would be much better than diddling around trying random things.
 Signature John Bollinger jobollin@indiana.edu
SDB - 06 Nov 2005 22:15 GMT
: Consider this list of numbers:
: [quoted text clipped - 4 lines]
: -2.1
: -124.0
: what algorithm to use to remove large negative values such as -124.0?
: how to determine a cutoff value that is statistically meaningful?
: So far i have:
: cutoff = smallest positive - smallest difference in negative pairs
: = 1.0 - (2.1 - 0.1)
: = 1.0 - 2.0
: = -1.0
How sophisticated do you need to be? Consider using the absolute value so you don't need to worry about positive or negative numbers.
If the numbers you gave are just an example and the problem you are trying to solve is more generic, look at a statistical value called the 'z-score', sometimes also called the 'z-value'. It is computed by subtracting the mean from the number and then dividing by the standard deviation of the set. You can throw out values outside a chosen range of z-scores.
From your set, the mean is about -18.03 and the sample standard deviation is 52.15.
The z-score of the second one, 5.0, is about 0.44. The z-score of the last one, -124, is about -2.03.
In stats, the z-Score is your friend.
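The z-score is signed, so a one-sided rule (reject only values far below the mean) is possible. A small Java sketch, assuming the sample standard deviation and a purely hypothetical lower limit of -2.0:

```java
class ZScoreDemo {
    // Returns (x - mean) / sampleStdDev for every element of xs.
    static double[] zScores(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        double mean = sum / xs.length;
        double ss = 0.0;
        for (double x : xs) ss += (x - mean) * (x - mean);
        double sd = Math.sqrt(ss / (xs.length - 1)); // sample standard deviation
        double[] z = new double[xs.length];
        for (int i = 0; i < xs.length; i++) z[i] = (xs[i] - mean) / sd;
        return z;
    }

    public static void main(String[] args) {
        double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        double[] z = zScores(data);
        double lowerLimit = -2.0; // hypothetical one-sided cutoff, not a recommendation
        for (int i = 0; i < data.length; i++) {
            String flag = z[i] < lowerLimit ? "  <- below cutoff" : "";
            System.out.printf("%8.1f  z = %6.3f%s%n", data[i], z[i], flag);
        }
    }
}
```

Only -124.0 has a z-score below -2.0 here; the positive values are never candidates for removal under a lower-only rule. The rule still leans on the data being reasonably close to normal.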
Sharp Tool - 07 Nov 2005 08:42 GMT
> : Consider this list of numbers:
> : [quoted text clipped - 30 lines]
>
> In stats, the z-Score is your friend.
My data does not fit a normal distribution. I do not want to eliminate any positive values. I only want to eliminate large negative values. Z scores work only with absolute values. So what's the best way to go now? I'm not a statistician.
Cheers Sharp Tool
Roedy Green - 07 Nov 2005 08:59 GMT On Mon, 07 Nov 2005 08:42:24 GMT, "Sharp Tool" <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone who said :
>My data does not fit a normal distribution.
>I do not want to eliminate any positive values.
>I only want to eliminate large negative values.
>Z scores work only with absolute values.
>So what's the best way to go now? I'm not a statistician.
What distribution do they conform to?
Sharp Tool - 07 Nov 2005 09:19 GMT
> <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
> who said :
[quoted text clipped - 6 lines]
>
> What distribution do they conform to?
Random, I believe.
Sharp Tool
Roedy Green - 07 Nov 2005 10:38 GMT On Mon, 07 Nov 2005 09:19:19 GMT, "Sharp Tool" <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone who said :
>> What distribution do they conform to?
>
>Random I believe.
In that case you can't make a case for tossing any of them. Keep in mind even normal distributions are still random.
Sharp Tool - 07 Nov 2005 11:01 GMT
> <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
> who said :
[quoted text clipped - 5 lines]
> In that case you can't make a case for tossing any of them. Keep in
> mind even normal distributions are still random.
You're right. The distribution looks like a bell-shaped curve skewed to the left, with an initial plateau; then it slides to the right and then suddenly makes a sharp dip down. So I guess that's not really a normal distribution.
Sharp Tool
Roedy Green - 07 Nov 2005 11:32 GMT On Mon, 07 Nov 2005 11:01:13 GMT, "Sharp Tool" <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone who said :
>The distribution looks like a bell shape curve skewed to the left with an
>initial plateau then it slides to the right and then suddenly makes a sharp
>dip down.
>so i guess thats not really a normal distribution.
You may be able to analyse the physics of your readings to calculate the expected distribution.
The classic shapes are not really clear until you have a lot of data. You won't see the pattern with just six points.
This reminds me of something that happened when I was studying physics at UBC circa 1968. We were doing a lab with an experiment that was supposed to produce a normal distribution. But it obviously wasn't. The machine was broken. Student after student complained, but they were dismissed as incompetents. I keypunched the data, did a histogram and produced it on the pen plotter -- a great novelty in that day.
It clearly showed a camel hump. The COMPUTER graph clinched it and off the machine went for repair. You can't do that as easily today. Back then anything that came from a computer was treated as divine revelation.
Andrew Thompson - 07 Nov 2005 09:01 GMT
> My data does not fit a normal distribution.
What distribution/pattern/logic does it fit, because..
> I only want to eliminate large negative values.
..knowing that will get us a lot closer to defining (pinning down, and putting a value to) 'large'.
Beyond the hypothetical though, does this describe an actual problem, or is it purely a mental exercise?
Sharp Tool - 07 Nov 2005 09:17 GMT
> > My data does not fit a normal distribution.
> [quoted text clipped - 4 lines]
> ..knowing that will lead to a lot closer to defining
> (pinning down, and putting a value to) 'large'.
A large value is one that is an obvious outlier. I only want to eliminate large negative values. By eyeballing the list of numbers, you can see that -124.0 doesn't 'fit in'. I am wondering whether there is a statistical method for this.
> Beyond the hypothetical though, does this describe
> an actual problem, or is it purely a mental exercise?
Mental exercise, but I think it could be useful for removing negative outliers.
Sharp Tool
Andrew Thompson - 07 Nov 2005 10:53 GMT
>>>My data does not fit a normal distribution.
>> [quoted text clipped - 6 lines]
>
> A large value is one that is an obvious
Obvious to who? What is the cut-off limit for 'obvious'?
You quote '-124.0', but what about '-74.2', or '-24.0'.
To me, even '-24' could be an 'obvious' outlier. But without some form of 'confidence level' and a mathematically definable group, we cannot even determine exactly what constitutes a cut-off limit.
With such vague descriptions of what the group represents, there is really no way to progress the problem.
>..outlier.
> I only want to eliminate large negative values.
[quoted text clipped - 5 lines]
>
> Mental exercise, ..
I'll leave you with it.
Sharp Tool - 07 Nov 2005 11:25 GMT
> >>>My data does not fit a normal distribution.
> >> [quoted text clipped - 8 lines]
>
> Obvious to who? What is the cut-off limit for 'obvious'?
Obvious to me when I look at (eyeball) the list of numbers I presented in my first posting.
> You quote '-124.0', but what about '-74.2', or '-24.0'.
There is no -74.2 or -24 in my original list. If there were, I would say -74.2 is possibly another negative outlier. Again, there is no statistical backing for this.
> To me, even '-24' could be an 'obvious' outlier.
> But without some form of 'confidence level' and a
> mathematically definable group, we cannot even
> determine exactly what constitutes a cut-off limit.
'-24' does not seem like an obvious outlier to me. Again, without some sort of statistics it's all subjective.
> With such vague descriptions of what the group represents,
> there is really no way to progress the problem.
Not sure what you mean by mathematically definable group, but I assume you mean the distribution of the data. The confidence level would be the standard 95% in the statistical world. The question is how to get a cutoff that will give me that confidence level. Should one look at Z scores (this was suggested) or some other statistical parameter to establish a cutoff, or just look at raw numbers to establish a confidence level (this was suggested)?
It's vague to you, Andrew, because it's not your area of expertise - not because of my 'vague description'.
Sharp Tool
Andrew Thompson - 07 Nov 2005 11:52 GMT
> Its vague to you Andrew because its not your area of expertise - not my
> 'vague description'.
Very sound assessment, coming from someone who first stated the numbers had no 'normal distribution' and is now saying it does, and that a confidence level of 95% 'sounds good'.
> Sharp Tool
[ Seems a little 'blunt' at the moment.. ;-) ]
Sharp Tool - 08 Nov 2005 08:28 GMT
> > Its vague to you Andrew because its not your area of expertise - not my
> > 'vague description'.
>
> Very sound assessment, coming from someone who first stated
> the numbers had no 'normal distribution' and is now saying
> it does, and that a confidence level of 95% 'sounds good'.
I said I believed it had a normal distribution, and I later clarified that it didn't. Andrew, you have a real attitude problem.
> > Sharp Tool
>
> [ Seems a little 'blunt' at the moment.. ;-) ]
As blunt as your sense of humour.
Sharp Tool
Scott Ellsworth - 07 Nov 2005 20:17 GMT Andrew Thompson wrote:
> > With such vague descriptions of what the group represents,
> > there is really no way to progress the problem.
>
> Not sure what you mean by mathematical definable group.
> But I assume you mean the distribution of the data.
Or, alternatively, the source of the data, and why you feel that a cutoff of negative values should exist.
> Its vague to you Andrew because its not your area of expertise - not my
> 'vague description'.
No, you were vague. Every decent statistician I know, and I do know a few, makes fairly precise statements about the data source, and why, therefore, certain data can be assumed an outlier.
Bayesians seem to talk _only_ about their prior.
Yours could just as easily be a U[-124,12] as a normal, Poisson, or exponential. Without some reason to declare -124 an outlier, I would be very wary of dropping a sixth of my data points.
Now if there are actually more lurking in there, then you might be able to perform a reasonable test to determine the negative outlier cutoff.
Scott
 Signature Scott Ellsworth scott@alodar.nospam.com Java and database consulting for the life sciences
Chris Uppal - 07 Nov 2005 10:16 GMT
> > > what algorithm to use to remove large negative values such as -124.0?
> > > how to determine a cutoff value that is statistically meaningful?
[...]
> My data does not fit a normal distribution.
> I do not want to eliminate any positive values.
> I only want to eliminate large negative values.
[...]
> So whats the best way to go now? I'm not a statistician.
If you really mean that you want it to be "statistically meaningful" then you'll have to talk to a statistician. In order for that talk to be worthwhile you'll need to know what distribution the numbers do follow (either as an analytic description -- possibly an approximation -- or as empirical data). You will also need to know whether the distribution is identical on each run, or whether it is parameterised in some way. In the latter case the first part of the task will be to estimate the parameters of the distribution based on the data from that run (presumably including the positive values), then the second part of the task will be eliminating data points that are "implausible" (in some fixed sense) given the estimated distribution.
If the distribution is fixed across runs, then there is no need for the curve-fitting step, and the question reduces to finding a single, fixed threshold beyond which data points are unlikely to occur by natural chance, and which can therefore be dismissed (with a certain confidence) as outliers. In this case you can run some experiments to find what value 95% (say) of negative values lie above. On subsequent runs, values lower than that can be rejected as "implausible" (on the assumption that they are drawn from the same underlying distribution as your test runs). I'm not a statistician, so I don't know whether you would be able to claim 95% confidence in this case, nor how to quantify how much test data you would need (nor, indeed, how the two interrelate).
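The fixed-distribution case can be sketched in Java as follows. The calibration values are invented purely for illustration, the class and method names are hypothetical, and the nearest-rank way of picking the threshold is just one simple choice; whether the result deserves to be called "95% confidence" is exactly the open question raised above.

```java
import java.util.Arrays;

class EmpiricalCutoff {
    // Returns a threshold such that at least `fraction` of the negative
    // calibration values are at or above it (simple nearest-rank choice).
    static double lowerCutoff(double[] calibration, double fraction) {
        double[] neg = Arrays.stream(calibration).filter(x -> x < 0).sorted().toArray();
        if (neg.length == 0) throw new IllegalArgumentException("no negative values");
        int keep = (int) Math.ceil(fraction * neg.length); // how many negatives to retain
        return neg[neg.length - keep];
    }

    // Rejects values from a later run that fall below the threshold.
    static double[] rejectBelow(double[] xs, double cutoff) {
        return Arrays.stream(xs).filter(x -> x >= cutoff).toArray();
    }

    public static void main(String[] args) {
        // Hypothetical calibration data from earlier runs: -1, -2, ..., -20 plus two positives.
        double[] calibration = new double[22];
        for (int i = 0; i < 20; i++) calibration[i] = -(i + 1);
        calibration[20] = 3.0;
        calibration[21] = 7.5;

        double cutoff = lowerCutoff(calibration, 0.95);
        double[] newRun = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        System.out.println("cutoff = " + cutoff);
        System.out.println("kept   = " + Arrays.toString(rejectBelow(newRun, cutoff)));
    }
}
```

With this made-up calibration set the threshold lands at -19.0, so a later -124.0 is rejected while -2.1 survives. Positive values are never touched.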
Googling for outlier removal shows up lots of promising looking hints.
OTOH, it might be simplest to punt the question to the user, and have a configurable parameter. If you do that then you should follow hallowed practice and:
a) Bury the parameter in an XML file somewhere. Read and write out the data on each run so that no human-readable formatting is preserved.
b) Give the parameter as vague and ambiguous a name as possible. In this case you should ensure that neither the parameter name nor its documentation give any hint as to whether the value is intended to be an absolute cut-off value, the negation of an absolute cut-off, a high percentile threshold, a low percentile threshold, or the absolute number of datapoints to reject.
c) Attempt to ensure that the default value is unsuitable for use in any real-world application.
If you want to "go the extra mile" and work to the very highest professional standards, then you should also:
d) Ensure that this behaviour is controlled by several parameters. They should be confusingly named (a reliable technique here is to give them names that are the opposite of what they actually mean), and should interact in ways that are neither obvious nor documented. You should further ensure that sensible results can only be achieved by setting one of the parameters explicitly (no combination of the other parameters has the same effect), and mark that as "deprecated" in /some/ of the documentation, whilst also making heavy use of it in any examples.
-- chris
Sharp Tool - 07 Nov 2005 10:47 GMT
> > > > what algorithm to use to remove large negative values such as -124.0?
> > > > how to determine a cutoff value that is statistically meaningful?
[quoted text clipped - 27 lines]
> quantify how much test data you would need (nor, indeed, how the two
> interrelate).
You sure seem to know a fair bit about statistics. My question now is: how does one determine the distribution of data? I haven't done much analysis, but I'd say it looks random. A cutoff that 95% of the negative values lie above sounds good. Looking at Google searches now.
Sharp Tool
Roedy Green - 07 Nov 2005 11:17 GMT On Mon, 07 Nov 2005 10:47:02 GMT, "Sharp Tool" <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone who said :
>My questions is now, how does one determine the distribution of data?
One way is to do a histogram.
If you see a bell shaped curve coming out, you likely have a normal distribution.
Various other distributions have a characteristic shape.
One that comes up often is called Poisson. It looks like a skewed bell-shaped curve with the right-hand side stretched out. See http://www.math.csusb.edu/faculty/stanton/probstat/poisson.html
How many buses arrive in a fixed interval might follow a Poisson distribution.
Geometric is a falling off. see http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#geomdistn
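The histogram suggested above needs nothing more than counting values per bin. Here is a minimal text-only Java sketch; the bin range (-125 to 15) and the bin count (7) are arbitrary choices made for the six numbers in the question, and the names are hypothetical.

```java
class TextHistogram {
    // Counts how many values fall into each of `bins` equal-width bins spanning [min, max].
    static int[] binCounts(double[] xs, double min, double max, int bins) {
        int[] counts = new int[bins];
        double width = (max - min) / bins;
        for (double x : xs) {
            if (x < min || x > max) continue;   // ignore values outside the range
            int i = (int) ((x - min) / width);
            if (i == bins) i = bins - 1;        // a value equal to max goes in the last bin
            counts[i]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        double min = -125.0, max = 15.0;        // arbitrary range chosen to cover the data
        int bins = 7;
        int[] counts = binCounts(data, min, max, bins);
        double width = (max - min) / bins;
        for (int i = 0; i < bins; i++) {
            StringBuilder bar = new StringBuilder();
            for (int k = 0; k < counts[i]; k++) bar.append('*');
            double lo = min + i * width;
            System.out.printf("[%7.1f, %7.1f) %s%n", lo, lo + width, bar);
        }
    }
}
```

With only six values the picture is just one lonely star on the left and five stacked on the right, which illustrates why a recognisable shape needs far more data.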
Sharp Tool - 07 Nov 2005 11:35 GMT
> On Mon, 07 Nov 2005 10:47:02 GMT, "Sharp Tool"
> <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone
[quoted text clipped - 15 lines]
>
> Geometric is a falling off. see http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#geomdistn
That's a great link. I plotted my data and it does look like a Poisson distribution. But the large negative number makes it fall off a cliff. Do none of these distributions include negative numbers?
Sharp Tool
Roedy Green - 07 Nov 2005 12:04 GMT On Mon, 07 Nov 2005 11:35:27 GMT, "Sharp Tool" <sharp.tool@bigpond.net.au> wrote, quoted or indirectly quoted someone who said :
>All these distribution dont include negative numbers?
A normal is clustered about a mean, nominally 0, with symmetric tails left and right.
Poisson is a distribution over non-negative integers.
Just what do these numbers measure?
John C. Bollinger - 08 Nov 2005 02:17 GMT
> OTOH, it might be simplest to punt the question to the user, and have a
> configurable parameter. If you do that then you should follow hallowed
> practice and:
[ROFL]
Shhhh! You forgot to make him promise to use his knowledge only for good! :^)