About Us

Zone[_3_]

I have a column of numbers. I want to eliminate all data in the
column that is greater than 2.5 standard deviations from the mean,
leaving only the data that is within 2.5 StdDevs. I have a lot of
data, so I need to do this as quickly as possible so I can then get
statistical quartiles for the data. Any help on one or both issues
greatly appreciated! James

[email protected]

James,

There are many "sexy" ways to do it, but if you only have to do it
once or twice....

1) Calculate the mean and standard deviation
2) Calculate the upper and lower cutoff values
3) Sort the data
4) Manually delete those above/below the cutoff values

If this is something you will do on a regular basis, then it may be
worth pursing a more automated method.

HTH,
John

Zone[_3_]

John, Thank you for your reply. I am well-versed in VBA programming,
but a complete dunce at advanced mathematics like this. The problem
is that the data is very "raw" -- about 12,000 datapoints ranging from
0.22 to over 100,000. You probably know that this is the kind of mess
one sometimes gets from clients who don't really understand the ratios
that are being asked for. The objective is to pare the data down to
the "reasonable" values and get the bulk of it for subsequent figuring
of quartiles. May I use the Reply to Author button to reply to you
directly? James

On Jul 9, 1:04?pm, wrote:
James,

There are many "sexy" ways to do it, but if you only have to do it
once or twice....

1) Calculate the mean and standard deviation
2) Calculate the upper and lower cutoff values
3) Sort the data
4) Manually delete those above/below the cutoff values

If this is something you will do on a regular basis, then it may be
worth pursing a more automated method.

HTH,
John

[email protected]

James,

Sorry, that email doesn't exist anymore. The programming is fairly
straight forward. I'll do what I can to help you, but if it isn't
something you are going to do on a regular basis....manual is the way
to go.

John

Zone[_3_]

John, I understand. I'll plow away at the problem in here in hopes
that you will stick with me for a while and not get irritated with my
lack of mathematical expertise. I would like to use built-in Excel
functions as much as possible to speed the calculation. Your first
step, "Calculate the mean". Would I use Excel's AVERAGE function for
this? Say column M contains 12,410 numbers, starting on row 2. Would
I then use a formula such as =AVERAGE(m2:m12411) to get the mean? A
friend has told me that what is really wanted is the "threshold to the
second quartile", so I'm confused. James

On Jul 9, 1:48?pm, wrote:
James,

Sorry, that email doesn't exist anymore. The programming is fairly
straight forward. I'll do what I can to help you, but if it isn't
something you are going to do on a regular basis....manual is the way
to go.

John

Tom Ogilvy

What is the basis for excluding data that is beyond 2.5 Std Dev from the mean.

Why not use the built in quartiles worksheet function against the raw data
to get the quartiles from the raw data. If the data is incorrect, throwing
out the data you suggest doesn't seem to have much justification to me. You
need to get correct data. Nonetheless:

Yes
=Average($m$2:$M$121411)

would give you the average

You could use an adjacent empty column to do

Assume the data points are in M as you show, in R2 for example

=if(Abs(M2-Average($M$2:$M$12411))2.5*Stdev($m$2:$M$121411), "Delete","Keep")

then do an autofilter on column R and select Delete in the dropdown. Select
all the data except the first row and do Edit=Delete and select entirerow.
Note that once you delete the rows (and remove the filter), the formulas will
recalculate and new mean and standard deviation and more rows will show up as
Delete, but you can ignore these.

--
Regards,
Tom Ogilvy

then drag fill down to

"Zone" wrote:

John, I understand. I'll plow away at the problem in here in hopes
that you will stick with me for a while and not get irritated with my
lack of mathematical expertise. I would like to use built-in Excel
functions as much as possible to speed the calculation. Your first
step, "Calculate the mean". Would I use Excel's AVERAGE function for
this? Say column M contains 12,410 numbers, starting on row 2. Would
I then use a formula such as =AVERAGE(m2:m12411) to get the mean? A
friend has told me that what is really wanted is the "threshold to the
second quartile", so I'm confused. James

On Jul 9, 1:48?pm, wrote:
James,

Sorry, that email doesn't exist anymore. The programming is fairly
straight forward. I'll do what I can to help you, but if it isn't
something you are going to do on a regular basis....manual is the way
to go.

John

[email protected]

James,

I'm thinking like Tom....what is it exactly you are trying to get to?

Are you trying to eliminate outliers from your dataset?....all those
beyond 2.5 std devs. Or are you trying to find the 25th and 75th
quartiles?

John

GysdeJongh

"Zone" wrote in message
ups.com...

The problem
is that the data is very "raw" -- about 12,000 datapoints ranging from
0.22 to over 100,000. You probably know that this is the kind of mess
one sometimes gets from clients who don't really understand the ratios
that are being asked for.

Hi Zone,
if you want to use the 2.5 StDev rule for removing outliers than your data
should be normally distributed.Ratios are known to be notoriously skewed.

1) Do a transform ; a log- or arcsin transformation could be usefull
2) Do a Kolmogorov-Smirnov test to see if you now have a normal distribution
3) Do the outlier test on the new data

Use statistics software if you can , SAS , Statistica , R , S
hth
Gys

Thread Tools	Search this Thread
Show Printable Version	Search this Thread: Advanced Search
Display Modes
Switch to Linear Mode Hybrid Mode Switch to Threaded Mode

Menu

About Us