#1   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 373
Default 2.5 StdDevs

I have a column of numbers. I want to eliminate all data in the
column that is greater than 2.5 standard deviations from the mean,
leaving only the data that is within 2.5 StdDevs. I have a lot of
data, so I need to do this as quickly as possible so I can then get
statistical quartiles for the data. Any help on one or both issues
greatly appreciated! James

  #2   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 43
Default 2.5 StdDevs

James,

There are many "sexy" ways to do it, but if you only have to do it
once or twice....

1) Calculate the mean and standard deviation
2) Calculate the upper and lower cutoff values
3) Sort the data
4) Manually delete those above/below the cutoff values

If this is something you will do on a regular basis, then it may be
worth pursing a more automated method.

HTH,
John

  #3   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 373
Default 2.5 StdDevs

John, Thank you for your reply. I am well-versed in VBA programming,
but a complete dunce at advanced mathematics like this. The problem
is that the data is very "raw" -- about 12,000 datapoints ranging from
0.22 to over 100,000. You probably know that this is the kind of mess
one sometimes gets from clients who don't really understand the ratios
that are being asked for. The objective is to pare the data down to
the "reasonable" values and get the bulk of it for subsequent figuring
of quartiles. May I use the Reply to Author button to reply to you
directly? James

On Jul 9, 1:04?pm, wrote:
James,

There are many "sexy" ways to do it, but if you only have to do it
once or twice....

1) Calculate the mean and standard deviation
2) Calculate the upper and lower cutoff values
3) Sort the data
4) Manually delete those above/below the cutoff values

If this is something you will do on a regular basis, then it may be
worth pursing a more automated method.

HTH,
John



  #4   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 43
Default 2.5 StdDevs

James,

Sorry, that email doesn't exist anymore. The programming is fairly
straight forward. I'll do what I can to help you, but if it isn't
something you are going to do on a regular basis....manual is the way
to go.

John


  #5   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 373
Default 2.5 StdDevs

John, I understand. I'll plow away at the problem in here in hopes
that you will stick with me for a while and not get irritated with my
lack of mathematical expertise. I would like to use built-in Excel
functions as much as possible to speed the calculation. Your first
step, "Calculate the mean". Would I use Excel's AVERAGE function for
this? Say column M contains 12,410 numbers, starting on row 2. Would
I then use a formula such as =AVERAGE(m2:m12411) to get the mean? A
friend has told me that what is really wanted is the "threshold to the
second quartile", so I'm confused. James

On Jul 9, 1:48?pm, wrote:
James,

Sorry, that email doesn't exist anymore. The programming is fairly
straight forward. I'll do what I can to help you, but if it isn't
something you are going to do on a regular basis....manual is the way
to go.

John





  #6   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 6,953
Default 2.5 StdDevs

What is the basis for excluding data that is beyond 2.5 Std Dev from the mean.

Why not use the built in quartiles worksheet function against the raw data
to get the quartiles from the raw data. If the data is incorrect, throwing
out the data you suggest doesn't seem to have much justification to me. You
need to get correct data. Nonetheless:

Yes
=Average($m$2:$M$121411)

would give you the average

You could use an adjacent empty column to do

Assume the data points are in M as you show, in R2 for example

=if(Abs(M2-Average($M$2:$M$12411))2.5*Stdev($m$2:$M$121411), "Delete","Keep")

then do an autofilter on column R and select Delete in the dropdown. Select
all the data except the first row and do Edit=Delete and select entirerow.
Note that once you delete the rows (and remove the filter), the formulas will
recalculate and new mean and standard deviation and more rows will show up as
Delete, but you can ignore these.

--
Regards,
Tom Ogilvy


then drag fill down to



"Zone" wrote:

John, I understand. I'll plow away at the problem in here in hopes
that you will stick with me for a while and not get irritated with my
lack of mathematical expertise. I would like to use built-in Excel
functions as much as possible to speed the calculation. Your first
step, "Calculate the mean". Would I use Excel's AVERAGE function for
this? Say column M contains 12,410 numbers, starting on row 2. Would
I then use a formula such as =AVERAGE(m2:m12411) to get the mean? A
friend has told me that what is really wanted is the "threshold to the
second quartile", so I'm confused. James

On Jul 9, 1:48?pm, wrote:
James,

Sorry, that email doesn't exist anymore. The programming is fairly
straight forward. I'll do what I can to help you, but if it isn't
something you are going to do on a regular basis....manual is the way
to go.

John




  #7   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 43
Default 2.5 StdDevs

James,

I'm thinking like Tom....what is it exactly you are trying to get to?

Are you trying to eliminate outliers from your dataset?....all those
beyond 2.5 std devs. Or are you trying to find the 25th and 75th
quartiles?

John

  #8   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 23
Default 2.5 StdDevs

"Zone" wrote in message
ups.com...


The problem
is that the data is very "raw" -- about 12,000 datapoints ranging from
0.22 to over 100,000. You probably know that this is the kind of mess
one sometimes gets from clients who don't really understand the ratios
that are being asked for.


Hi Zone,
if you want to use the 2.5 StDev rule for removing outliers than your data
should be normally distributed.Ratios are known to be notoriously skewed.

1) Do a transform ; a log- or arcsin transformation could be usefull
2) Do a Kolmogorov-Smirnov test to see if you now have a normal distribution
3) Do the outlier test on the new data

Use statistics software if you can , SAS , Statistica , R , S
hth
Gys


Reply
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules

Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On



All times are GMT +1. The time now is 03:09 PM.

Powered by vBulletin® Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 ExcelBanter.
The comments are property of their posters.
 

About Us

"It's about Microsoft Excel"