#1
Posted to microsoft.public.excel.programming
2.5 StdDevs
I have a column of numbers. I want to eliminate all data in the column that is more than 2.5 standard deviations from the mean, leaving only the data within 2.5 StdDevs. I have a lot of data, so I need to do this as quickly as possible so I can then get statistical quartiles for the data. Any help on one or both issues greatly appreciated! James
#2
James,
There are many "sexy" ways to do it, but if you only have to do it once or twice:
1) Calculate the mean and standard deviation
2) Calculate the upper and lower cutoff values
3) Sort the data
4) Manually delete the values above/below the cutoffs
If this is something you will do on a regular basis, then it may be worth pursuing a more automated method. HTH, John
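John's four manual steps can also be sketched outside Excel. Here is a minimal Python sketch of the one-pass version; the sample values are invented for illustration, not taken from the thread:

```python
import statistics

# Hypothetical raw indicator values (assumed for illustration):
# ten plausible readings plus two data-entry errors
values = [300.0] * 10 + [125000.0, 0.22]

mean = statistics.mean(values)        # step 1: mean
sd = statistics.stdev(values)         # step 1: sample standard deviation
lower = mean - 2.5 * sd               # step 2: lower cutoff
upper = mean + 2.5 * sd               # step 2: upper cutoff

# steps 3-4: sort, then keep only the values inside the cutoffs
kept = sorted(x for x in values if lower <= x <= upper)
```

Note that a single extreme value inflates the standard deviation, so with heavily contaminated data this one-pass rule can keep more than you might expect (here the 0.22 survives; only the 125000 is far enough out to be cut).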
#3
John, Thank you for your reply. I am well-versed in VBA programming, but a complete dunce at advanced mathematics like this. The problem is that the data is very "raw" -- about 12,000 datapoints ranging from 0.22 to over 100,000. You probably know that this is the kind of mess one sometimes gets from clients who don't really understand the ratios being asked for. The objective is to pare the data down to the "reasonable" values that are the bulk of it, for subsequent figuring of quartiles. May I use the Reply to Author button to reply to you directly? James
#4
James,
Sorry, that email doesn't exist anymore. The programming is fairly straightforward. I'll do what I can to help you, but if it isn't something you are going to do on a regular basis, manual is the way to go. John
#5
Posted to microsoft.public.excel.programming
|
|||
|
|||
2.5 StdDevs
John, I understand. I'll plow away at the problem here in hopes that you will stick with me for a while and not get irritated with my lack of mathematical expertise. I would like to use built-in Excel functions as much as possible to speed the calculation. Your first step is "calculate the mean." Would I use Excel's AVERAGE function for this? Say column M contains 12,410 numbers, starting on row 2. Would I then use a formula such as =AVERAGE(M2:M12411) to get the mean? A friend has told me that what is really wanted is the "threshold to the second quartile," so I'm confused. James
#6
What is the basis for excluding data that is beyond 2.5 StdDevs from the mean? Why not use the built-in QUARTILE worksheet function against the raw data to get the quartiles? If the data is incorrect, throwing out the data you suggest doesn't seem to have much justification to me. You need to get correct data. Nonetheless:

Yes, =AVERAGE($M$2:$M$12411) would give you the average.

You could use an adjacent empty column. Assuming the data points are in M as you show, in R2 for example:

=IF(ABS(M2-AVERAGE($M$2:$M$12411))>2.5*STDEV($M$2:$M$12411),"Delete","Keep")

then drag-fill down to R12411. Then do an AutoFilter on column R, select Delete in the dropdown, select all the filtered data except the first row, and do Edit > Delete, choosing Entire Row. Note that once you delete the rows (and remove the filter), the formulas will recalculate with a new mean and standard deviation, and more rows will show up as Delete, but you can ignore these. -- Regards, Tom Ogilvy
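Tom's note that the Delete/Keep flags recalculate after each deletion suggests an iterative variant: re-trim until no point is flagged. A small Python sketch of that idea (my own illustration, not code from the thread):

```python
import statistics

def trim_outliers(values, k=2.5):
    """Repeatedly drop points more than k sample standard deviations from
    the mean until no point qualifies, mirroring the recalculation that
    happens in the worksheet after flagged rows are deleted."""
    data = list(values)
    while len(data) >= 2:
        m = statistics.mean(data)
        s = statistics.stdev(data)
        kept = [x for x in data if s == 0 or abs(x - m) <= k * s]
        if len(kept) == len(data):   # fixed point: nothing left to trim
            break
        data = kept
    return data
```

For example, twenty readings of 100 plus a stray 100000 converge to just the twenty readings after one pass.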
#7
James,
I'm thinking like Tom: what is it exactly you are trying to get to? Are you trying to eliminate outliers from your dataset (all those beyond 2.5 std devs)? Or are you trying to find the 25th and 75th percentiles? John
#8
Tom and John,
I am paying very close attention to everything you say. The problem is that for a particular ratio in a particular industry (such as supply expense per client per day), it is known and published by the experts that the figure should be in the range of a few hundred dollars. Depending on other factors (which are captured), this could vary between $200 or so and maybe $700. Problem is, the incoming data includes figures such as 22 cents and $125,000, both of which are obviously incorrect. Alas, we have no control over the incoming data, so the best we can do is try to grab the "reasonable" figures that are the bulk of the data and ignore the "unreasonable" figures that have resulted from data-input errors, misunderstanding of what is being asked, etc.

What is wanted is Top, Second and Third Quartiles for each indicator. These cannot be figured unless the ridiculous data is excluded. Tom, I presume that when you say the AVERAGE function will get me the average, this is the same as the mean, since neither StdDevs nor quartiles can be figured without the mean, to my limited mathematical knowledge. I actually wasn't aware that there was a spreadsheet function for quartiles, so that is interesting. I take it that this reveals true quartiles, not just counting down 3 in a set of 12 data, for instance.

I am grateful to both of you for your interest in my problem. I readily admit to being mathematically obtuse. I have hundreds of indicators to run, so I'm looking for a fast and efficient way to handle this. James
#9
QUARTILE returns the quartile value, not the position of the quartile. If you had 12 random numbers between 100 and 1000, QUARTILE(A1:A12,1) would return the appropriate number between 100 and 1000, not 3.75, which would be the position of the first quartile. See Excel help for details.

AVERAGE returns the arithmetic mean. STDEV returns the sample standard deviation. STDEVP returns the population standard deviation. Neither requires you to calculate the arithmetic mean separately, but of course you would need that value to determine the distance your datapoint is from the mean; I already provided that information to you. -- Regards, Tom Ogilvy
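Tom's distinction between a quartile's position and its value can be checked in Python: `statistics.quantiles` with `method='inclusive'` uses the same interpolation as Excel's QUARTILE (the Python documentation notes the correspondence with PERCENTILE.INC). The 12 evenly spaced numbers here are invented for illustration:

```python
import statistics

# 12 hypothetical numbers between 100 and 1000
data = [100 + 75 * i for i in range(12)]    # 100, 175, ..., 925

# Position of the first quartile among n sorted points: 1 + 0.25*(n-1)
q1_pos = 1 + 0.25 * (len(data) - 1)         # 3.75 for n = 12

# Values of the quartiles (interpolated like Excel's QUARTILE function)
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
# q1 lies 75% of the way from the 3rd value (250) to the 4th (325)
```

So the *position* is 3.75, while the *value* QUARTILE-style interpolation returns is 306.25, exactly Tom's point.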
#10
Hi Zone,
What makes you certain that 2.5 standard deviations is the right tool for the job? I fathom you've been handed a "left-handed monkey wrench." Tip: there is no such thing as a left-handed monkey wrench.

To put things in perspective: unless you qualify a particular probability density function, a standard deviation is usually discussed in terms of the normal curve, something equally skewed to the left and right that looks like a single hump, with the middle of the distribution corresponding to both the mean and the median. The height of the hump and the spread of the data about the mean can vary.

So what does 2.5 standard deviations correspond to? For a normal distribution, counting both the left and right tails:
Data outside of 1 STD: about 32% (about 16% on each side)
Data outside of 2 STDs: about 4.6%
Data outside of 2.5 STDs: about 1.2%

In your case it's unlikely the data is distributed normally. And suppose there are 1,000 genuinely bad values among 100,000 observations: if you use the 2.5 STD approach you're going to end up keeping most of the hosed observations, because the extreme values themselves inflate the standard deviation. In other words, 2.5 STDs isn't the tool for this job.

What you need to do is filter the data, just as you described: set limits on minimum and maximum values and exclude anything smaller or larger. If you're working with statisticians and want to talk the jargon, you should compute the skew statistic and approximate the statistical distribution (that might be a Pareto or Beta distribution). At the least you should confirm your assumption that the data is normally distributed, and analyze whether assuming a normal distribution is going to do the job.

As to fast ways of dealing with the data: what is it that is so time consuming? Just filter the data.
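The two-tailed figures quoted above can be verified from the standard normal distribution; a quick Python sketch using `math.erfc`:

```python
import math

def two_tailed_fraction(k):
    """P(|Z| > k) for a standard normal Z: the fraction of a normal
    distribution lying more than k standard deviations from the mean,
    counting both tails. Uses P(|Z| > k) = erfc(k / sqrt(2))."""
    return math.erfc(k / math.sqrt(2))

outside_1sd = two_tailed_fraction(1.0)    # about 0.317
outside_2sd = two_tailed_fraction(2.0)    # about 0.046
outside_25sd = two_tailed_fraction(2.5)   # about 0.012
```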
#11
Tom, yes, I understand. The reason I asked is that I understand one statistical method is to throw out the highest and lowest numbers, then, if 12 numbers are left, count down 3 and take that number as the 1st quartile. Say I have these numbers:

0.2 1 3.25 5 17 50 125 500 1250 1500 32505 50000

Assuming the most absurd numbers were already thrown out (leaving only a few absurd numbers), counting down 3 would return 3.25. Obviously this is neither the position nor the value of the quartile. However, say I want to get the center value of the dataset. =QUARTILE(M2:M13,2) gives 87.5, which does not appear to me to be the center, mathematically. =AVERAGE(M2:M13) gives 7163.04, which also does not appear to be the center. There is quite a difference between these two numbers! The quartile value Excel gives seems to lie in the middle of the quartile, not on the threshold.

If the expert says that the figure should be on the order of hundreds of dollars, then conceivably everything below 50 and above 1500 could be excluded. Unfortunately, I do not have an exact range for each of these hundreds of indicators, so I'm trying to throw out outliers by StdDevs (which, by the way, I can set to 2, 2.5, or 3). After throwing out the outliers, I want to get meaningful figures for the 1st, 2nd and 3rd Quartiles. These will be used for judging performance of the clients. This demonstrates my dilemma. I would prefer not to go into hundreds of lines of VBA code to render meaningful figures. Any further guidance from you or John would be appreciated. James
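Both of James's "centers" can be reproduced directly: the median (what =QUARTILE(...,2) returns) resists the extreme values, while the mean is dragged toward them, which is exactly why they disagree so wildly on skewed data. A Python check on the same 12 numbers:

```python
import statistics

data = [0.2, 1, 3.25, 5, 17, 50, 125, 500, 1250, 1500, 32505, 50000]

mean = statistics.mean(data)       # 7163.0375, matching =AVERAGE(M2:M13)
median = statistics.median(data)   # 87.5, matching =QUARTILE(M2:M13,2)
```

The median 87.5 is the true center in the sense that half the values lie below it and half above; the mean is pulled far to the right by the two huge values.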
#12
Those numbers (0.2 through 50000) aren't normally distributed, so STDs are the wrong tool for the job. You need to filter the data, just like John outlined. If you want a no-code solution, ask the client to filter the data on his end. I still don't understand why you think filtering the data requires hundreds of lines of VBA.
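Filtering by fixed bounds really is a one-liner. A Python sketch; the $50–$1500 limits are the hypothetical range James floated, not published figures:

```python
# Hypothetical "reasonable" limits for one indicator (assumed)
LOWER, UPPER = 50.0, 1500.0

data = [0.2, 1, 3.25, 5, 17, 50, 125, 500, 1250, 1500, 32505, 50000]

# Keep only values inside the limits; drop everything else
kept = [x for x in data if LOWER <= x <= UPPER]
```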
#13
Gimme (Tom and John),
You're absolutely right. If the dataset is not a normal distribution, then Excel's built-in AVERAGE, STDEV and QUARTILE functions are of no use whatsoever. What I didn't mention is that my friend is an amateur mathematician. He created a model that can take gamma distributions into account. I have very little idea of what that means, except that the program's model includes special functions which can derive the average, StdDev and quartiles based on a model that morphs from normal to chi-squared. I basically want to speed up his calculations using built-in Excel functions where possible. I think I understand now why his algorithm is so complicated. Thanks, guys. If you care to continue this thread, please post any thoughts you might have. Best regards, James
#14
"Zone" wrote in message
ups.com... The problem is that the data is very "raw" -- about 12,000 datapoints ranging from 0.22 to over 100,000. You probably know that this is the kind of mess one sometimes gets from clients who don't really understand the ratios that are being asked for. Hi Zone, if you want to use the 2.5 StDev rule for removing outliers than your data should be normally distributed.Ratios are known to be notoriously skewed. 1) Do a transform ; a log- or arcsin transformation could be usefull 2) Do a Kolmogorov-Smirnov test to see if you now have a normal distribution 3) Do the outlier test on the new data Use statistics software if you can , SAS , Statistica , R , S hth Gys |
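Gys's steps 1 and 3 can be sketched in Python (step 2, the Kolmogorov-Smirnov test, needs a statistics package such as SciPy, so it is omitted here). The data values are invented for illustration: mostly plausible readings around 300, plus two entry errors like those James described:

```python
import math
import statistics

# Hypothetical skewed ratios: mostly ~300, plus two data-entry errors
data = [300.0] * 18 + [0.22, 125000.0]

logs = [math.log10(x) for x in data]     # step 1: log-transform

m = statistics.mean(logs)
s = statistics.stdev(logs)

# step 3: apply the 2.5-StdDev rule on the log scale, keeping the
# original values whose log deviation is within bounds
kept = [x for x, lx in zip(data, logs) if abs(lx - m) <= 2.5 * s]
```

On the log scale both the 0.22 and the 125000 fall outside 2.5 StdDevs and are dropped, whereas on the raw scale the inflated standard deviation can mask them.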
#15
If you find yourself doing a lot of statistical calculations on rows and columns of data, and if you're in the project for the long haul, check out (and Google) JExcel. It's definitely the tool for the job when it comes to fetching arrays of data from Excel, analyzing the data, and stuffing results back into Excel. You mentioned you didn't want to spend a lot of time writing code, so this might not be for you, but it's still interesting. If you knew J, perhaps it would be the tool for the job. JExcel was created by a developer from Morgan Stanley.
#16
Mean, median, mode, quartiles, variance, and standard deviation are not specific to the normal distribution, nor do they carry any implicit assumption of normality. They are called descriptive statistics and are used to describe and analyze your data. If you have faith in your friend, then it is unclear why you are posting here. You have been given suggestions on how you could "trim" your data, none of which require code, and you could have completed them in less than 1% of the time this thread has existed. -- Regards, Tom Ogilvy
#17
Thanks, everyone, for the great advice. Much appreciated. I should be able to tackle this now with more confidence. James
#18
Tom is correct: those are all descriptive statistics. But the folks who point to a normality requirement when handling outliers via standard deviations have a point too. I still don't fully understand what Zone is trying to accomplish. The original post spoke of determining the quartile points, which doesn't require normality. But there was also talk of truncating the dataset to the points lying within 2.5 standard deviations of the mean, which does lean on normality.

There are other methods of handling outlying data. You might look into methods for accommodating outlying data (as opposed to eliminating it from the dataset). One such method is known as the Winsorized mean.

If you are looking to understand the dataset, then why not plot it? Create two plots: a probability density function (PDF) and a cumulative distribution function (CDF). These are easy to make, even for very large datasets. A histogram is very similar to a PDF and can be created using the Analysis ToolPak add-in. Google these terms and develop the necessary, albeit brief, understanding of the concepts. Then you will likely be in a better position to help your friend. John
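For reference, here is one common formulation of the Winsorized mean John mentions, sketched in Python: the k most extreme values at each end are clipped to their neighbors rather than deleted. The function name and data are my own illustration, not from the thread:

```python
def winsorized_mean(values, k=1):
    """Winsorized mean: replace the k smallest values with the (k+1)-th
    smallest and the k largest with the (k+1)-th largest, then take the
    ordinary mean of the clipped data."""
    s = sorted(values)
    clipped = [min(max(x, s[k]), s[-k - 1]) for x in s]
    return sum(clipped) / len(clipped)
```

For example, winsorized_mean([1, 2, 3, 4, 100]) clips the data to [2, 2, 3, 4, 4] and returns 3.0, while the ordinary mean of the same five values is 22.0: the single outlier is accommodated instead of dominating.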