LinkBack Thread Tools Search this Thread Display Modes
Prev Previous Post   Next Post Next
  #1   Report Post  
Posted to microsoft.public.excel.misc
external usenet poster
 
Posts: 418
Default how do I identify statistical outliers with excel?

Jeff apparently wrote:
Subject: how do I identify statistical outliers with excel?


You start by putting your inquiry in the __body__ of your posting, not
just in the subject line. Some news readers -- notably
http://www.microsoft.com/communities...s/default.aspx --
do not show the entire subject line, and they do not permit posting
responses when the body of the posting is empty.

I will offer some answers to your question if you promise not to use
this information to blindly and/or automatically exclude outliers from
your data. There are many reasons for deprecating automatic exclusion;
some people even deprecate excluding outliers manually. The primary
reason for excluding outliers is if you believe they represent data
entry errors. But some people argue that even that is not a good
reason, pointing to some unusual historical anecdotes.

Two numerical methods are commonly used to identify outliers: (1)
relying on the interquartile range (IQR); and (2) relying on the
standard deviation (SD). I prefer #1; #2 seems valid only when the
distribution is known to be "normal". In either case, it is best to
graph the data to confirm that the data are truly outliers in a
subjective sense. Often, "outliers" are merely indicative of extreme
skew in the distribution.

One way to graph the data in Excel is to select the data, then click on
the Chart Wizard on the toolbar and select the XY (Scatter) chart type.
Some writers suggest using histograms for this purpose. I disagree
because the shape of a histogram is greatly affected by the width of
the bars with respect to the data distribution, which is an art and
error-prone.

One definition of an outlier is that it is outside"x" times the
interquartile range from first or third quartile. "X" is typically 1.5
(mild) or 3 (extreme). I prefer 3 because it is more conservative. In
Excel, the following compute the first and third quartiles in A1 and A2
and the interquartile range (IQR) in B1 for the data in D1:D100, for
example:

A1: =QUARTILE(D1:D100,1)
A2: =QUARTILE(D1:D100,3)
B1: =A2 - A1

Then the following determines if D1 is an "extreme" outlier:

=if(or(D1 < A1 - 3*B1, D1 A2 + 3*B1), "outlier?", "")

Another definition of an outlier is that it is outside "x" standard
deviations (SD) from the average. "X" is typically 3 or 4. I would
prefer 4 because it is more consistent with the definition of "extreme
outlier" above. In Excel, if A1 contains =AVERAGE(D1:D100) and B1
contains =STDEV(D1:D100), for example, the following determines if D1
is an outlier:

=if(or(D1 < A1 - 4*B1, D1 A1 + 4*B1), "outlier?", "")

 
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules

Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
excel chart rendering problem when using apache java Poi package. Tom Excel Discussion (Misc queries) 0 December 21st 06 09:03 AM
Excel file with hyperlinks takes a long time to open over the network Didier P Links and Linking in Excel 3 July 4th 06 04:39 PM
Need suggestions for some uses of Ms Excel Bible John Excel Discussion (Misc queries) 1 February 27th 06 05:30 PM
TRYING TO SET UP EXCEL SPREADSHEET ON MY COMPUTER MEGTOM New Users to Excel 5 October 27th 05 03:06 AM
Excel Range Value issue (Excel 97 Vs Excel 2003) Keeno Excel Discussion (Misc queries) 2 June 13th 05 02:01 PM


All times are GMT +1. The time now is 06:29 AM.

Powered by vBulletin® Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 ExcelBanter.
The comments are property of their posters.
 

About Us

"It's about Microsoft Excel"