Home |
Search |
Today's Posts |
#1
![]()
Posted to microsoft.public.excel.misc
|
|||
|
|||
![]()
Jeff apparently wrote:
Subject: how do I identify statistical outliers with excel? You start by putting your inquiry in the __body__ of your posting, not just in the subject line. Some news readers -- notably http://www.microsoft.com/communities...s/default.aspx -- do not show the entire subject line, and they do not permit posting responses when the body of the posting is empty. I will offer some answers to your question if you promise not to use this information to blindly and/or automatically exclude outliers from your data. There are many reasons for deprecating automatic exclusion; some people even deprecate excluding outliers manually. The primary reason for excluding outliers is if you believe they represent data entry errors. But some people argue that even that is not a good reason, pointing to some unusual historical anecdotes. Two numerical methods are commonly used to identify outliers: (1) relying on the interquartile range (IQR); and (2) relying on the standard deviation (SD). I prefer #1; #2 seems valid only when the distribution is known to be "normal". In either case, it is best to graph the data to confirm that the data are truly outliers in a subjective sense. Often, "outliers" are merely indicative of extreme skew in the distribution. One way to graph the data in Excel is to select the data, then click on the Chart Wizard on the toolbar and select the XY (Scatter) chart type. Some writers suggest using histograms for this purpose. I disagree because the shape of a histogram is greatly affected by the width of the bars with respect to the data distribution, which is an art and error-prone. One definition of an outlier is that it is outside"x" times the interquartile range from first or third quartile. "X" is typically 1.5 (mild) or 3 (extreme). I prefer 3 because it is more conservative. In Excel, the following compute the first and third quartiles in A1 and A2 and the interquartile range (IQR) in B1 for the data in D1:D100, for example: A1: =QUARTILE(D1:D100,1) A2: =QUARTILE(D1:D100,3) B1: =A2 - A1 Then the following determines if D1 is an "extreme" outlier: =if(or(D1 < A1 - 3*B1, D1 A2 + 3*B1), "outlier?", "") Another definition of an outlier is that it is outside "x" standard deviations (SD) from the average. "X" is typically 3 or 4. I would prefer 4 because it is more consistent with the definition of "extreme outlier" above. In Excel, if A1 contains =AVERAGE(D1:D100) and B1 contains =STDEV(D1:D100), for example, the following determines if D1 is an outlier: =if(or(D1 < A1 - 4*B1, D1 A1 + 4*B1), "outlier?", "") |
Thread Tools | Search this Thread |
Display Modes | |
|
|
![]() |
||||
Thread | Forum | |||
excel chart rendering problem when using apache java Poi package. | Excel Discussion (Misc queries) | |||
Excel file with hyperlinks takes a long time to open over the network | Links and Linking in Excel | |||
Need suggestions for some uses of Ms Excel | Excel Discussion (Misc queries) | |||
TRYING TO SET UP EXCEL SPREADSHEET ON MY COMPUTER | New Users to Excel | |||
Excel Range Value issue (Excel 97 Vs Excel 2003) | Excel Discussion (Misc queries) |