![]() |
Z score on existing data
I have 12 data points representing 12 months of data for each of 100
people. I'm looking for outliers. So I set up a high z score, max-mean/ SD, and a low z, mean-min/SD. When I look at my table, my z scores look low; I see clear outliers that have a reletively low z. I think that this is due to the outlier being compared to a set of data that it is already part of, contributing both to increasing the denominator as well as lowering the numerator by being part of the mean. If I were to compare a new value to the set of twelve data points, the z would be accurate, but I somehow have to take the outlier out of the group and then recalculate the z and I have to check for both low and high. With a lot of rows of data, doing this manually is impossible. I can't figure out how to do it in excel either. Am I using the wrong approach or is there a way, or a mathematical compensation that makes up for the high and low outlier being part of the mean and SD? |
Z score on existing data
"jimbo" wrote:
I somehow have to take the outlier out of the group and then recalculate the z and I have to check for both low and high. [....] I can't figure out how to do it in excel either. Am I using the wrong approach or is there a way, or a mathematical compensation that makes up for the high and low outlier being part of the mean and SD? I have used a similar iterative approach when eliminating outliers. AFAIK, there is no "mathematical compensation" factor. You simply need to exclude outliers from the (re)computation of the mean and SD. But there are tricks that you might be able to use to make it easier to do in Excel. One approach to eliminating outliers is to convert them to text or to clear their cell. Note that Excel functions like AVERAGE and STDEV ignore text and empty cells. Therefore, you do not need to adjust their parameters. For example, if D90 contains the number 123 and you determine that is an outlier, change the cell to '123 (note the apostrophe) and change the horizontal format to right-alignment. Or simply clear D90. You can still use AVERAGE(A1:L100) and STDEV(A1:L100). The mean and SD are adjusted automatically, and the identification of subsequent outliers will be based on the new mean and SD. Note-1: If you change 123 to text, it would be prudent to change to the font or cell fill color or otherwise identify the data as an excluded outlier. Note-2: Simply changing the cell format to Text alone does not cause numeric constants to be converted to text, although it might change the horizontal alignment to be similar to text. Alternatively, it might be tempting to simply change the font or cell fill color; that is easy to do automatically with a Condition Format. Then you might use array-entered AVERAGE and STDEV formulas to ignore the colored outliers. However, AFAIK, there is no Excel function that could be used in a conditional expression to recognized and ignore colored outliers. In particular, CELLS("color") can__not__ be used for that purpose. But you could write your own VBA function to recognize colored outliers. See http://www.cpearson.com/excel/colors.aspx for details and ideas. Then the mean and SD might be computed by the following array-entered formulas (press ctrl+shift+Enter instead of just Enter): =AVERAGE(IF(MYCOLOR(A1:L100)=0,A1:L100)) =STDEV(IF(MYCOLOR(A1:L100)=0,A1:L100)) Although I believe the __elimination__ of outliers (in contrast to their __identification__) should be a manual procedure, you could implement the iterative process in a macro. Some unsolicited comments.... "jimbo" wrote: I have 12 data points representing 12 months of data for each of 100 people. I'm looking for outliers. So I set up a high z score, max-mean/SD, and a low z, mean-min/SD. When I look at my table, my z scores look low; I see clear outliers that have a reletively low z. Using the z-score to identify outliers is appropriate if the data are normally distributed, or if you have good reason to believe a large population of data would be. Otherwise, you might consider alternative criteria, such as the interquartile range (IQR). In any case, "12 data points" is not much data for computing statistics and identifying outliers. I hope you mean that the mean and SD are based on 1200 data points, which you might group by month for presentation. |
All times are GMT +1. The time now is 06:05 PM. |
Powered by vBulletin® Copyright ©2000 - 2025, Jelsoft Enterprises Ltd.
ExcelBanter.com