#1
Posted to microsoft.public.excel.programming
2.5 StdDevs
I have a column of numbers. I want to eliminate all data in the column that is more than 2.5 standard deviations from the mean, leaving only the data within 2.5 StdDevs. I have a lot of data, so I need to do this as quickly as possible so I can then get statistical quartiles for the data. Any help on one or both issues greatly appreciated! James
#2
James,
There are many "sexy" ways to do it, but if you only have to do it once or twice:
1) Calculate the mean and standard deviation
2) Calculate the upper and lower cutoff values
3) Sort the data
4) Manually delete the values above/below the cutoffs
If this is something you will do on a regular basis, then it may be worth pursuing a more automated method. HTH, John
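John's four manual steps can also be sketched outside Excel. Here is a minimal Python sketch of the one-pass version; the sample values are invented for illustration, not taken from the thread:

```python
import statistics

# Hypothetical raw indicator values (assumed for illustration):
# ten plausible readings plus two data-entry errors
values = [300.0] * 10 + [125000.0, 0.22]

mean = statistics.mean(values)        # step 1: mean
sd = statistics.stdev(values)         # step 1: sample standard deviation
lower = mean - 2.5 * sd               # step 2: lower cutoff
upper = mean + 2.5 * sd               # step 2: upper cutoff

# steps 3-4: sort, then keep only the values inside the cutoffs
kept = sorted(x for x in values if lower <= x <= upper)
```

Note that a single extreme value inflates the standard deviation, so with heavily contaminated data this one-pass rule can keep more than you might expect (here the 0.22 survives; only the 125000 is far enough out to be cut).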
#3
John, Thank you for your reply. I am well-versed in VBA programming, but a complete dunce at advanced mathematics like this. The problem is that the data is very "raw" -- about 12,000 datapoints ranging from 0.22 to over 100,000. You probably know that this is the kind of mess one sometimes gets from clients who don't really understand the ratios being asked for. The objective is to pare the data down to the "reasonable" values that are the bulk of it, for subsequent figuring of quartiles. May I use the Reply to Author button to reply to you directly? James
#4
James,
Sorry, that email doesn't exist anymore. The programming is fairly straightforward. I'll do what I can to help you, but if it isn't something you are going to do on a regular basis, manual is the way to go. John
#5
Posted to microsoft.public.excel.programming
|
|||
|
|||
2.5 StdDevs
John, I understand. I'll plow away at the problem here in hopes that you will stick with me for a while and not get irritated with my lack of mathematical expertise. I would like to use built-in Excel functions as much as possible to speed the calculation. Your first step is "calculate the mean." Would I use Excel's AVERAGE function for this? Say column M contains 12,410 numbers, starting on row 2. Would I then use a formula such as =AVERAGE(M2:M12411) to get the mean? A friend has told me that what is really wanted is the "threshold to the second quartile," so I'm confused. James
#6
What is the basis for excluding data that is beyond 2.5 StdDevs from the mean? Why not use the built-in QUARTILE worksheet function against the raw data to get the quartiles? If the data is incorrect, throwing out the data you suggest doesn't seem to have much justification to me. You need to get correct data. Nonetheless:

Yes, =AVERAGE($M$2:$M$12411) would give you the average.

You could use an adjacent empty column. Assuming the data points are in M as you show, in R2 for example:

=IF(ABS(M2-AVERAGE($M$2:$M$12411))>2.5*STDEV($M$2:$M$12411),"Delete","Keep")

then drag-fill down to R12411. Then do an AutoFilter on column R, select Delete in the dropdown, select all the filtered data except the first row, and do Edit > Delete, choosing Entire Row. Note that once you delete the rows (and remove the filter), the formulas will recalculate with a new mean and standard deviation, and more rows will show up as Delete, but you can ignore these. -- Regards, Tom Ogilvy
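Tom's note that the Delete/Keep flags recalculate after each deletion suggests an iterative variant: re-trim until no point is flagged. A small Python sketch of that idea (my own illustration, not code from the thread):

```python
import statistics

def trim_outliers(values, k=2.5):
    """Repeatedly drop points more than k sample standard deviations from
    the mean until no point qualifies, mirroring the recalculation that
    happens in the worksheet after flagged rows are deleted."""
    data = list(values)
    while len(data) >= 2:
        m = statistics.mean(data)
        s = statistics.stdev(data)
        kept = [x for x in data if s == 0 or abs(x - m) <= k * s]
        if len(kept) == len(data):   # fixed point: nothing left to trim
            break
        data = kept
    return data
```

For example, twenty readings of 100 plus a stray 100000 converge to just the twenty readings after one pass.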
#7
James,
I'm thinking like Tom: what is it exactly you are trying to get to? Are you trying to eliminate outliers from your dataset (all those beyond 2.5 std devs)? Or are you trying to find the 25th and 75th percentiles? John
#8
Tom and John,
I am paying very close attention to everything you say. The problem is that for a particular ratio in a particular industry (such as supply expense per client per day), it is known and published by the experts that the figure should be in the range of a few hundred dollars. Depending on other factors (which are captured), this could vary between $200 or so and maybe $700. Problem is, the incoming data includes figures such as 22 cents and $125,000, both of which are obviously incorrect. Alas, we have no control over the incoming data, so the best we can do is try to grab the "reasonable" figures that are the bulk of the data and ignore the "unreasonable" figures that have resulted from data-input errors, misunderstanding of what is being asked, etc.

What is wanted is Top, Second and Third Quartiles for each indicator. These cannot be figured unless the ridiculous data is excluded. Tom, I presume that when you say the AVERAGE function will get me the average, this is the same as the mean, since neither StdDevs nor quartiles can be figured without the mean, to my limited mathematical knowledge. I actually wasn't aware that there was a spreadsheet function for quartiles, so that is interesting. I take it that this reveals true quartiles, not just counting down 3 in a set of 12 data, for instance.

I am grateful to both of you for your interest in my problem. I readily admit to being mathematically obtuse. I have hundreds of indicators to run, so I'm looking for a fast and efficient way to handle this. James
#9
QUARTILE returns the quartile value, not the position of the quartile. If you had 12 random numbers between 100 and 1000, QUARTILE(A1:A12,1) would return the appropriate number between 100 and 1000, not 3.75, which would be the position of the first quartile. See Excel help for details.

AVERAGE returns the arithmetic mean. STDEV returns the sample standard deviation. STDEVP returns the population standard deviation. Neither requires you to calculate the arithmetic mean separately, but of course you would need that value to determine the distance your datapoint is from the mean; I already provided that information to you. -- Regards, Tom Ogilvy
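Tom's distinction between a quartile's position and its value can be checked in Python: `statistics.quantiles` with `method='inclusive'` uses the same interpolation as Excel's QUARTILE (the Python documentation notes the correspondence with PERCENTILE.INC). The 12 evenly spaced numbers here are invented for illustration:

```python
import statistics

# 12 hypothetical numbers between 100 and 1000
data = [100 + 75 * i for i in range(12)]    # 100, 175, ..., 925

# Position of the first quartile among n sorted points: 1 + 0.25*(n-1)
q1_pos = 1 + 0.25 * (len(data) - 1)         # 3.75 for n = 12

# Values of the quartiles (interpolated like Excel's QUARTILE function)
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
# q1 lies 75% of the way from the 3rd value (250) to the 4th (325)
```

So the *position* is 3.75, while the *value* QUARTILE-style interpolation returns is 306.25, exactly Tom's point.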
#10
Hi Zone,
What makes you certain that 2.5 standard deviations is the right tool for the job? I fathom you've been handed a "left-handed monkey wrench." Tip: there is no such thing as a left-handed monkey wrench.

To put things in perspective: unless you qualify a particular probability density function, a standard deviation is usually discussed in terms of the normal curve, something equally skewed to the left and right that looks like a single hump, with the middle of the distribution corresponding to both the mean and the median. The height of the hump and the spread of the data about the mean can vary.

So what does 2.5 standard deviations correspond to? For a normal distribution, counting both the left and right tails:
Data outside of 1 STD: about 32% (about 16% on each side)
Data outside of 2 STDs: about 4.6%
Data outside of 2.5 STDs: about 1.2%

In your case it's unlikely the data is distributed normally. And suppose there are 1,000 genuinely bad values among 100,000 observations: if you use the 2.5 STD approach you're going to end up keeping most of the hosed observations, because the extreme values themselves inflate the standard deviation. In other words, 2.5 STDs isn't the tool for this job.

What you need to do is filter the data, just as you described: set limits on minimum and maximum values and exclude anything smaller or larger. If you're working with statisticians and want to talk the jargon, you should compute the skew statistic and approximate the statistical distribution (that might be a Pareto or Beta distribution). At the least you should confirm your assumption that the data is normally distributed, and analyze whether assuming a normal distribution is going to do the job.

As to fast ways of dealing with the data: what is it that is so time consuming? Just filter the data.
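The two-tailed figures quoted above can be verified from the standard normal distribution; a quick Python sketch using `math.erfc`:

```python
import math

def two_tailed_fraction(k):
    """P(|Z| > k) for a standard normal Z: the fraction of a normal
    distribution lying more than k standard deviations from the mean,
    counting both tails. Uses P(|Z| > k) = erfc(k / sqrt(2))."""
    return math.erfc(k / math.sqrt(2))

outside_1sd = two_tailed_fraction(1.0)    # about 0.317
outside_2sd = two_tailed_fraction(2.0)    # about 0.046
outside_25sd = two_tailed_fraction(2.5)   # about 0.012
```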
#11
Tom, yes, I understand. The reason I asked is that I understand one statistical method is to throw out the highest and lowest numbers, then, if 12 numbers are left, count down 3 and take that number as the 1st quartile. Say I have these numbers:

0.2 1 3.25 5 17 50 125 500 1250 1500 32505 50000

Assuming the most absurd numbers were already thrown out (leaving only a few absurd numbers), counting down 3 would return 3.25. Obviously this is neither the position nor the value of the quartile. However, say I want to get the center value of the dataset. =QUARTILE(M2:M13,2) gives 87.5, which does not appear to me to be the center, mathematically. =AVERAGE(M2:M13) gives 7163.04, which also does not appear to be the center. There is quite a difference between these two numbers! The quartile value Excel gives seems to lie in the middle of the quartile, not on the threshold.

If the expert says that the figure should be on the order of hundreds of dollars, then conceivably everything below 50 and above 1500 could be excluded. Unfortunately, I do not have an exact range for each of these hundreds of indicators, so I'm trying to throw out outliers by StdDevs (which, by the way, I can set to 2, 2.5, or 3). After throwing out the outliers, I want to get meaningful figures for the 1st, 2nd and 3rd Quartiles. These will be used for judging performance of the clients. This demonstrates my dilemma. I would prefer not to go into hundreds of lines of VBA code to render meaningful figures. Any further guidance from you or John would be appreciated. James
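Both of James's "centers" can be reproduced directly: the median (what =QUARTILE(...,2) returns) resists the extreme values, while the mean is dragged toward them, which is exactly why they disagree so wildly on skewed data. A Python check on the same 12 numbers:

```python
import statistics

data = [0.2, 1, 3.25, 5, 17, 50, 125, 500, 1250, 1500, 32505, 50000]

mean = statistics.mean(data)       # 7163.0375, matching =AVERAGE(M2:M13)
median = statistics.median(data)   # 87.5, matching =QUARTILE(M2:M13,2)
```

The median 87.5 is the true center in the sense that half the values lie below it and half above; the mean is pulled far to the right by the two huge values.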
#12
Those numbers (0.2 through 50000) aren't normally distributed, so STDs are the wrong tool for the job. You need to filter the data, just like John outlined. If you want a no-code solution, ask the client to filter the data on his end. I still don't understand why you think filtering the data requires hundreds of lines of VBA.
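Filtering by fixed bounds really is a one-liner. A Python sketch; the $50–$1500 limits are the hypothetical range James floated, not published figures:

```python
# Hypothetical "reasonable" limits for one indicator (assumed)
LOWER, UPPER = 50.0, 1500.0

data = [0.2, 1, 3.25, 5, 17, 50, 125, 500, 1250, 1500, 32505, 50000]

# Keep only values inside the limits; drop everything else
kept = [x for x in data if LOWER <= x <= UPPER]
```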
#13
Gimme (Tom and John),
You're absolutely right. If the dataset is not a normal distribution, then Excel's built-in AVERAGE, STDEV and QUARTILE functions are of no use whatsoever. What I didn't mention is that my friend is an amateur mathematician. He created a model that can take gamma distributions into account. I have very little idea of what that means, except that the program's model includes special functions which can derive the average, StdDev and quartiles based on a model that morphs from normal to chi-squared. I basically want to speed up his calculations using built-in Excel functions where possible. I think I understand now why his algorithm is so complicated. Thanks, guys. If you care to continue this thread, please post any thoughts you might have. Best regards, James
#14
"Zone" wrote in message
ups.com... The problem is that the data is very "raw" -- about 12,000 datapoints ranging from 0.22 to over 100,000. You probably know that this is the kind of mess one sometimes gets from clients who don't really understand the ratios that are being asked for. Hi Zone, if you want to use the 2.5 StDev rule for removing outliers than your data should be normally distributed.Ratios are known to be notoriously skewed. 1) Do a transform ; a log- or arcsin transformation could be usefull 2) Do a Kolmogorov-Smirnov test to see if you now have a normal distribution 3) Do the outlier test on the new data Use statistics software if you can , SAS , Statistica , R , S hth Gys |
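Gys's steps 1 and 3 can be sketched in Python (step 2, the Kolmogorov-Smirnov test, needs a statistics package such as SciPy, so it is omitted here). The data values are invented for illustration: mostly plausible readings around 300, plus two entry errors like those James described:

```python
import math
import statistics

# Hypothetical skewed ratios: mostly ~300, plus two data-entry errors
data = [300.0] * 18 + [0.22, 125000.0]

logs = [math.log10(x) for x in data]     # step 1: log-transform

m = statistics.mean(logs)
s = statistics.stdev(logs)

# step 3: apply the 2.5-StdDev rule on the log scale, keeping the
# original values whose log deviation is within bounds
kept = [x for x, lx in zip(data, logs) if abs(lx - m) <= 2.5 * s]
```

On the log scale both the 0.22 and the 125000 fall outside 2.5 StdDevs and are dropped, whereas on the raw scale the inflated standard deviation can mask them.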
#15
If you find yourself doing a lot of statistical calculations on rows and columns of data, and if you're in the project for the long haul, check out (and Google) JExcel. It's definitely the tool for the job when it comes to fetching arrays of data from Excel, analyzing the data, and stuffing results back into Excel. You mentioned you didn't want to spend a lot of time writing code, so this might not be for you, but it's still interesting. If you knew J, perhaps it would be the tool for the job. JExcel was created by a developer from Morgan Stanley.
#16
Mean, median, mode, quartiles, variance, and standard deviation are not specific to the normal distribution, nor do they carry any implicit assumption of normality. They are called descriptive statistics and are used to describe and analyze your data. If you have faith in your friend, then it is unclear why you are posting here. You have been given suggestions on how you could "trim" your data, none of which require code, and you could have completed them in less than 1% of the time this thread has existed. -- Regards, Tom Ogilvy
#17
Thanks, everyone, for the great advice. Much appreciated. I should be able to tackle this now with more confidence. James
#18
Tom is correct: those are all descriptive statistics. But the folks who point to a normality requirement when handling outliers via standard deviations have a point too. I still don't fully understand what Zone is trying to accomplish. The original post spoke of determining the quartile points, which doesn't require normality. But there was also talk of truncating the dataset to the points lying within 2.5 standard deviations of the mean, which does lean on normality.

There are other methods of handling outlying data. You might look into methods for accommodating outlying data (as opposed to eliminating it from the dataset). One such method is known as the Winsorized mean.

If you are looking to understand the dataset, then why not plot it? Create two plots: a probability density function (PDF) and a cumulative distribution function (CDF). These are easy to make, even for very large datasets. A histogram is very similar to a PDF and can be created using the Analysis ToolPak add-in. Google these terms and develop the necessary, albeit brief, understanding of the concepts. Then you will likely be in a better position to help your friend. John
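For reference, here is one common formulation of the Winsorized mean John mentions, sketched in Python: the k most extreme values at each end are clipped to their neighbors rather than deleted. The function name and data are my own illustration, not from the thread:

```python
def winsorized_mean(values, k=1):
    """Winsorized mean: replace the k smallest values with the (k+1)-th
    smallest and the k largest with the (k+1)-th largest, then take the
    ordinary mean of the clipped data."""
    s = sorted(values)
    clipped = [min(max(x, s[k]), s[-k - 1]) for x in s]
    return sum(clipped) / len(clipped)
```

For example, winsorized_mean([1, 2, 3, 4, 100]) clips the data to [2, 2, 3, 4, 4] and returns 3.0, while the ordinary mean of the same five values is 22.0: the single outlier is accommodated instead of dominating.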