Home |
Search |
Today's Posts |
#1
|
|||
|
|||
t-distribution puzzle in Excel
I tried a couple of standard ways to generate random numbers that
follow Student's t-distribution. Something is very wrong, and I don't know what. Start with a simple example. In column A, I put 1000 uniformly distributed on (0,1) numbers, using RAND(). Then I apply to them the function TINV. E.g., in cell B1, I have =IF(A1<0.5,-TINV(2*A1,df),TINV(2*(1-A1),df)) where df is my degrees of freedom. I fixed df = 5. I obtain a column of (presumably) t-distributed numbers. I then check their excess kurtosis (as calculated by the KURT() function). As I was repeating this exercise using new random numbers, I found the sample kurtosis jumps around ~2.9 with a standard deviation of ~1.1. The theoretical value for the excess kurtosis of t distribution is 6 / (df - 4) = 6, and the theoretical stdev of the sample kurtosis itself is sqrt(24/1000) ~ 0.15. So I am puzzled as to why the excess kurtosis is so incredibly far off. Also puzzled by why it's so unstable. I then replaced Excel's rand() worksheet function with the Mersenne-Twister implementation in NtRand http://www.numtech.com/NtRand/. Similar result (the sample excess kurtosis jumps wildly but still very far below the theoretical value of 6). I then added another column of random uniform (0,1) numbers (in column C), and tried using instead the formula: =NORMSINV(A1)/SQRT(df/GAMMAINV(C1,df/2,2)) Similar result. I tried usign 60,000 instead of 1000 numbers. Now the theoretical stdev of the sample kurtosis is just 0.02. Similar result. I checked the variance and skewness, and those match their values ( df / df - 2 = 1.667 and 0 respectively) very closely. I checked the normal distribution, and its kurtosis matches the theoretical perfectly. I know I'm doing something wrong, but what? =( Thanks.... |
#3
|
|||
|
|||
t-distribution puzzle in Excel
Thanks Jerry! Btw, your function is a LOT faster than Excel's tinv!
But the results are still the same. I'm beginning to wonder.. maybe the distribution of the sample kurtosis is so *terribly* skewed, even for sample of size 60000, that it looks to me like I'm getting a biased result? And maybe the variance is so terribly huge too (I only know the theoretical variance of the sample kurtosis for normal, not t distribution).. |
#4
|
|||
|
|||
t-distribution puzzle in Excel
Please clarify what you did. Are you talking about
1. Using TINV(2*p,df) but only for p<0.5 with an auxiliary Bernoulli variable to decide whether it is the upper or lower tail (multiply t-value by +/-1). 2. Using Ian Smith's inv_tdist(p,df). 3. Using Ian Smith's inv_tdist(p,df) but only for p<0.5 with an auxiliary Bernoulli variable to decide whether it is the upper or lower tail (multiply t-value by +/-1). 4. Something else (please describe) I would be very surprised if you are finding a VBA function to be faster than a native Excel function. In Excel 2002 and earlier versions, =TINV(p,df) returns 5E6 for p<4.3E-7, regardless of df (too extreme even for df=1). Whenever you hit such a value, that will necessarily destabilize your descriptive statistics. Excel 2003 truncates the distribution at 1E7, but the p-value for which TINV returns 1E7 varies with df, and seems to provide around 10 digit accuracy in this tail, so TINV in 2003 might be acceptable to generate random variates, but I still would avoid GAMMAINV for this purpose. By contrast, Ian Smith's VBA functions are as close to machine accuracy as I have seen in any double precision implementation (better than commercial statistics packages, better than commercial math libraries, etc.) Jerry wrote: Thanks Jerry! Btw, your function is a LOT faster than Excel's tinv! But the results are still the same. I'm beginning to wonder.. maybe the distribution of the sample kurtosis is so *terribly* skewed, even for sample of size 60000, that it looks to me like I'm getting a biased result? And maybe the variance is so terribly huge too (I only know the theoretical variance of the sample kurtosis for normal, not t distribution).. |
#5
|
|||
|
|||
t-distribution puzzle in Excel
Jerry,
I'm ignoring all p 0.5. Then I'm using inv_tdist(p,df) if q < 0.5 inv_tidst(1-p,df) if q 0.5, where q = rand() is an uniform r.v. from Excel (independent from the previous one). I don't why it's faster, but it is! At least 5 times faster, if not 10 times. I'm using Excel 2003. But I think I found that my problem is in the statistical domain, not in the computation. See this thread: http://groups.google.com/group/sci.s...83e1dd956d59e/ or sci.stat.math thread with subject "t distribution -- sample kurtosis" if this link doesn't work. (Anyone knows how to link correctly to threads?..) Btw, is inv_tdist in public domain? Thank you! |
#6
|
|||
|
|||
t-distribution puzzle in Excel
mm wrote: .... Btw, is inv_tdist in public domain? .... The first line of the VBA code is a copyright notice, so by definition, it is not public domain. However, if Ian Smith had not intended for individuals to use his code, then I doubt that he would have published the source code in a public place. If you are wanting to use it commercially, I would recommend e-mailing him. His address ) is clearly displayed on his web page http://members.aol.com/iandjmsmith/ and I have found him to respond readily to communications regarding his numerical analysis efforts. Ian? Jerry |
#7
|
|||
|
|||
t-distribution puzzle in Excel
Jerry W. Lewis wrote: mm wrote: ... Btw, is inv_tdist in public domain? ... The first line of the VBA code is a copyright notice, so by definition, it is not public domain. However, if Ian Smith had not intended for individuals to use his code, then I doubt that he would have published the source code in a public place. If you are wanting to use it commercially, I would recommend e-mailing him. His address ) is clearly displayed on his web page http://members.aol.com/iandjmsmith/ and I have found him to respond readily to communications regarding his numerical analysis efforts. Ian? Jerry Any time I suggest someone might wish to use the code, I try to make it clear that they can do anything they like with it. I place no restrictions and give no guarantees although I believe the code is pretty much as good as you will find. I do not see the point in having everyone write code for the same old stuff over and over again. It seems to defeat the main point of computer programs which is that you only have to get it right once and it'll work forever. I have no problems with the code being used commercially. Should anyone make a fortune because of its use, they are more than welcome to shower me with gifts/donations. So Microsoft, please note, I have no problem if this code or better is added to Excel making my VBA code redundant! The code has a copyright notice at the top because when I first made the code available, I consulted some online document on copyright issues and it said I should include it. It also said I have the copyright whether or not I add it in because I produced the code or some such stuff. Please note, the code for AS241 is merely a translation and so from a copyright point of view has nothing to do with me. This may have implications about it's commercial use. I believe it is ok to use AS241 (and translations of it) commercially but one thing I am not is a legal expert. I do not even know exactly what "in the public domain" implies. The main purpose of the code is to provide accurate methods of performing basic calculations for statistical distributions. It also shows how simple modifications to techniques such as Newton-Raphson can improve them enormously. I think that's why the inv_tdist code is much faster than the Excel TINV code despite VBA not being particularly quick. Excel claims to use a binary search which would require the TDIST function to be called lots of times. I have not seen the modified NR approach call the cdf function more than 7 times. The code also includes asymptotic formulae and continued fraction formulae which I've discovered over the years and have not seen elsewhere. Lastly, it shows how write statistical functions without using a function to evaluate the log of the gamma function - a pet hate of mine because its use guarantees large errors if applied to large arguments. Again you only have to get it right once and unnecessary errors of that type need never reappear. What the code is not intended to do is probably just as important. It is not intended to be the fastest, the most accurate or best in any sense. I believe all of the code can be improved (made faster or more accurate or work with logs of probabilities to avoid returning probabilities of 0 ...) but the code is a good place to start from. The code can usually be modified easily. If, for example, you wish to calculate functions of the t-distribution for non-integral degrees of freedom just comment out the lines like df = Fix(df) in the pdf, cdf and inv functions. The code does not make use of the fact that it is restricted to integral degrees of freedom. Ian Smith |
#8
|
|||
|
|||
t-distribution puzzle in Excel
repeating this exercise using new random numbers, I found the sample
kurtosis jumps around ~2.9 with a standard deviation of ~1.1. The theoretical value for the excess kurtosis of t distribution is 6 / (df - 4) = 6, Hi. I'm not a Stats guy, so this is just for gee wiz. Just for feedback, the math program Mathematica appears to agree with Excel. Most of the values returned by KurtosisExcess are below 6. I "think" that Excel (2003) is doing it correctly! These values agree with what you mentioned... Variance[StudentTDistribution[5]]} 5/3 (or 1.6666...) KurtosisExcess[StudentTDistribution[5]] 6 ( I know this isn't Excel, but I thought it would be easier to show it this way instead of just the answers.) = = = = = = = = = = = Example #1 Data size = 30,000. The value of KurtosisExcess jumps around. Some values were below 3, and above 10. df =5. Here's 10 samples... t=Table[KurtosisExcess[RandomArray[StudentTDistribution[5], 30000]], {10}] {5.111078, 4.0498174, 3.8689816, 4.5970144, 7.0692163, 3.2919761, 5.9771902, 4.1000405, 3.23158, 4.8915233} {Mean[t], StandardDeviation[t]} {4.6188418, 1.2021682} = = = = = = = = = = = Example #2 Data size = 300,000 (Larger size) t=Table[KurtosisExcess[RandomArray[StudentTDistribution[5], 300000]], {10}] {4.4799506, 4.6979051, 7.693889, 4.8315766, 6.7575222, 4.685617, 4.415599, 4.9943591, 6.166258, 4.404873} {Mean[t], StandardDeviation[t]} {5.312755, 1.1504864} As you can see, KurtosisExcess is usually below 6 and varies quite a bit. Maybe someone good at Stats can jump in and confirm. -- Dana DeLouis Win XP & Office 2003 wrote in message oups.com... I tried a couple of standard ways to generate random numbers that follow Student's t-distribution. Something is very wrong, and I don't know what. Start with a simple example. In column A, I put 1000 uniformly distributed on (0,1) numbers, using RAND(). Then I apply to them the function TINV. E.g., in cell B1, I have =IF(A1<0.5,-TINV(2*A1,df),TINV(2*(1-A1),df)) where df is my degrees of freedom. I fixed df = 5. I obtain a column of (presumably) t-distributed numbers. I then check their excess kurtosis (as calculated by the KURT() function). As I was repeating this exercise using new random numbers, I found the sample kurtosis jumps around ~2.9 with a standard deviation of ~1.1. The theoretical value for the excess kurtosis of t distribution is 6 / (df - 4) = 6, and the theoretical stdev of the sample kurtosis itself is sqrt(24/1000) ~ 0.15. So I am puzzled as to why the excess kurtosis is so incredibly far off. Also puzzled by why it's so unstable. I then replaced Excel's rand() worksheet function with the Mersenne-Twister implementation in NtRand http://www.numtech.com/NtRand/. Similar result (the sample excess kurtosis jumps wildly but still very far below the theoretical value of 6). I then added another column of random uniform (0,1) numbers (in column C), and tried using instead the formula: =NORMSINV(A1)/SQRT(df/GAMMAINV(C1,df/2,2)) Similar result. I tried usign 60,000 instead of 1000 numbers. Now the theoretical stdev of the sample kurtosis is just 0.02. Similar result. I checked the variance and skewness, and those match their values ( df / df - 2 = 1.667 and 0 respectively) very closely. I checked the normal distribution, and its kurtosis matches the theoretical perfectly. I know I'm doing something wrong, but what? =( Thanks.... |
#9
|
|||
|
|||
t-distribution puzzle in Excel
Dana DeLouis wrote:
repeating this exercise using new random numbers, I found the sample kurtosis jumps around ~2.9 with a standard deviation of ~1.1. The theoretical value for the excess kurtosis of t distribution is 6 / (df - 4) = 6, Hi. I'm not a Stats guy, so this is just for gee wiz. Just for feedback, the math program Mathematica appears to agree with Excel. Most of the values returned by KurtosisExcess are below 6. I "think" that Excel (2003) is doing it correctly! These values agree with what you mentioned... Variance[StudentTDistribution[5]]} 5/3 (or 1.6666...) KurtosisExcess[StudentTDistribution[5]] 6 ( I know this isn't Excel, but I thought it would be easier to show it this way instead of just the answers.) = = = = = = = = = = = Example #1 Data size = 30,000. The value of KurtosisExcess jumps around. Some values were below 3, and above 10. df =5. Here's 10 samples... t=Table[KurtosisExcess[RandomArray[StudentTDistribution[5], 30000]], {10}] {5.111078, 4.0498174, 3.8689816, 4.5970144, 7.0692163, 3.2919761, 5.9771902, 4.1000405, 3.23158, 4.8915233} {Mean[t], StandardDeviation[t]} {4.6188418, 1.2021682} = = = = = = = = = = = Example #2 Data size = 300,000 (Larger size) t=Table[KurtosisExcess[RandomArray[StudentTDistribution[5], 300000]], {10}] {4.4799506, 4.6979051, 7.693889, 4.8315766, 6.7575222, 4.685617, 4.415599, 4.9943591, 6.166258, 4.404873} {Mean[t], StandardDeviation[t]} {5.312755, 1.1504864} As you can see, KurtosisExcess is usually below 6 and varies quite a bit. Maybe someone good at Stats can jump in and confirm. -- Dana DeLouis Win XP & Office 2003 wrote in message oups.com... I tried a couple of standard ways to generate random numbers that follow Student's t-distribution. Something is very wrong, and I don't know what. Start with a simple example. In column A, I put 1000 uniformly distributed on (0,1) numbers, using RAND(). Then I apply to them the function TINV. E.g., in cell B1, I have =IF(A1<0.5,-TINV(2*A1,df),TINV(2*(1-A1),df)) where df is my degrees of freedom. I fixed df = 5. I obtain a column of (presumably) t-distributed numbers. I then check their excess kurtosis (as calculated by the KURT() function). As I was repeating this exercise using new random numbers, I found the sample kurtosis jumps around ~2.9 with a standard deviation of ~1.1. The theoretical value for the excess kurtosis of t distribution is 6 / (df - 4) = 6, and the theoretical stdev of the sample kurtosis itself is sqrt(24/1000) ~ 0.15. So I am puzzled as to why the excess kurtosis is so incredibly far off. Also puzzled by why it's so unstable. I then replaced Excel's rand() worksheet function with the Mersenne-Twister implementation in NtRand http://www.numtech.com/NtRand/. Similar result (the sample excess kurtosis jumps wildly but still very far below the theoretical value of 6). I then added another column of random uniform (0,1) numbers (in column C), and tried using instead the formula: =NORMSINV(A1)/SQRT(df/GAMMAINV(C1,df/2,2)) Similar result. I tried usign 60,000 instead of 1000 numbers. Now the theoretical stdev of the sample kurtosis is just 0.02. Similar result. I checked the variance and skewness, and those match their values ( df / df - 2 = 1.667 and 0 respectively) very closely. I checked the normal distribution, and its kurtosis matches the theoretical perfectly. I know I'm doing something wrong, but what? =( Thanks.... As the OP said the discussion on the statistical properties of the sample kurtosis goes on at sci.stat.math and no new fault in Excel has been discovered. What has been discovered is how surprisingly slow the Excel inverse functions are. I knew they were slow in extreme situations but I had not appreciated how slow they were in general. Out of curiousity, I then looked at http://support.microsoft.com/default...21120121120120 to see what Microsoft suggest on generating random variables. As you can see random number generation via the Analysis Toolpak is no longer recommended. Instead they suggest =NORMINV(RAND(), mu, sigma) for generating values from the normal distribution with mean mu and variance sigma^2. Curious about the performance I copied the formula =normsinv(rand()) into the range a1:j20000 to see how long it would take. On my machine it took 154 seconds which is fairly terrible - spreadsheet overheads notwithstanding. I then tried inv_normal(rand()) in the same range and it took 4 seconds. Lastly I tried the Box-Muller approach using =SQRT(-2*LN(RAND()))*COS(2*PI()*RAND()) in the same range and this took about 1 second. Maybe the random number generation section of the page referred to above could do with updating. Ian Smith |
Reply |
Thread Tools | Search this Thread |
Display Modes | |
|
|
Similar Threads | ||||
Thread | Forum | |||
Excel 2003 FAILS, but Excel 2000 SUCCEEDS ??? | Excel Discussion (Misc queries) | |||
Opening two separate instances of Excel | Excel Discussion (Misc queries) | |||
Stop Excel Rounding Dates | Excel Discussion (Misc queries) | |||
Excel error - Startup (and Acrobat PDFMaker) | Setting up and Configuration of Excel | |||
how do i draw a distribution chart in excel | Charts and Charting in Excel |