ExcelBanter - View Single Post

MrShorty wrote:

Based on the information in that article, once could (assuming enough
RAM, processor time, and such) theoretically do a least squares
regression for up 52 parameters, then.

Theoretically yes, but forming the normal equations is a numerically
poor way to do least squares, since numerical singularity is then
governed by the ratio of largest to smallest eigenvalues in X'X instead
of that ratio in X. MINVERSE will often give a numerically inaccurate
solution instead of warning of approximate numerical singularity.
Consequently I would tend distrust the output.

This issue in 1 dimension is the difference between calculating
=SUMSQ(data)-SUM(data)^2/COUNT(data)
(as was done by VAR, STDEV, etc prior to Excel 2003) and calculating
=DEVSQ(data)
The first form becomes more and more inaccurate as the CV of the data
decreases. For example, STDEV(x,x+0.1) should be 0.0707106781186548
=SQRT(0.005) regardless of x. But for x=100000 (CV=0.00007%) the first
form (and STDEV in Excel 2002 and earlier) is inaccurate in the 4th
significant figure of the standard deviation calculation. Between
1000000 and 10000000, all significant figures are lost in the first
calculation (summing squared x's has pushed the interesting part of the
calculation beyond the accuracy if IEEE double precision). While the
second form continues to work well.

Personally, I can't remember ever being asked to analyze data with a CV
that small, (and if I did, I would have used coding (linear
transformation to a more practical range) to avoid numerical problems,
but things are not so clear in the multiple dimensions of the general
linear model. For instance, if you fit a 6th degree polynomial where
the x data is 20.0, 20.5, 21.0, 21.5, ..., 27.0, the condition number of
X'X is around 10^30, suggesting that direct solution of the normal
equations would require quadruple precision to calculate it
http://groups-beta.google.com/group/...9a2bb33e6cdbb8
Pre-2003 LINEST (or the direct MINVERSE() formula linked from my
previous post) will try, but will get no figures correct without
warning. SAS PROC GLM and PROC REG will complain of numerical
singularity (SAS's use of the sweep operator necessarily requires it to
form the normal equations). That is why books on statistical computing
recommend other approaches, such as Givens rotations, as implemented in
http://lib.stat.cmu.edu/apstat/274
http://lib.stat.cmu.edu/apstat/75
the lm() function in S-PLUS and R will get 8 figures correct for each
coefficient in the preceding example despite the ill-conditioning.
Excel's trendline will get 9 figures correct, but will not go beyond a
6th degree polynomial. There is a contributed package for R that would
allow you to pass data from Excel to R, analyze it in R, and pass the
result back to Excel. To learn more about R, go to
http://www.r-project.org

Jerry