Mixed numerical and textual data
On Sep 25, 3:21*pm, "joeu2004" wrote:
"jimbo" wrote:
For instance, the lowest detectable limit for
serum aluminum may be <5, which is a text value.
The remaining data ranges between 5-100.
[....]
But there should be a way to take the slope of
the distribution as it approaches the lower
reportable limit and the number of text values
then determine the numeric equivalent of the
values <5 and then, with the numeric values,
determine the mean of the data set as well as
other statistical measurements.
Does anyone know of such a calculation, a similar,
or equivalent calculation?
I think what you have in mind is:
1. Prorate the average of the ungrouped numerical amounts (5 to 100)
* *to the grouped text amounts ("<5", which I assume is 0 to 5).
2. Compute the weighted average of the two.
Suppose you have 1000 data points in A1:A1000, some numerical and the rest
being "<5".
AVERAGE(A1:A1000) is the average of the ungrouped numerical amounts because
AVERAGE ignores empty cells and cells with text.
Similarly, COUNT(A1:A1000) and MAX(A1:A1000) are the count and max of the
ungrouped numerical amounts.
COUNTA(A1:A1000) is the count of both ungrouped numerical and grouped text
amounts, ignoring any empty cells.
The prorated average of grouped amounts ("<5") is:
=5*(AVERAGE(A1:A1000)-5)/(MAX(A1:A1000)-5)
And the weighted average of the two groups (numericals and "<5") is:
=AVERAGE(A1:A1000)*COUNT(A1:A1000)/COUNTA(A1:A1000)
+ 5*(AVERAGE(A1:A1000)-5)*(COUNTA(A1:A1000)-COUNT(A1:A1000))
/COUNTA(A1:A1000)/(MAX(A1:A1000)-5)
Of course, you can simplify and optimize by calculating AVERAGE(A1:A1000),
COUNT(A1:A1000) and COUNTA(A1:A1000) one time in helper cells.
Note: *Instead of "5*...", you might want to substitute the smallest
measurable amount less than 5. *For example, if your measuring instrument
has a precision of 0.01, use "4.99*..." instead of "5*...". *However, keep
"...-5" as is because that is the lowest measurable value of the ungrouped
numerical amounts.
What you have here is "censored" data - that is you know that the data
exists (and where) but not the actual value. If the data has a
recognizable distribution such as the log-normal (ie if your data
ibecomes normal if you log it) then there may well be a way of
estimating the mean. This seems to be the sort of think that you are
hinting at.
As a first step,you could try logging the data and drawing a histogram
of it. If the logged data looks normal with stuff on the left side
missing then you have censored log normal data. xt
|