View Single Post
  #5   Report Post  
Posted to microsoft.public.excel.worksheet.functions
Xt Xt is offline
external usenet poster
 
Posts: 49
Default Mixed numerical and textual data

On Sep 25, 3:21*pm, "joeu2004" wrote:
"jimbo" wrote:
For instance, the lowest detectable limit for
serum aluminum may be <5, which is a text value.
The remaining data ranges between 5-100.

[....]
But there should be a way to take the slope of
the distribution as it approaches the lower
reportable limit and the number of text values
then determine the numeric equivalent of the
values <5 and then, with the numeric values,
determine the mean of the data set as well as
other statistical measurements.


Does anyone know of such a calculation, a similar,
or equivalent calculation?


I think what you have in mind is:

1. Prorate the average of the ungrouped numerical amounts (5 to 100)
* *to the grouped text amounts ("<5", which I assume is 0 to 5).

2. Compute the weighted average of the two.

Suppose you have 1000 data points in A1:A1000, some numerical and the rest
being "<5".

AVERAGE(A1:A1000) is the average of the ungrouped numerical amounts because
AVERAGE ignores empty cells and cells with text.

Similarly, COUNT(A1:A1000) and MAX(A1:A1000) are the count and max of the
ungrouped numerical amounts.

COUNTA(A1:A1000) is the count of both ungrouped numerical and grouped text
amounts, ignoring any empty cells.

The prorated average of grouped amounts ("<5") is:

=5*(AVERAGE(A1:A1000)-5)/(MAX(A1:A1000)-5)

And the weighted average of the two groups (numericals and "<5") is:

=AVERAGE(A1:A1000)*COUNT(A1:A1000)/COUNTA(A1:A1000)
+ 5*(AVERAGE(A1:A1000)-5)*(COUNTA(A1:A1000)-COUNT(A1:A1000))
/COUNTA(A1:A1000)/(MAX(A1:A1000)-5)

Of course, you can simplify and optimize by calculating AVERAGE(A1:A1000),
COUNT(A1:A1000) and COUNTA(A1:A1000) one time in helper cells.

Note: *Instead of "5*...", you might want to substitute the smallest
measurable amount less than 5. *For example, if your measuring instrument
has a precision of 0.01, use "4.99*..." instead of "5*...". *However, keep
"...-5" as is because that is the lowest measurable value of the ungrouped
numerical amounts.


What you have here is "censored" data - that is you know that the data
exists (and where) but not the actual value. If the data has a
recognizable distribution such as the log-normal (ie if your data
ibecomes normal if you log it) then there may well be a way of
estimating the mean. This seems to be the sort of think that you are
hinting at.
As a first step,you could try logging the data and drawing a histogram
of it. If the logged data looks normal with stuff on the left side
missing then you have censored log normal data. xt