Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1   Report Post  
Posted to microsoft.public.excel.worksheet.functions
external usenet poster
 
Posts: 28
Default Mixed numerical and textual data

When 50% of a data set is a text value, and the rest is numeric,
typically the average is calculated by either making the text value a
numeric, then calculating the mean, or, ignoring the text value and
average the numeric values only.

For instance, the lowest detectable limit for serum aluminum may be
<5, which is a text value. The remaining data ranges between 5-100.
Making all <5s = 5 will overestimate the mean because the <5 values
range between 0 and 4.9, not 5. By ignoring the <5 is worse because
if the distribution is gaussian, the mean should be close to 47, where
the other approach has a mean around 20.

But there should be a way to take the slope of the distribution as it
approaches the lower reportable limit and the number of text values
then determine the numeric equivalent of the values <5 and then, with
the numeric values, determine the mean of the data set as well as
other statistical measurements.

Does anyone know of such a calculation, a similar, or equivalent
calculation?
  #2   Report Post  
Posted to microsoft.public.excel.worksheet.functions
external usenet poster
 
Posts: 38
Default Mixed numerical and textual data

On Sep 24, 6:38*am, jimbo wrote:
When 50% of a data set is a text value, and the rest is numeric,
typically the average is calculated by either making the text value a
numeric, then calculating the mean, or, ignoring the text value and
average the numeric values only.

For instance, the lowest detectable limit for serum aluminum may be
<5, which is a text value. The remaining data ranges between 5-100.
Making all <5s = 5 will overestimate the mean because the <5 values
range between 0 and 4.9, not 5. *By ignoring the <5 is worse because
if the distribution is gaussian, the mean should be close to 47, where
the other approach has a mean around 20.

But there should be a way to take the slope of the distribution as it
approaches the lower reportable limit and the number of text values
then determine the numeric equivalent of the values <5 and then, with
the numeric values, determine the mean of the data set as well as
other statistical measurements.

Does anyone know of such a calculation, a similar, or equivalent
calculation?


Replace "<5" with 2.45.

Or some other value if you know the distribution of the values between
0 & 4.9

Alan Lloyd
  #3   Report Post  
Posted to microsoft.public.excel.worksheet.functions
external usenet poster
 
Posts: 829
Default Mixed numerical and textual data

"jimbo" wrote:
For instance, the lowest detectable limit for
serum aluminum may be <5, which is a text value.
The remaining data ranges between 5-100.

[....]
But there should be a way to take the slope of
the distribution as it approaches the lower
reportable limit and the number of text values
then determine the numeric equivalent of the
values <5 and then, with the numeric values,
determine the mean of the data set as well as
other statistical measurements.

Does anyone know of such a calculation, a similar,
or equivalent calculation?


I think what you have in mind is:

1. Prorate the average of the ungrouped numerical amounts (5 to 100)
to the grouped text amounts ("<5", which I assume is 0 to 5).

2. Compute the weighted average of the two.

Suppose you have 1000 data points in A1:A1000, some numerical and the rest
being "<5".

AVERAGE(A1:A1000) is the average of the ungrouped numerical amounts because
AVERAGE ignores empty cells and cells with text.

Similarly, COUNT(A1:A1000) and MAX(A1:A1000) are the count and max of the
ungrouped numerical amounts.

COUNTA(A1:A1000) is the count of both ungrouped numerical and grouped text
amounts, ignoring any empty cells.

The prorated average of grouped amounts ("<5") is:

=5*(AVERAGE(A1:A1000)-5)/(MAX(A1:A1000)-5)

And the weighted average of the two groups (numericals and "<5") is:

=AVERAGE(A1:A1000)*COUNT(A1:A1000)/COUNTA(A1:A1000)
+ 5*(AVERAGE(A1:A1000)-5)*(COUNTA(A1:A1000)-COUNT(A1:A1000))
/COUNTA(A1:A1000)/(MAX(A1:A1000)-5)

Of course, you can simplify and optimize by calculating AVERAGE(A1:A1000),
COUNT(A1:A1000) and COUNTA(A1:A1000) one time in helper cells.

Note: Instead of "5*...", you might want to substitute the smallest
measurable amount less than 5. For example, if your measuring instrument
has a precision of 0.01, use "4.99*..." instead of "5*...". However, keep
"...-5" as is because that is the lowest measurable value of the ungrouped
numerical amounts.

  #4   Report Post  
Posted to microsoft.public.excel.worksheet.functions
external usenet poster
 
Posts: 829
Default Mixed numerical and textual data

Errata....

I wrote:
The prorated average of grouped amounts ("<5") is:

=5*(AVERAGE(A1:A1000)-5)/(MAX(A1:A1000)-5)

And the weighted average of the two groups (numericals and "<5") is:

=AVERAGE(A1:A1000)*COUNT(A1:A1000)/COUNTA(A1:A1000)
+ 5*(AVERAGE(A1:A1000)-5)*(COUNTA(A1:A1000)-COUNT(A1:A1000))
/COUNTA(A1:A1000)/(MAX(A1:A1000)-5)


I tried to write the formula so that there was no assumption of an upper
bound for the ungrouped numerical amounts, even though you said it is 1000.
In some situations, we might have know the upper bound.

However, obviously that formula does not work correctly if all the numerical
amounts are the same, notably 5 or even 5.01.

So I guess we must always know some upper bound; that makes sense for
mapping the distribution. And MAX(A1:A1000)-5 should be 1000-5 = 95.

  #5   Report Post  
Posted to microsoft.public.excel.worksheet.functions
Xt Xt is offline
external usenet poster
 
Posts: 49
Default Mixed numerical and textual data

On Sep 25, 3:21*pm, "joeu2004" wrote:
"jimbo" wrote:
For instance, the lowest detectable limit for
serum aluminum may be <5, which is a text value.
The remaining data ranges between 5-100.

[....]
But there should be a way to take the slope of
the distribution as it approaches the lower
reportable limit and the number of text values
then determine the numeric equivalent of the
values <5 and then, with the numeric values,
determine the mean of the data set as well as
other statistical measurements.


Does anyone know of such a calculation, a similar,
or equivalent calculation?


I think what you have in mind is:

1. Prorate the average of the ungrouped numerical amounts (5 to 100)
* *to the grouped text amounts ("<5", which I assume is 0 to 5).

2. Compute the weighted average of the two.

Suppose you have 1000 data points in A1:A1000, some numerical and the rest
being "<5".

AVERAGE(A1:A1000) is the average of the ungrouped numerical amounts because
AVERAGE ignores empty cells and cells with text.

Similarly, COUNT(A1:A1000) and MAX(A1:A1000) are the count and max of the
ungrouped numerical amounts.

COUNTA(A1:A1000) is the count of both ungrouped numerical and grouped text
amounts, ignoring any empty cells.

The prorated average of grouped amounts ("<5") is:

=5*(AVERAGE(A1:A1000)-5)/(MAX(A1:A1000)-5)

And the weighted average of the two groups (numericals and "<5") is:

=AVERAGE(A1:A1000)*COUNT(A1:A1000)/COUNTA(A1:A1000)
+ 5*(AVERAGE(A1:A1000)-5)*(COUNTA(A1:A1000)-COUNT(A1:A1000))
/COUNTA(A1:A1000)/(MAX(A1:A1000)-5)

Of course, you can simplify and optimize by calculating AVERAGE(A1:A1000),
COUNT(A1:A1000) and COUNTA(A1:A1000) one time in helper cells.

Note: *Instead of "5*...", you might want to substitute the smallest
measurable amount less than 5. *For example, if your measuring instrument
has a precision of 0.01, use "4.99*..." instead of "5*...". *However, keep
"...-5" as is because that is the lowest measurable value of the ungrouped
numerical amounts.


What you have here is "censored" data - that is you know that the data
exists (and where) but not the actual value. If the data has a
recognizable distribution such as the log-normal (ie if your data
ibecomes normal if you log it) then there may well be a way of
estimating the mean. This seems to be the sort of think that you are
hinting at.
As a first step,you could try logging the data and drawing a histogram
of it. If the logged data looks normal with stuff on the left side
missing then you have censored log normal data. xt
Reply
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules

Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Tracking textual Data changes beween 2 columns & highlight diff nu Rameses Excel Discussion (Misc queries) 1 March 4th 10 01:27 AM
Tracking textual Data changes beween 2 columns & highlight diff nu Rameses Excel Discussion (Misc queries) 0 March 4th 10 12:45 AM
Conditional Formatting; finding numerical data in mixed text. Philonis Excel Discussion (Misc queries) 2 June 5th 07 11:57 PM
How do I assign a numerical value to a textual entry to calculate percent completion? mheavers Excel Worksheet Functions 2 September 11th 06 07:58 PM
Changing Y-Axis Labels from Numerical to Textual [email protected] Charts and Charting in Excel 1 December 12th 05 10:40 AM


All times are GMT +1. The time now is 12:34 PM.

Powered by vBulletin® Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 ExcelBanter.
The comments are property of their posters.
 

About Us

"It's about Microsoft Excel"