View Single Post
  #11   Report Post  
Posted to microsoft.public.excel.programming
Tom Ogilvy Tom Ogilvy is offline
external usenet poster
 
Posts: 27,285
Default Automatically substituting missing data

So the question is answered or are you still looking for an answer?

--
Regards,
Tom Ogilvy



"Michael DiCostanzo" wrote in message
om...
Thanks all of you for your quick responses. Let me answer some of the
questions you've raised.

First, there is plenty of historical data (2 years worth), so we're
never going to have a situation where there is no data at the start.
For the purposes of this exercise, you can assume that we have plenty
of data at the beginning.

Second, I need to clarify the rules of substitution better. For
periods of missing data, N [where N is the number of hours of missing
data (< 24 hours)] we take the average of the readings N hours before
and N hours after the period of missing data. We can also have more
than one period of missing data in a 24 hour period.

So in the example I gave, we could have the following situation:

Hour Raw_Data Corrected_Data
1 7.51 7.51
2 7.45 7.45
3 7.60 7.60
4 7.55 7.55
5 7.525
6 7.525
7 7.45 7.45
8 7.50 7.50
9 7.25 7.25
10 7.30 7.30
11 7.45 7.45
12 7.30 7.30
13 7.30 7.30
14 7.30 7.30
15 7.30 7.30
16 7.45 7.40
17 7.50 7.50
18 7.475
19 7.475
20 7.50 7.50
21 7.45 7.40
22 7.30 7.30
23 7.30 7.30
24 7.30 7.30

For hours 18-19, the missing data would be calculated as (7.45 + 7.5 +
7.5 + 7.45)/4 = 7.475.

In this situation, the calculation has to be done twice. I usually
substitute the missing data a day or two later. In this way I assure
myself that I have sufficient data after the periods of missing data
to do the calculation.

Third, you're going to ask, what about periods of missing data that
are longer than 24 hours? The rules that we must follow state that in
this case we must use the maximum 1 hour value from the previous 30
days, not counting the current day. That value is substituted for all
periods of missing data greater than 24 hours.

Obviously, we try to avoid that situation. I don't expect that we'll
get periods of missing data longer than 24 hours now because I'm
paying very close attention to the quality of the data now.

What I'd like to do is at least handle periods of missing data less
than 24 hours. I'd like to put the raw data into an Excel file and
then copy the formula down to handle the missing data for me.

Thanks again for your response, the formula you provided does work.


Michael



"Frank Kabel" wrote in message

...
Hi Michael
try the following approach (using helper columns to make the formulas a
little bit less complex)
- Add a column B as helper column and enter the following array formula
in B2 (this is your second data row as row 1 always should contain
data). Formula ist entered with CTRL+SHFT+ENTER
=IF(A2="",MIN(IF(A3:$A$20<"",ROW(A3:$A$20),20))," ")
copy down for all rows

- add the helper column C and enter the following formula in C2 (also
array entered)
=IF(A2="",MAX(IF(A$1:A1<"",ROW($A$1:A1),0)),"")
and copy down

- if you like you can hide columns B+C
- now enter the following formula in D2 (this is the column with your
corrected data)
=IF(A2="",SUM(INDIRECT("A" & C2 & ":A" & C2-(B2-C2-2)),INDIRECT("A" &
B2 & ":A" & B2 + (B2-C2 - 2)))/(B2-C2-1),"")
and copy down




Notes:
this approach won't work in all circumstances. e.g.
- the first row is empty (as there is no data before
- the last row ist empty
- There are not enough rows before or after

--
Regards
Frank Kabel
Frankfurt, Germany

Michael DiCostanzo wrote:
I'm doing a project for a client where I have inherited someone

else's
data which is kept in Excel files. Here's what I want to do.

The data is updated each hour. Sometimes we get no data or bad data
for short periods of time. A sample of the data is shown below.

In this example, there is no data for hours 5-6. Since we have two
hours worth of missing data, we are required to substitute for the
missing data by taking the average of the data two hours before and
two hours after the period when the data is missing. (These are the
rules we have to live by.)

In this case, the numbers would be calculated as follows:

(7.60 + 7.55 + 7.45 + 7.50) / 4 = 7.525

The value of 7.525 is then substituted into the empty cells for hours
5-6 as shown in the Corrected_Data column.


Hour Raw_Data Corrected_Data
1 7.51 7.51
2 7.45 7.45
3 7.60 7.60
4 7.55 7.55
5 7.525
6 7.525
7 7.45 7.45
8 7.50 7.50

If there were 3 hours of missing data, then we would have substituted
for the missing data by taking the average of the data three hours
before and three hours after the period when the data is missing. I
currently do the substitution manually. I really want to append the
raw data into a spreadsheet and then calculate the Corrected_Data
automatically.

Can anyone tell me how to do that in Excel? As I mentioned, I've
inherited this system, so I have to live with it for now. We're
looking to replace it and if the answer is that this can't be done in
Excel, that's OK. I want to transition to a newer system as smoothly
as possible.

Thanks in advance