View Single Post
  #7   Report Post  
Posted to microsoft.public.excel.programming
joeu2004 joeu2004 is offline
external usenet poster
 
Posts: 2,059
Default how to find union of two arrays

On May 22, 6:34*am, Andrew wrote:
For k1 = 1 to 50000
dataX=worksheets(1).cells(k1,1)
For k2=1 to 50000
dataY= worksheets(2).cells(k2,1)
If dataX=dataY then
worksheets(1).cells(k1,2)=worksheets(2).cells(k2,2 )
end if
next
next

[....]
This loop has to execute 50000^2 loops, 2.5 billion loops.

[....]
I don't know any tricks on how to speed this up.


Well, that depends on details of the data and algorithm that you have
not mentioned.

(Note: I will use Sheet1 for Worksheets(1) and Sheet2 for
Worksheets(2), although those might not be the actual worksheet
names.)

First, your algorithm continues to look for a match even after finding
one. Effectively, it finds the __last__ match. Is that a
requirement? Or is it sufficient to find the __first__ match?

If you want to find the last match, your inner for-loop can be
improved by doing For k2=50000 to 1 step -1.

In either case, the inner loop can stop after finding a match. So you
can put the following statement before End If: Exit For.

Thus, on average a match is found after 25000 comparisons instead of
50000, assuming unique data in both Sheet1!A1:A50000 and Sheet2!
A1:A50000.

That cuts the total iterations in half, down to 1.25 billion.

Second, in your example, Sheet1!A1:A50000 is in ascending order. If
that is always the case, you could use a binary search instead of
linear search. Thus, a match is found in at most 16 comparisions.

That cuts the total iterations down to 800,000 at most, with some
probability that it is much less.

Finally, it is much more efficient to copy Sheet1!A1:B50000 and Sheet2!
A1:A50000 into a Variant variable one time, and to retain the results
in a Variant variable and copy them to Sheet2!B1:B50000 one time.

The time savings can be huge. Every time you access an Excel range,
VBA and Excel processes much exchange information -- two interprocess
communications -- and the Excel process must be scheduled to run while
the VBA process waits.

Moreover, even if Sheet1!A1:A50000 is not in ascending order, it is
probably advantageous to take the time to sort dataX below, if we
implemented a binary search (not included below).

So at a minimum (without the binary search), your VBA algorithm should
be:

Dim dataX, dataY
dataX = Worksheets(1).Range("A1:B50000")
dataY = Worksheets(2).Range("A1:A50000")
redim resultY(1 to 50000, 1)
for k1 = 1 to 50000: for k2 = 1 to 50000
if dataY(k1,1) = dataX(k2,1) then
resultY(k1,2) = dataX(k2,2)
exit for
end if
next: next
Worksheets(2).Range("B1:B50000") = resultY

(Note that I reversed the order of the for-loops. It would make a
difference if you implement a binary search of Sheet1!A1:A50000.)

Other improvements are possible and probably desirable; for example,
avoiding the hardcoded ranges. That is why I use Redim instead of Dim
for resultY.

By reducing the interprocess communications (copying of data between
VBA and Excel), I would be very surprised if that VBA implementation
fails to outperform any of the alternatives mentioned so far: VBA
Match/Index, .Find, and Excel COUNTIF/INDEX/MATCH.

(Although the Excel alternative might be more efficient if IFERROR can
be used instead of COUNTIF.)

I might post relative performance data later, if I get around to it.