Home |
Search |
Today's Posts |
#1
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
Hello,
Does anyone know a means of comparing two very large arrays (50000 elements) to determine which elements are common to both arrays? I've tried using looping functions and that is far too slow. |
#2
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
hi,
i don't know why your looping is far too slow, it should not, can you show your code -- isabelle Le 2011-05-21 11:43, Andrew a écrit : Hello, Does anyone know a means of comparing two very large arrays (50000 elements) to determine which elements are common to both arrays? I've tried using looping functions and that is far too slow. |
#3
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
On May 21, 7:43*pm, isabelle wrote:
hi, i don't know why your looping is far too slow, it should not, can you show your code -- isabelle Le 2011-05-21 11:43, Andrew a écrit : Hello, Does anyone know a means of comparing two very large arrays (50000 elements) to determine which elements are common to both arrays? *I've tried using looping functions and that is far too slow. Isabel, Imagine one range (A) with 50,000 rows, 2 columns, a second range with 50,000 rows (B) 1 column. I want to find all the elements in B which have matches in column 1 of A. For each match, I want to copy the corresponding entry in A column 2 over to a new column 2 in B. Example. A= 1 a 2 f 3 u 4 w 5 q B= 5 1 6 8 9 After running the code, B would be transformed to: B= 5 q 1 a 8 9 6 To do this I create the following loops. For k1 = 1 to 50000 dataX=worksheets(1).cells(k1,1) For k2=1 to 50000 dataY= worksheets(2).cells(k2,1) If dataX=dataY then worksheets(1).cells(k1,2)=worksheets(2).cells(k2,2 ) end if next next This code is accurate, but terrribly slow. This loop has to execute 50000^2 loops, 2.5 billion loops. Let's say that each iteration requires 20 CPU clock cycles, that's 50 billion clock cycles. My CPU has a 2.5 GHz processor, so this ideally would take 20 seconds. It takes more than that, about 45 seconds. If I were doing this in Matlab or C, I could process this much data in less than a second, but not because the CPU is faster. There are other programming constructs available which are faster. I know Excel can run faster than this, but I don't know any tricks on how to speed this up. Any ideas? |
#4
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
hi Andrew,
with the functions "Index" and "Match" you can reduce the number of loop Code:
Sub Macro1() Dim Wks1 As Worksheet, Wks2 As Worksheet, k2 As Integer Set Wks1 = Worksheets(1) Set Wks2 = Worksheets(2) For k2 = 1 To 5 If Not IsError(Application.Match(Wks2.Cells(k2, 1), Wks1.Range("A:A"), 0)) Then Wks2.Cells(k2, 2) = Application.Index(Wks1.Range("B:B"), Application.Match(Wks2.Cells(k2, 1), Wks1.Range("A:A"), 0)) End If Next Set Wks1 = Nothing Set Wks2 = Nothing End Sub -- isabelle Le 2011-05-22 09:34, Andrew a écrit : On May 21, 7:43 pm, wrote: hi, i don't know why your looping is far too slow, it should not, can you show your code -- isabelle Le 2011-05-21 11:43, Andrew a écrit : Hello, Does anyone know a means of comparing two very large arrays (50000 elements) to determine which elements are common to both arrays? I've tried using looping functions and that is far too slow. Isabel, Imagine one range (A) with 50,000 rows, 2 columns, a second range with 50,000 rows (B) 1 column. I want to find all the elements in B which have matches in column 1 of A. For each match, I want to copy the corresponding entry in A column 2 over to a new column 2 in B. Example. A= 1 a 2 f 3 u 4 w 5 q B= 5 1 6 8 9 After running the code, B would be transformed to: B= 5 q 1 a 8 9 6 To do this I create the following loops. For k1 = 1 to 50000 dataX=worksheets(1).cells(k1,1) For k2=1 to 50000 dataY= worksheets(2).cells(k2,1) If dataX=dataY then worksheets(1).cells(k1,2)=worksheets(2).cells(k2,2 ) end if next next This code is accurate, but terrribly slow. This loop has to execute 50000^2 loops, 2.5 billion loops. Let's say that each iteration requires 20 CPU clock cycles, that's 50 billion clock cycles. My CPU has a 2.5 GHz processor, so this ideally would take 20 seconds. It takes more than that, about 45 seconds. If I were doing this in Matlab or C, I could process this much data in less than a second, but not because the CPU is faster. There are other programming constructs available which are faster. I know Excel can run faster than this, but I don't know any tricks on how to speed this up. Any ideas? |
#5
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
Imagine one range (A) with 50,000 rows, 2 columns, a second range with
50,000 rows (B) 1 column. I want to find all the elements in B which have matches in column 1 of A. For each match, I want to copy the corresponding entry in A column 2 over to a new column 2 in B. Example. A= 1 a 2 f 3 u 4 w 5 q B= 5 1 6 8 9 After running the code, B would be transformed to: B= 5 q 1 a 8 9 6 To do this I create the following loops. For k1 = 1 To 50000 dataX = Worksheets(1).Cells(k1, 1) For k2 = 1 To 50000 dataY = Worksheets(2).Cells(k2, 1) If dataX = dataY Then Worksheets(1).Cells(k1, 2) = Worksheets(2).Cells(k2, 2) End If Next Next This code is accurate, but terrribly slow. This loop has to execute 50000^2 loops, 2.5 billion loops. Let's say that each iteration requires 20 CPU clock cycles, that's 50 billion clock cycles. My CPU has a 2.5 GHz processor, so this ideally would take 20 seconds. It takes more than that, about 45 seconds. If I were doing this in Matlab or C, I could process this much data in less than a second, but not because the CPU is faster. There are other programming constructs available which are faster. I know Excel can run faster than this, but I don't know any tricks on how to speed this up. Why use code at all? You can do this with straight Excel formulas. Assuming your range "A" is located on Sheet1, Columns A and B, starting in Row 1; and assuming your range "B" is located in Column A starting in Row 1 on any other sheet, put this formula in B1 on that other sheet and copy it down... =IF(COUNTIF(Sheet1!A$1:A$500,A1),MATCH(A1,Sheet1!A $1:A$500,0),"") Rick Rothstein (MVP - Excel) |
#6
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
On Sun, 22 May 2011 06:34:35 -0700 (PDT), Andrew wrote:
On May 21, 7:43*pm, isabelle wrote: hi, i don't know why your looping is far too slow, it should not, can you show your code -- isabelle Le 2011-05-21 11:43, Andrew a écrit : Hello, Does anyone know a means of comparing two very large arrays (50000 elements) to determine which elements are common to both arrays? *I've tried using looping functions and that is far too slow. Isabel, Imagine one range (A) with 50,000 rows, 2 columns, a second range with 50,000 rows (B) 1 column. I want to find all the elements in B which have matches in column 1 of A. For each match, I want to copy the corresponding entry in A column 2 over to a new column 2 in B. Example. A= 1 a 2 f 3 u 4 w 5 q B= 5 1 6 8 9 After running the code, B would be transformed to: B= 5 q 1 a 8 9 6 To do this I create the following loops. For k1 = 1 to 50000 dataX=worksheets(1).cells(k1,1) For k2=1 to 50000 dataY= worksheets(2).cells(k2,1) If dataX=dataY then worksheets(1).cells(k1,2)=worksheets(2).cells(k2, 2) end if next next This code is accurate, but terrribly slow. This loop has to execute 50000^2 loops, 2.5 billion loops. Let's say that each iteration requires 20 CPU clock cycles, that's 50 billion clock cycles. My CPU has a 2.5 GHz processor, so this ideally would take 20 seconds. It takes more than that, about 45 seconds. If I were doing this in Matlab or C, I could process this much data in less than a second, but not because the CPU is faster. There are other programming constructs available which are faster. I know Excel can run faster than this, but I don't know any tricks on how to speed this up. Any ideas? If you don't want to do it with worksheet formulas, as suggested by Rick, I think the following code will only have to execute 50,000 loops, so should run quicker. It does assume that your RangeA is two columns and RangeB is one column, as you wrote. It also assumes that the data in RangeA is not the result of formulas, but that can obviously be changed in the code. The code also assumes that if there are any duplicates in RangeA column1, that you only want to return the first one for the corresponding RangeB entry. =================================== Option Explicit Sub RangeAtoB() Dim arB As Variant Dim i As Long Dim c As Range 'since RangeB is one column, must make it two arB = Range("RangeB").Resize(columnsize:=2) With Range("RangeA").Resize(columnsize:=1) For i = 1 To UBound(arB) Set c = .Find(what:=arB(i, 1), _ after:=.Item(1), _ LookIn:=xlValues, _ lookat:=xlWhole) If Not c Is Nothing Then arB(i, 2) = c.Offset(columnoffset:=1) Next i End With Range("rangeB").Resize(columnsize:=2) = arB End Sub ============================== |
#7
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
On May 22, 6:34*am, Andrew wrote:
For k1 = 1 to 50000 dataX=worksheets(1).cells(k1,1) For k2=1 to 50000 dataY= worksheets(2).cells(k2,1) If dataX=dataY then worksheets(1).cells(k1,2)=worksheets(2).cells(k2,2 ) end if next next [....] This loop has to execute 50000^2 loops, 2.5 billion loops. [....] I don't know any tricks on how to speed this up. Well, that depends on details of the data and algorithm that you have not mentioned. (Note: I will use Sheet1 for Worksheets(1) and Sheet2 for Worksheets(2), although those might not be the actual worksheet names.) First, your algorithm continues to look for a match even after finding one. Effectively, it finds the __last__ match. Is that a requirement? Or is it sufficient to find the __first__ match? If you want to find the last match, your inner for-loop can be improved by doing For k2=50000 to 1 step -1. In either case, the inner loop can stop after finding a match. So you can put the following statement before End If: Exit For. Thus, on average a match is found after 25000 comparisons instead of 50000, assuming unique data in both Sheet1!A1:A50000 and Sheet2! A1:A50000. That cuts the total iterations in half, down to 1.25 billion. Second, in your example, Sheet1!A1:A50000 is in ascending order. If that is always the case, you could use a binary search instead of linear search. Thus, a match is found in at most 16 comparisions. That cuts the total iterations down to 800,000 at most, with some probability that it is much less. Finally, it is much more efficient to copy Sheet1!A1:B50000 and Sheet2! A1:A50000 into a Variant variable one time, and to retain the results in a Variant variable and copy them to Sheet2!B1:B50000 one time. The time savings can be huge. Every time you access an Excel range, VBA and Excel processes much exchange information -- two interprocess communications -- and the Excel process must be scheduled to run while the VBA process waits. Moreover, even if Sheet1!A1:A50000 is not in ascending order, it is probably advantageous to take the time to sort dataX below, if we implemented a binary search (not included below). So at a minimum (without the binary search), your VBA algorithm should be: Dim dataX, dataY dataX = Worksheets(1).Range("A1:B50000") dataY = Worksheets(2).Range("A1:A50000") redim resultY(1 to 50000, 1) for k1 = 1 to 50000: for k2 = 1 to 50000 if dataY(k1,1) = dataX(k2,1) then resultY(k1,2) = dataX(k2,2) exit for end if next: next Worksheets(2).Range("B1:B50000") = resultY (Note that I reversed the order of the for-loops. It would make a difference if you implement a binary search of Sheet1!A1:A50000.) Other improvements are possible and probably desirable; for example, avoiding the hardcoded ranges. That is why I use Redim instead of Dim for resultY. By reducing the interprocess communications (copying of data between VBA and Excel), I would be very surprised if that VBA implementation fails to outperform any of the alternatives mentioned so far: VBA Match/Index, .Find, and Excel COUNTIF/INDEX/MATCH. (Although the Excel alternative might be more efficient if IFERROR can be used instead of COUNTIF.) I might post relative performance data later, if I get around to it. |
#8
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
Errata....
On May 22, 4:14*pm, I wrote: redim resultY(1 to 50000, 1) Should be: redim resultY(1 to 50000, 1 to 1) I wrote: resultY(k1,2) = dataX(k2,2) Should be: resultY(k1,1) = dataX(k2,2) I wrote: Every time you access an Excel range, VBA and Excel processes much exchange information -- two interprocess communications Should be: "must exchange" |
#9
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
On May 22, 7:43*am, "Rick Rothstein"
put this formula in B1 on that other sheet and copy it down... =IF(COUNTIF(Sheet1!A$1:A$500,A1), MATCH(A1,Sheet1!A$1:A$500,0),"") Presumably you mean: =IF(COUNTIF(Sheet1!A$1:A$50000,A1), INDEX(Sheet1!B$1:B$50000,MATCH(A1,Sheet1!A$1:A$500 00,0)),"") Or =IF(COUNTIF(Sheet1!A$1:A$50000,A1), VLOOKUP(A1,Sheet1!A$1:B$50000,2,0),"") Since COUNTIF will always do 50000 comparisons, on average this does 75000 comparisons for each of 50000 entries in Sheet2!A1:A50000, a total of 3.75 billion comparisions versus 1.25 billion for Andrew's algorithm (corrected). Nonetheless, this __might__ be an improvement because it avoids the interprocess communication and process scheduling time between VBA and Excel. (But I would be surprised.) |
#10
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
On May 22, 7:07*am, isabelle wrote:
Sub Macro1() Dim Wks1 As Worksheet, Wks2 As Worksheet, k2 As Integer Set Wks1 = Worksheets(1) Set Wks2 = Worksheets(2) [....] Set Wks1 = Nothing Set Wks2 = Nothing End Sub Why set Wks1 and Wks2 to Nothing before exiting the macro? Since they are non-static local variables, I would expect VBA to effectively do the same thing automagically. Is there a memory leak in some revisions of VBA? |
#11
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
hi joeu2004,
Is there a memory leak ? yes there are in some revisions of VBA? it's valid for all versions of VBA -- isabelle |
#12
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
joeu2004 was thinking very hard :
On May 22, 7:07*am, isabelle wrote: Sub Macro1() Dim Wks1 As Worksheet, Wks2 As Worksheet, k2 As Integer Set Wks1 = Worksheets(1) Set Wks2 = Worksheets(2) [....] Set Wks1 = Nothing Set Wks2 = Nothing End Sub Why set Wks1 and Wks2 to Nothing before exiting the macro? Since they are non-static local variables, I would expect VBA to effectively do the same thing automagically. Is there a memory leak in some revisions of VBA? My understanding of how this works is that it's just good programming practice to do this because even though the objects are destroyed when the procedure ends, the memory space that they occupied in the Stack still persists. I may be wrong, but this is how it was explained to me by the leading minds in Excel application development. -- Garry Free usenet access at http://www.eternal-september.org ClassicVB Users Regroup! comp.lang.basic.visual.misc |
#13
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
On May 22, 4:14 pm, joeu2004 wrote:
Moreover, even if Sheet1!A1:A50000 is not in ascending order, it is probably advantageous to take the time to sort dataX below, if we implemented a binary search No! A selection sort requires nearly 1.250 billion (10^9) comparisions, overwhelming any benefit of the binary search. Excel sort can do that in a blink of an eye. But a VBA implementation takes "forever". Presumably VBA is significantly slower because it is interpreted, not compiled. I wrote: By reducing the interprocess communications (copying of data between VBA and Excel), I would be very surprised if that VBA implementation fails to outperform any of the alternatives mentioned so far: VBA Match/Index, .Find, and Excel COUNTIF/INDEX/MATCH. Color me "surprised"! When I made that comment, I forgot to account for the fact that VBA is interpreted, not compiled. Nonetheless, on my computer, a VBA binary search (50,000 times through 50,000 elements) is nearly 2.4 times faster than Isabelle's VBA Match algorithm (significantly improved and modified to use a binary search), and nearly 800 to 1200 times faster than Ron's VBA Find algorithm and Rick's Excel COUNTIF/MATCH formula (improved to use VLOOKUP and a binary search). However, Rick's Excel formula can be nearly 3.7 times faster than the VBA implementation if we use a helper column for the VLOOKUP (with a binary search) and IF(ISERROR) instead of COUNTIF. In XL2007 and later, I suspect we would get comparable results with IFERROR(VLOOKUP) without the need for a helper column. The improved VBA linear search is 1.8 to 2.2 times faster than Ron's VBA Find algorithm and Rick's Excel COUNTIF/MATCH formula (improved to use VLOOKUP, but a linear search). However, Isabelle's VBA Match algorithm (significantly improved, but a linear search) is nearly 5.8 times faster than the improved VBA linear search. Even Isabelle's obviously inefficient implementation (doing the Match twice) is nearly 2.9 times faster. And again, Rick's Excel formula can be nearly 6 times faster than the improved VBA linear search if we use a helper column for the VLOOKUP (with a linear search). YMMV significantly for a variety of reasons. Since my improved VBA linear search algorithm runs nearly 7 times longer than what Andrew claims ("45 seconds"), I suspect that my system is slow because of small memory, small cache or both relative to more modern systems. Also, my CPU is single-core and single-threaded (instruction "threads", not process threads). At first, I did not think that would be a factor, since VBA does not take advantage of multiple cores and instruction threads. However, Excel can, perhaps favoring Rick's approach. And I suspect it might improve Isabelle's and Ron's approaches for similar reasons; but that is based on speculation. |
#14
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
"Andrew" wrote in message This loop has to execute 50000^2 loops, 2.5 billion loops. Let's say that each iteration requires 20 CPU clock cycles, that's 50 billion clock cycles. You've been given many suggestions, particularly about taking advantage of using Excel's built in functions. However which ever approach, with this type of thing it's really worthwhile to start with both arrays sorted, typically that's a one off task, or not much extra work when adding small amounts of data. Once both sorted, search for the first matching item (and further duplicates). Then search the second item starting at index+1 of the last item found. Obviously in any loop after finding a matching item, as soon as the next item doesn't match (ie after any duplicates) exit that loop and start the next. You should end up with a fraction of the amount of working you are doing now. If regularly trying to find individual items in a large sorted array, search "binary chop", which will find matches in negligible time even in large arrays. Peter Thornton |
#15
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
put this formula in B1 on that other sheet
and copy it down... =IF(COUNTIF(Sheet1!A$1:A$500,A1), MATCH(A1,Sheet1!A$1:A$500,0),"") Presumably you mean: =IF(COUNTIF(Sheet1!A$1:A$50000,A1), INDEX(Sheet1!B$1:B$50000,MATCH(A1,Sheet1!A$1:A$500 00,0)),"") Or =IF(COUNTIF(Sheet1!A$1:A$50000,A1), VLOOKUP(A1,Sheet1!A$1:B$50000,2,0),"") I can't believe I forgot to included it... I was going for the INDEX formula version. Thanks for catching that. Rick Rothstein (MVP - Excel) |
#16
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
I really like reading your messages... the amount of analysis you are
willing to perform, and then write up for us to read, is simply awe-inspiring. Thank you for always taking the time to do this. Rick Rothstein (MVP - Excel) |
#17
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
On May 22, 7:35*pm, isabelle wrote:
[joeu2004 wrote:] Is there a memory leak ? yes there are in some revisions of VBA? it's valid for all versions of VBA I do not notice any such memory leak in XL2003 SP3 with VBA 6.5.1024. I have only Task Manager Mem Usage and VM Size statistics to look at. If there is a kernel function that returns the current process's memory usage, please let me know. I was unable to find one. I tried the module below. After the MsgBox appears, allow 1-2 seconds for Task Manager to update. With #Const globDef set to False (local ws array), both Mem Usage and VM Size increases by about 4 MB. After exiting the procedure, the Mem Usage and VM Size reverts to their previous values. Also note that setting the ws array to Nothing does __not__ reduce memory usage. Moreover, it is not even necessary to set the ws to Worksheets(1). The local declaration alone is sufficient to increase Mem Usage and VM Size. This is consistent with my understanding that the Dim statement alone causes the allocation of the memory for the ws array. (Aside.... I also tried changing Dim to ReDim, with the same results.) With #Const globDef set to True (non-local ws array), both Mem Usage and VM Size increases by about 4 MB the first time the procedure is executed. In this case, after exiting the procedure, Mem Usage and VM Size remains at that level, about 4 MB over their previous values. That is to be expected, since global storage __should__ persist between procedure calls. And again note that setting the ws array to Nothing does __not__ reduce memory usage. And again note that it is not even necessary to set ws to Worksheets(1). Mem Usage and VM Size does revert to their previous values when I remove the global ws array declaration, for example by setting #Const globDef to False. The memory reduction happens almost immediately, after waiting 1-2 seconds for Task Manager to update. (Aside.... I would expect a local Static declaration to behave the same way as a non-local declaration. But I did not test it. "The exercise is left to the student" ;-.) ----- #Const globDef = False Private Declare Function GetCurrentProcessId _ Lib "kernel32" () As Long #If globDef Then Dim ws(1 To 1000000) As Worksheet #End If Sub testit() #If Not globDef Then Dim ws(1 To 1000000) As Worksheet #End If For i = 1 To UBound(ws) Set ws(i) = Worksheets(1) Next MsgBox GetCurrentProcessId() & " set to Worksheets" For i = 1 To UBound(ws) Set ws(i) = Nothing Next MsgBox GetCurrentProcessId() & " set to Nothing" End Sub |
#18
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
<FWIW
Here's a snippet from Professional Excel Development Ch.07 that explains how memory leaks occur, and why it's considered *good practice* to set objects we're done with to *= Nothing*. <Quote "...Normally when you overwrite an object in VBA, VBA cleans up the old version of the object and reclaims the memory that was used to hold it. You can also set an object equal to Nothing to reclaim the memory used by it. It is good practice to do this explicitly when you no longer need an object, rather than relying on VBA to do it. Set gclsCells = Nothing When you create two objects that store references to each other the system will no longer reclaim the memory they used when they are set to new versions or when they are set to Nothing. When analyzing the worksheet in Analysis5.xls with 574 cells in the used range, there is a loss of about 250KB of RAM each time CreateCellsCollection is executed during an Excel session. NOTE If you are running Windows NT, 2000 or XP you can check the amount of RAM currently used by Excel by pressing Ctrl+Shift+Esc to display the Processes window in Task Manager and examining the Mem Usage column for the row where the Image Name column is EXCEL.EXE. One way to avoid this problem is to make sure you remove the cross references from the linked objects before the objects are removed. You can do this by adding a method like the Terminate method shown in Listing 7-15 to the problem classes, in our case the CCell class. Listing 7-15 The Terminate Method in the CCell Class Module Public Sub Terminate() Set mclsParent = Nothing End Sub The code in Listing 7-16 is added to the CCells class module. It calls the Terminate method of each Cell class contained in the collection to destroy the cross reference between the classes. Listing 7-16 The Terminate Method in the CCells Class Module Public Sub Terminate() Dim clsCell As CCell For Each clsCell In mcolCells clsCell.Terminate Set clsCell = Nothing Next clsCell Set mcolCells = Nothing End Sub..." </Quote -- Garry Free usenet access at http://www.eternal-september.org ClassicVB Users Regroup! comp.lang.basic.visual.misc |
#19
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
Peter T pretended :
"Andrew" wrote in message This loop has to execute 50000^2 loops, 2.5 billion loops. Let's say that each iteration requires 20 CPU clock cycles, that's 50 billion clock cycles. You've been given many suggestions, particularly about taking advantage of using Excel's built in functions. However which ever approach, with this type of thing it's really worthwhile to start with both arrays sorted, typically that's a one off task, or not much extra work when adding small amounts of data. Once both sorted, search for the first matching item (and further duplicates). Then search the second item starting at index+1 of the last item found. Obviously in any loop after finding a matching item, as soon as the next item doesn't match (ie after any duplicates) exit that loop and start the next. You should end up with a fraction of the amount of working you are doing now. If regularly trying to find individual items in a large sorted array, search "binary chop", which will find matches in negligible time even in large arrays. Peter Thornton Peter, I think the time it takes lies in the fact that *Arrays* aren't being compared, but rather the loops read the ranges directly. If the 2 ranges were dumped into arrays where at least 1 array was sorted, a binary search would cut the processing time considerably. Better, though, (as you say) that both arrays are sorted. The bottleneck is trying to loop the ranges directly while doing this in memory would run orders of magnitude faster.<IMO -- Garry Free usenet access at http://www.eternal-september.org ClassicVB Users Regroup! comp.lang.basic.visual.misc |
#20
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
On May 23, 5:42*pm, GS wrote:
Peter T pretended : "Andrew" wrote in message This loop has to execute 50000^2 loops, 2.5 billion loops. Let's say that each iteration requires 20 CPU clock cycles, that's 50 billion clock cycles. You've been given many suggestions, particularly about taking advantage of using Excel's built in functions. However which ever approach, with this type of thing it's really worthwhile to start with both arrays sorted, typically that's a one off task, or not much extra work when adding small amounts of data. Once both sorted, search for the first matching item (and further duplicates). Then search the second item starting at index+1 of the last item found. Obviously in any loop after finding a matching item, as soon as the next item doesn't match (ie after any duplicates) exit that loop and start the next. You should end up with a fraction of the amount of working you are doing now. If regularly trying to find individual items in a large sorted array, search "binary chop", which will find matches in negligible time even in large arrays. Peter Thornton Peter, I think the time it takes lies in the fact that *Arrays* aren't being compared, but rather the loops read the ranges directly. If the 2 ranges were dumped into arrays where at least 1 array was sorted, a binary search would cut the processing time considerably. Better, though, (as you say) that both arrays are sorted. The bottleneck is trying to loop the ranges directly while doing this in memory would run orders of magnitude faster.<IMO -- Garry Free usenet access athttp://www.eternal-september.org ClassicVB Users Regroup! comp.lang.basic.visual.misc There's some very good information in this thread. Thank you. I have yet to try these find and match methods. Although I did use VLOOKUP for a search and reduced the search time (for the same 50,000 rows of data) down to 3 seconds. In using VLOOKUP I used named ranges for the first two input arguments, the search parameter and the lookup table. Then I tried to do the same thing using arrays instead of ranges. It took at least one hundred times longer. I would think that it would be faster to sort through an array than a range, but clearly this is not true. thanks again, Andrew |
#21
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
After serious thinking Andrew wrote :
I would think that it would be faster to sort through an array than a range, but clearly this is not true. Well, that depends on how you do it. Clearly there are fast ways and slow ways, and so if it took as long as you say it did then I suspect you used a slow way. For sure, if at least one of your arrays wasn't sorted, it would take a significant amount of time to process. Far less time if 1 array is sorted. Orders of magnitude less time if both arrays are sorted. Regardless, built-in Excel functions will almost always perform much faster than VBA will and so if Rick's suggestion works for you I'd be inclined to use it over VBA. -- Garry Free usenet access at http://www.eternal-september.org ClassicVB Users Regroup! comp.lang.basic.visual.misc |
#22
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
"GS" wrote in message Peter T pretended : "Andrew" wrote in message This loop has to execute 50000^2 loops, 2.5 billion loops. Let's say that each iteration requires 20 CPU clock cycles, that's 50 billion clock cycles. You've been given many suggestions, particularly about taking advantage of using Excel's built in functions. However which ever approach, with this type of thing it's really worthwhile to start with both arrays sorted, typically that's a one off task, or not much extra work when adding small amounts of data. Once both sorted, search for the first matching item (and further duplicates). Then search the second item starting at index+1 of the last item found. Obviously in any loop after finding a matching item, as soon as the next item doesn't match (ie after any duplicates) exit that loop and start the next. You should end up with a fraction of the amount of working you are doing now. If regularly trying to find individual items in a large sorted array, search "binary chop", which will find matches in negligible time even in large arrays. Peter Thornton Peter, I think the time it takes lies in the fact that *Arrays* aren't being compared, but rather the loops read the ranges directly. If the 2 ranges were dumped into arrays where at least 1 array was sorted, a binary search would cut the processing time considerably. Better, though, (as you say) that both arrays are sorted. The bottleneck is trying to loop the ranges directly while doing this in memory would run orders of magnitude faster.<IMO Garry, Indeed, reading cells is considerably longer than reading array values, though there will be some overhead initially reading 2x50k cells to a array or pair of arrays. It really is important to sort the first array, whether using built-in Excel functions or VBA. Also if using VBA extremely worthwhile to sort the 2nd array. That's because for each cell in the 2nd array, code will only need to look from the index+1 of the last match to the first item that doesn't match, so much less work than available to Excel's built-in functions. Despite Excel's normally faster methods, it's difficult to take advantage of that approach in Excel (although could be done) so VBA might be easier/faster. As for sorting, as you say in your adjacent post "that depends on how you do it". There are some extremely fast sort methods out there for VBA, some compare well if not better than Excel's built-in method (excluding the time to transfer from cells to array to cells), particularly if the data is all say doubles. However to keep things simple, maybe get VBA to leverage Excel's sort, then compare the sorted data in VBA as above. Without testing not sure which'd be fastest, VBA+Excel or just Excel. Peter T |
#23
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
"GS" wrote in message ... <FWIW Here's a snippet from Professional Excel Development Ch.07 that explains how memory leaks occur, and why it's considered *good practice* to set objects we're done with to *= Nothing*. <Quote "...Normally when you overwrite an object in VBA, VBA cleans up the old version of the object and reclaims the memory that was used to hold it. You can also set an object equal to Nothing to reclaim the memory used by it. It is good practice to do this explicitly when you no longer need an object, rather than relying on VBA to do it. Set gclsCells = Nothing Not sure I entirely agree, or rather it depends. There may be good reasons to do that for other objectives, eg done with the object and want to test if Nothing later. Also probably a good idea to clean up globals as you go when done. However simply good practice for the sake of being sure the object reference is destroyed, perhaps not. Ideally there should never be any doubt about when and where the reference will looses scope. VB/A's builtin garbage collection works very efficiently, and explicitly setting to Nothing is unnecessary work. Regards, Peter T |
#24
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
Peter T wrote on 5/24/2011 :
However to keep things simple, maybe get VBA to leverage Excel's sort, then compare the sorted data in VBA as above. Without testing not sure which'd be fastest, VBA+Excel or just Excel. Peter, This is what I usually do with large arrays as I find Excel's Sort much faster than any array sort. I simply dump the array into a temp wks, sort the data, dump the data back into the array, and delete the wks after I'm done with it. Works a treat. Otherwise, I use VB[A] methods as described in PED ch17 for sorting/matching in arrays. -- Garry Free usenet access at http://www.eternal-september.org ClassicVB Users Regroup! comp.lang.basic.visual.misc |
#25
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
On May 23, 2:01*pm, GS wrote:
Here's a snippet from Professional Excel Development Ch.07 that explains how memory leaks occur, and why it's considered *good practice* to set objects we're done with to *= Nothing*. That's a matter of opinion, just as some people might say it is "good practice" to explicitly initialize non-static local variables to zero in a procedure instead of relying on the fact they are initialized to zero each time a procedure is entered. The development guide describes a complex situation where, I would agree, it is necessary to set an object variable to Nothing, namely objects that reference other objects. But I was asking about a very limited case, namely: setting the object variable to Nothing just before exiting a procedure, when the object variable is a non-static local variable. In that case, VBA help explains: "When several object variables refer to the same object, memory and system resources associated with the object to which the variables refer are released only after all of them have been set to Nothing, either explicitly using Set, __or_implicitly__ after the last object variable set to Nothing goes out of scope." Exiting a procedure is an example of non-static local variable going out of scope. In that case, VBA should implicitly set the object variable to Nothing, decreasing the object reference count. If that causes the reference count to go to zero, the system should free the memory associated with it. When Isabelle mentioned a "memory leak", I was thinking of a systemic memory leak (defect in VBA), not a program memory leak (programming error). |
#26
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
On May 23, 11:56*am, joeu2004 wrote:
I do not notice any such memory leak in XL2003 SP3 with VBA 6.5.1024. [....] I tried the module below. Not good enough. That's not the memory leak we are concerned with. |
#27
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
Excellent example of the snippet I posted earlier!
I must re-iterate Rick's compliment to you about the effort and research you do to provide such detailed replies. Most awesome! I hope the casual posters appreciate those details as much as some of us regulars do. Personally, I was satisfied with the first sentence of your earlier post: "That's a matter of opinion..." Some people may just decide to make it their personal rule that any objects they create will be set "= Nothing" when done with so they are covered for all situations, *regardless of context*. That, of course, is entirely their choice.<g -- Garry Free usenet access at http://www.eternal-september.org ClassicVB Users Regroup! comp.lang.basic.visual.misc |
#28
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
On May 25, 2:24 pm, "Peter T" wrote:
I haven't fully followed this thread so apologies in advance if I've missed something, however I don't follow how this example relates to Isabelle's wanting to explicitly destroy worksheet references. You may be right. In a last-minute effort to simplify things, I removed optional code that I believe would have related to Isabelle's example better. The VBA help page states: "memory and system resources associated with the object to which the variables refer are released only after all of them have been set to Nothing, either explicitly using Set, or implicitly after the last object variable set to Nothing goes out of scope." I believe that says that non-static local object variables are implicitly set to Nothing when a procedure is exited. The better demonstration of that fact would be if in testit1st, I set the global object variable "a" to Nothing explicitly some time after returning from testit2, and use Task Manager to show that the memory is freed. That could only happen if the reference count for the object went to zero at that point, which could only happen if the testit2 reference were implicitly decreased when leaving the testit2 scope by implicitly setting the local object variable "a" to Nothing. In fact, I had done exactly that in one version of the example. "The exercise is left to the student". Not a very good example, I agree. I could clean all this up. But I think I'm beating a dead horse at this point. You wrote: FWIW issues can arise with classes and circular references (classes refer to each other), then they can be difficult if not impossible to tear down and release memory, but that's a different matter. Yes it is. Nonetheless, I had intended to demonstrate that very situation with a separate example anyway, primarily for GS's benefit. I forgot. You wrote: if the reference was to a sheet in an automated instance, failure to release could leave the Excel instance hanging in memory. Can you explain that with an example? I do not understand the terminology "automated instance" and how that can lead to a memory leak. One example that I believe would "leave the Excel instance hanging in memory" is if we Set a global object variable to a Worksheet or Range object and neglect to Set the global object variable to Nothing. That was really the point of the last step in my testit1st macro, which goes beyond the question I had for Isabelle. It is certainly the programmer's responsibility to know when it is necessary to Set a global object variable to Nothing. It depends on the program design. But that is not the kind of memory leak I thought Isabelle is referring to. Since the VBA help documentation explains that we can expect that exiting from the scope effectively sets non-static local object variables to Nothing (i.e. decrease the object reference count), I thought Isabelle might be referring to a __VBA__ defect, not a programming error. In any case, I believe that you and I are in "violent agreement" not only on when we __must__ use Set=Nothing, but also when it is really not necessary, notwithstanding some people's idea of "good practice". |
#29
Posted to microsoft.public.excel.programming
|
|||
|
|||
how to find union of two arrays
"joeu2004" wrote in message You wrote: if the reference was to a sheet in an automated instance, failure to release could leave the Excel instance hanging in memory. Can you explain that with an example? I do not understand the terminology "automated instance" and how that can lead to a memory leak. "Automation" is where code (in particular object variables) in one app refers to another application, whether after creating a new instance or grabbing an existing one. All code in say VB6 doing anything with Excel is "Automation", and any instance of Excel so referenced is an "automated instance". One example that I believe would "leave the Excel instance hanging in memory" is if we Set a global object variable to a Worksheet or Range object and neglect to Set the global object variable to Nothing. *Only* if the reference is to an automated instance, eg code in Excel1 refers to Excel2, otherwise no problem (good practice is another matter). Try this - Private mWs As Worksheet Sub Test1() Dim xl As Excel.Application Dim wb As Workbook Set xl = New Excel.Application Set wb = xl.Workbooks.Add Set mWs = wb.Worksheets(1) xl.Quit End Sub Sub CleanUp() Set mWs = Nothing End Sub Look at Processes in Task Manager (maybe sort "Image Name"), step through Test1 and see the 2nd Excel instance appear, and remain after doing xl.Quit. Step through CleanUp and see the 2nd instance removed as mWs is destroyed. Actually before doing CleanUp stop or break on 'end sub' and look at mWs in Locals. Now try this - Sub Test2() Dim wb As Workbook Set wb = Workbooks.Add Set mWs = wb.Worksheets(1) wb.Close False End Sub See the difference with mWs this time in Locals. mWs retains the now redundant pointer but that's all, no memory leak. However it's not good practice, eg "mWs = Not Nothing" would give a misleading result. Of course in this example no need to explicity release 'wb'. In any case, I believe that you and I are in "violent agreement" not only on when we __must__ use Set=Nothing, but also when it is really not necessary, notwithstanding some people's idea of "good practice". Violent, ?, -:) But sure, I think we said pretty much the same in response to Garry's quote from the book. Regards, Peter T |
Reply |
Thread Tools | Search this Thread |
Display Modes | |
|
|
Similar Threads | ||||
Thread | Forum | |||
Find Min and Max in variable length arrays | Excel Programming | |||
Compare two arrays in VBA to find dupes | Excel Programming | |||
Arrays - declaration, adding values to arrays and calculation | Excel Programming | |||
UNION of Arrays - is possible? | Excel Discussion (Misc queries) | |||
Where can I find a template for a Credit Union account? | Excel Discussion (Misc queries) |