Home |
Search |
Today's Posts |
#1
Posted to microsoft.public.excel.programming
|
|||
|
|||
Detect String Frequency on dynamic entries
Hi -
I'm performing a shopping cart analysis (if customers purchase product A what else do they also purchase?) and have run into a bit of a snag ... I need a procedure that will return the most frequently occurring text string found in a contiguous block of 4784 cells (A1:AT104). There are over 10,000 possible values that could be in these cells, so counting the occurrence of ALL of them wouldn't be that efficient ... Additionally, a significant number of the cells in the target block could contain the value 'end' or '#N/A' -- these values need to be ignored. Ideally, the solution would be scalable (allow users to select larger or smaller range) and also allow for choosing the Most frequent, 2nd most frequent, 3rd most frequent, etc values. So, something like: =UDFname(range,3) would return the 3rd most frequently occurring text string in the specified range. Any ideas of how to do such a thing? The only solutions I could find required either knowing what string value to look for or returning a list of values (sorted one way or another). Thanks for looking! //ray |
#2
Posted to microsoft.public.excel.programming
|
|||
|
|||
Detect String Frequency on dynamic entries
Hi Ray,
Ideally, the solution would be scalable (allow users to select larger or smaller range) and also allow for choosing the Most frequent, 2nd most frequent, 3rd most frequent, etc values. So, something like: =UDFname(range,3) would return the 3rd most frequently occurring text string in the specified range. A bit convoluted, but it seems to work: Option Explicit Public Function CountTextLarge(RangeWithWords As Range, Optional lIndex As Long = 1) Dim vWords() As Variant Dim vAllWords() As Variant Dim oRng As Range Dim sTest As String Dim lCt As Long Dim lAllCt As Long Dim lWordCounts() As Long Dim lCurIndex As Long ReDim vWords(1 To 1) ReDim vAllWords(1 To 1) Dim lCurCt As Long On Error Resume Next For Each oRng In RangeWithWords.Cells lAllCt = lAllCt + 1 ReDim Preserve vAllWords(1 To lAllCt) vAllWords(lAllCt) = oRng.Value2 If Not IsIn(vWords, oRng.Value2) Then lCt = lCt + 1 ReDim Preserve vWords(1 To lCt) vWords(lCt) = oRng.Value2 End If Next ReDim lWordCounts(1 To UBound(vWords)) For lCt = 1 To UBound(vWords) For lAllCt = 1 To UBound(vAllWords) If vAllWords(lAllCt) = vWords(lCt) Then lWordCounts(lCt) = lWordCounts(lCt) + 1 End If Next Next CountTextLarge = Application.Large(lWordCounts, lIndex) For lCt = 1 To UBound(lWordCounts) If CountTextLarge = lWordCounts(lCt) Then CountTextLarge = vWords(lCt) Exit Function End If Next End Function Function IsIn(vCol As Variant, vVal As Variant) As Boolean Dim lCt As Long On Error Resume Next For lCt = LBound(vCol) To UBound(vCol) If vCol(lCt) = vVal Then IsIn = True Exit Function End If Next End Function Call like this: =CountTextLarge(A2:A100,2) Regards, Jan Karel Pieterse Excel MVP http://www.jkp-ads.com Member of: Professional Office Developer Association www.proofficedev.com |
#3
Posted to microsoft.public.excel.programming
|
|||
|
|||
Detect String Frequency on dynamic entries
Thanks VERY much ... works exactly as I described! However, in
testing it out, I ran across a wrinkle that I probably should've recognized in the first place ... it's possible that the 'max frequency' will be met by several items! For example, if the max frequency is 10, there may be multiple items that each appear 10 times ... As written, in this situation the code returns the last item alphabetically -- doesn't appear to be written into the code that way, just an 'unintended consequence'. I think a couple of more UDFs are needed: ** One to return the max frequency (eg, 10) ** One to return how many items match the max frequency (eg, 5 ... so 5 items occur 10 times each) ** then a way to return each of the items that match the max frequency (eg, listing each of the 5 items) Currently, if I write: =CountTextLarge(A1:F100,1) I get the last item (alphabetically) that occurs the most If I follow it up with =CountTextLarge(A1:F100,2) I'd actually get the SIXTH item (or, even worse, the last item alphabetically with the 2nd largest frequency). Hopefully, I've fully & clearly explained the new 'road-block' ... any ideas? thanks, ray |
#4
Posted to microsoft.public.excel.programming
|
|||
|
|||
Detect String Frequency on dynamic entries
Hi Ray,
I ran across a wrinkle that I probably should've recognized in the first place ... it's possible that the 'max frequency' will be met by several items! For example, if the max frequency is 10, there may be multiple items that each appear 10 times ... Ah, indeed. Try this version instead: Option Explicit Public Function CountTextLarge(RangeWithWords As Range, Optional lIndex As Long = 1) Dim vWords() As Variant Dim vAllWords() As Variant Dim oRng As Range Dim sTest As String Dim lCt As Long Dim lAllCt As Long Dim lWordCounts() As Long Dim lCurIndex As Long Dim lLarger As Long ReDim vWords(1 To 1) ReDim vAllWords(1 To 1) Dim lCurCt As Long On Error Resume Next For Each oRng In RangeWithWords.Cells lAllCt = lAllCt + 1 ReDim Preserve vAllWords(1 To lAllCt) vAllWords(lAllCt) = oRng.Value2 If Not IsIn(vWords, oRng.Value2) Then lCt = lCt + 1 ReDim Preserve vWords(1 To lCt) vWords(lCt) = oRng.Value2 End If Next ReDim lWordCounts(1 To UBound(vWords)) For lCt = 1 To UBound(vWords) For lAllCt = 1 To UBound(vAllWords) If vAllWords(lAllCt) = vWords(lCt) Then lWordCounts(lCt) = lWordCounts(lCt) + 1 End If Next Next CountTextLarge = Application.Large(lWordCounts, lIndex) Debug.Print lIndex Sort2Arrays lWordCounts, vWords, LBound(lWordCounts), UBound(lWordCounts), False For lCt = 1 To UBound(lWordCounts) If CountTextLarge <= lWordCounts(lCt) Then lLarger = lLarger + 1 End If If CountTextLarge = lWordCounts(lCt) And lLarger = lIndex Then CountTextLarge = vWords(lCt) Exit Function End If Next End Function Function IsIn(vCol As Variant, vVal As Variant) As Boolean Dim lCt As Long On Error Resume Next For lCt = LBound(vCol) To UBound(vCol) If vCol(lCt) = vVal Then IsIn = True Exit Function End If Next End Function Sub Sort2Arrays(strArray As Variant, strArray2 As Variant, intBottom As Long, intTop As Long, bAtoZ As Boolean) Dim strPivot As String Dim vTemp As Variant Dim intBottomTemp As Long Dim intTopTemp As Long Debug.Print "Sort2" intBottomTemp = intBottom intTopTemp = intTop strPivot = strArray((intBottom + intTop) \ 2) While (intBottomTemp <= intTopTemp) If bAtoZ Then While (strArray(intBottomTemp) < strPivot And intBottomTemp < intTop) intBottomTemp = intBottomTemp + 1 Wend Else While (strArray(intBottomTemp) strPivot And intBottomTemp < intTop) intBottomTemp = intBottomTemp + 1 Wend End If If bAtoZ Then While (strPivot < strArray(intTopTemp) And intTopTemp intBottom) intTopTemp = intTopTemp - 1 Wend Else While (strPivot strArray(intTopTemp) And intTopTemp intBottom) intTopTemp = intTopTemp - 1 Wend End If If intBottomTemp < intTopTemp Then vTemp = strArray(intBottomTemp) strArray(intBottomTemp) = strArray(intTopTemp) strArray(intTopTemp) = vTemp vTemp = strArray2(intBottomTemp) strArray2(intBottomTemp) = strArray2(intTopTemp) strArray2(intTopTemp) = vTemp End If If intBottomTemp <= intTopTemp Then intBottomTemp = intBottomTemp + 1 intTopTemp = intTopTemp - 1 End If Wend 'the function calls itself until everything is in good order If (intBottom < intTopTemp) Then Sort2Arrays strArray, strArray2, intBottom, intTopTemp, bAtoZ If (intBottomTemp < intTop) Then Sort2Arrays strArray, strArray2, intBottomTemp, intTop, bAtoZ End Sub Regards, Jan Karel Pieterse Excel MVP http://www.jkp-ads.com Member of: Professional Office Developer Association www.proofficedev.com |
#5
Posted to microsoft.public.excel.programming
|
|||
|
|||
Detect String Frequency on dynamic entries
On Thu, 5 Nov 2009 03:57:32 -0800 (PST), Ray wrote:
Hi - I'm performing a shopping cart analysis (if customers purchase product A what else do they also purchase?) and have run into a bit of a snag ... I need a procedure that will return the most frequently occurring text string found in a contiguous block of 4784 cells (A1:AT104). There are over 10,000 possible values that could be in these cells, so counting the occurrence of ALL of them wouldn't be that efficient ... Additionally, a significant number of the cells in the target block could contain the value 'end' or '#N/A' -- these values need to be ignored. Ideally, the solution would be scalable (allow users to select larger or smaller range) and also allow for choosing the Most frequent, 2nd most frequent, 3rd most frequent, etc values. So, something like: =UDFname(range,3) would return the 3rd most frequently occurring text string in the specified range. Any ideas of how to do such a thing? The only solutions I could find required either knowing what string value to look for or returning a list of values (sorted one way or another). Thanks for looking! //ray Here's one suggestion. The routine creates a list of unique entries, and the frequency of each entry. It then computes the nth most frequent entry and returns all of the strings that match. (So if different strings have the same frequency, it will return all of them) If there are multiple matches, it returns the results in an array and you can get at them using an Index function. This works similar to the LARGE function in that if the frequency of two items is the same, they will take up "two slots". In other words, if the avocados and peaches both are the most frequent, with 10 occurrences, the will both be returned in the event Index = 1 or Index = 2. If this is not what you want, we could add some sorting to the algorithm and make it return unique values. ================================================== ============ Function LargeStrings(rg As Range, Optional Index As Long = 1) As Variant Dim cStrings As Collection Dim c As Range Dim temp(), tempResults() Dim i As Long, j As Long 'get list of unique strings Set cStrings = New Collection On Error Resume Next For Each c In rg If Not (c.Text = "end" Or c.Text = "#NA") Then cStrings.Add Item:=c.Value, Key:=c.Value End If Next c On Error GoTo 0 If Index cStrings.Count Then LargeStrings = CVErr(xlErrNum) Exit Function End If 'put into array with count ReDim temp(0 To 1, 0 To cStrings.Count - 1) ReDim tempCount(0 To cStrings.Count - 1) ReDim tempItems(0 To cStrings.Count - 1) For i = 1 To cStrings.Count temp(0, i - 1) = cStrings(i) temp(1, i - 1) = WorksheetFunction.CountIf(rg, cStrings(i)) Next i 'get desired index j = WorksheetFunction.Large(temp, Index) 'put into array ReDim tempResults(0) For i = 0 To UBound(temp, 2) If temp(1, i) = j Then tempResults(UBound(tempResults)) = temp(0, i) ReDim Preserve tempResults(UBound(tempResults) + 1) End If Next i ReDim Preserve tempResults(UBound(tempResults) - 1) If UBound(tempResults) = 0 Then LargeStrings = tempResults(0) Else LargeStrings = tempResults End If End Function =========================================== --ron |
#6
Posted to microsoft.public.excel.programming
|
|||
|
|||
Detect String Frequency on dynamic entries
Again, thanks very much for your time to respond -- both suggestions
appear quite involved, and I appreciate your efforts! Jan Karel - your revised solution works well, esp when pared with CountIf (in another cell) to reflect actual frequency ... Ron, from the description, your solution sounds intriguing and quite useful but I'm not sure how to implement/call the procedure. The coding looks like a UDF, but this part of the description ******************* The routine creates a list of unique entries, and the frequency of each entry. It then computes the nth most frequent entry and returns all of the strings that match. (So if different strings have the same frequency, it will return all of them) ******************* seems to indicate that the coding would return a range of items (to multiple cells), based on the user-provided index number. Could you please clarify the intended implementation of your coding? Thanks again ... ray |
#7
Posted to microsoft.public.excel.programming
|
|||
|
|||
Detect String Frequency on dynamic entries
Hi Ray,
Jan Karel - your revised solution works well, esp when pared with CountIf (in another cell) to reflect actual frequency ... Excellent. Glad I could help. Regards, Jan Karel Pieterse Excel MVP http://www.jkp-ads.com Member of: Professional Office Developer Association www.proofficedev.com |
#8
Posted to microsoft.public.excel.programming
|
|||
|
|||
Detect String Frequency on dynamic entries
On Fri, 6 Nov 2009 08:27:03 -0800 (PST), Ray wrote:
Again, thanks very much for your time to respond -- both suggestions appear quite involved, and I appreciate your efforts! Jan Karel - your revised solution works well, esp when pared with CountIf (in another cell) to reflect actual frequency ... Ron, from the description, your solution sounds intriguing and quite useful but I'm not sure how to implement/call the procedure. The coding looks like a UDF, but this part of the description ******************* The routine creates a list of unique entries, and the frequency of each entry. It then computes the nth most frequent entry and returns all of the strings that match. (So if different strings have the same frequency, it will return all of them) ******************* seems to indicate that the coding would return a range of items (to multiple cells), based on the user-provided index number. Could you please clarify the intended implementation of your coding? Thanks again ... ray The function returns an array of values. For example, if you look at the LINEST worksheet function in Excel, you can see that it returns an array of values. In order to display the results, you would "array-enter" the formula over multiple cells (i.e. with <ctrl-shift-enter). Or you could use the INDEX function to return each element of the array. For example, given these entries in various cells in the range A1:I18 ================================================== ========== rugs shampoo shampoo potatotes end #N/A string beans potatoes potatoes potatoes string beans potatoes holes holes ================================================== ================ If you enter the formula: =LargeStrings(A1:I18,2) you would see displayed "shampoo" However, it is really returning a horizontal array: {"shampoo","string beans","holes"} since all three have the 2nd largest frequency. To "see" these you could either enter this as an array formula in at least three cells (and if you wanted to see them in a vertical array you could transpose the formula); or you could use the INDEX function, e.g.: So if you wanted to display the results in a vertical array, you could enter a formula like: =IF(COUNTA(LargeStrings($A$1:$I$18,2))=ROWS($1:1) ,INDEX(LargeStrings($A$1:$I$18,2),ROWS($1:1)),"") in some cell and fill down until you were returning blanks. In the above example, if that formula were entered in A20 and you filled down, you would see: A20: shampoo A21: string beans A22: holes --ron |
Reply |
Thread Tools | Search this Thread |
Display Modes | |
|
|
Similar Threads | ||||
Thread | Forum | |||
Counting the frequency of a text string | Excel Discussion (Misc queries) | |||
String Frequency | Excel Programming | |||
Frequency of a word string in an excel database | Excel Discussion (Misc queries) | |||
how to Detect paragraph in cell in string? | Excel Programming | |||
frequency of characters within a string | Excel Programming |