Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1   Report Post  
Posted to microsoft.public.excel.programming
Ray Ray is offline
external usenet poster
 
Posts: 267
Default Detect String Frequency on dynamic entries

Hi -

I'm performing a shopping cart analysis (if customers purchase product
A what else do they also purchase?) and have run into a bit of a
snag ...

I need a procedure that will return the most frequently occurring text
string found in a contiguous block of 4784 cells (A1:AT104). There
are over 10,000 possible values that could be in these cells, so
counting the occurrence of ALL of them wouldn't be that efficient ...

Additionally, a significant number of the cells in the target block
could contain the value 'end' or '#N/A' -- these values need to be
ignored.

Ideally, the solution would be scalable (allow users to select larger
or smaller range) and also allow for choosing the Most frequent, 2nd
most frequent, 3rd most frequent, etc values. So, something like:
=UDFname(range,3) would return the 3rd most frequently occurring text
string in the specified range.

Any ideas of how to do such a thing? The only solutions I could find
required either knowing what string value to look for or returning a
list of values (sorted one way or another).

Thanks for looking!

//ray
  #2   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 535
Default Detect String Frequency on dynamic entries

Hi Ray,

Ideally, the solution would be scalable (allow users to select larger
or smaller range) and also allow for choosing the Most frequent, 2nd
most frequent, 3rd most frequent, etc values. So, something like:
=UDFname(range,3) would return the 3rd most frequently occurring text
string in the specified range.


A bit convoluted, but it seems to work:

Option Explicit

Public Function CountTextLarge(RangeWithWords As Range, Optional lIndex As Long
= 1)
Dim vWords() As Variant
Dim vAllWords() As Variant
Dim oRng As Range
Dim sTest As String
Dim lCt As Long
Dim lAllCt As Long
Dim lWordCounts() As Long
Dim lCurIndex As Long

ReDim vWords(1 To 1)
ReDim vAllWords(1 To 1)
Dim lCurCt As Long
On Error Resume Next
For Each oRng In RangeWithWords.Cells
lAllCt = lAllCt + 1
ReDim Preserve vAllWords(1 To lAllCt)
vAllWords(lAllCt) = oRng.Value2
If Not IsIn(vWords, oRng.Value2) Then
lCt = lCt + 1
ReDim Preserve vWords(1 To lCt)
vWords(lCt) = oRng.Value2
End If
Next
ReDim lWordCounts(1 To UBound(vWords))
For lCt = 1 To UBound(vWords)
For lAllCt = 1 To UBound(vAllWords)
If vAllWords(lAllCt) = vWords(lCt) Then
lWordCounts(lCt) = lWordCounts(lCt) + 1
End If
Next
Next
CountTextLarge = Application.Large(lWordCounts, lIndex)
For lCt = 1 To UBound(lWordCounts)
If CountTextLarge = lWordCounts(lCt) Then
CountTextLarge = vWords(lCt)
Exit Function
End If
Next
End Function

Function IsIn(vCol As Variant, vVal As Variant) As Boolean
Dim lCt As Long
On Error Resume Next
For lCt = LBound(vCol) To UBound(vCol)
If vCol(lCt) = vVal Then
IsIn = True
Exit Function
End If
Next
End Function

Call like this:

=CountTextLarge(A2:A100,2)

Regards,

Jan Karel Pieterse
Excel MVP
http://www.jkp-ads.com
Member of:
Professional Office Developer Association
www.proofficedev.com

  #3   Report Post  
Posted to microsoft.public.excel.programming
Ray Ray is offline
external usenet poster
 
Posts: 267
Default Detect String Frequency on dynamic entries

Thanks VERY much ... works exactly as I described! However, in
testing it out, I ran across a wrinkle that I probably should've
recognized in the first place ... it's possible that the 'max
frequency' will be met by several items! For example, if the max
frequency is 10, there may be multiple items that each appear 10
times ...

As written, in this situation the code returns the last item
alphabetically -- doesn't appear to be written into the code that way,
just an 'unintended consequence'. I think a couple of more UDFs are
needed:
** One to return the max frequency (eg, 10)
** One to return how many items match the max frequency (eg, 5 ... so
5 items occur 10 times each)
** then a way to return each of the items that match the max frequency
(eg, listing each of the 5 items)

Currently, if I write:
=CountTextLarge(A1:F100,1)
I get the last item (alphabetically) that occurs the most

If I follow it up with
=CountTextLarge(A1:F100,2)
I'd actually get the SIXTH item (or, even worse, the last item
alphabetically with the 2nd largest frequency).

Hopefully, I've fully & clearly explained the new 'road-block' ... any
ideas?

thanks, ray

  #4   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 535
Default Detect String Frequency on dynamic entries

Hi Ray,

I ran across a wrinkle that I probably should've
recognized in the first place ... it's possible that the 'max
frequency' will be met by several items! For example, if the max
frequency is 10, there may be multiple items that each appear 10
times ...


Ah, indeed. Try this version instead:

Option Explicit

Public Function CountTextLarge(RangeWithWords As Range, Optional lIndex As Long
= 1)
Dim vWords() As Variant
Dim vAllWords() As Variant
Dim oRng As Range
Dim sTest As String
Dim lCt As Long
Dim lAllCt As Long
Dim lWordCounts() As Long
Dim lCurIndex As Long
Dim lLarger As Long
ReDim vWords(1 To 1)
ReDim vAllWords(1 To 1)
Dim lCurCt As Long
On Error Resume Next
For Each oRng In RangeWithWords.Cells
lAllCt = lAllCt + 1
ReDim Preserve vAllWords(1 To lAllCt)
vAllWords(lAllCt) = oRng.Value2
If Not IsIn(vWords, oRng.Value2) Then
lCt = lCt + 1
ReDim Preserve vWords(1 To lCt)
vWords(lCt) = oRng.Value2
End If
Next
ReDim lWordCounts(1 To UBound(vWords))
For lCt = 1 To UBound(vWords)
For lAllCt = 1 To UBound(vAllWords)
If vAllWords(lAllCt) = vWords(lCt) Then
lWordCounts(lCt) = lWordCounts(lCt) + 1
End If
Next
Next

CountTextLarge = Application.Large(lWordCounts, lIndex)
Debug.Print lIndex
Sort2Arrays lWordCounts, vWords, LBound(lWordCounts), UBound(lWordCounts),
False
For lCt = 1 To UBound(lWordCounts)
If CountTextLarge <= lWordCounts(lCt) Then
lLarger = lLarger + 1
End If
If CountTextLarge = lWordCounts(lCt) And lLarger = lIndex Then
CountTextLarge = vWords(lCt)
Exit Function
End If
Next
End Function

Function IsIn(vCol As Variant, vVal As Variant) As Boolean
Dim lCt As Long
On Error Resume Next
For lCt = LBound(vCol) To UBound(vCol)
If vCol(lCt) = vVal Then
IsIn = True
Exit Function
End If
Next
End Function

Sub Sort2Arrays(strArray As Variant, strArray2 As Variant, intBottom As Long,
intTop As Long, bAtoZ As Boolean)
Dim strPivot As String
Dim vTemp As Variant
Dim intBottomTemp As Long
Dim intTopTemp As Long
Debug.Print "Sort2"
intBottomTemp = intBottom
intTopTemp = intTop
strPivot = strArray((intBottom + intTop) \ 2)
While (intBottomTemp <= intTopTemp)
If bAtoZ Then
While (strArray(intBottomTemp) < strPivot And intBottomTemp <
intTop)
intBottomTemp = intBottomTemp + 1
Wend
Else
While (strArray(intBottomTemp) strPivot And intBottomTemp <
intTop)
intBottomTemp = intBottomTemp + 1
Wend
End If
If bAtoZ Then
While (strPivot < strArray(intTopTemp) And intTopTemp intBottom)
intTopTemp = intTopTemp - 1
Wend
Else
While (strPivot strArray(intTopTemp) And intTopTemp intBottom)
intTopTemp = intTopTemp - 1
Wend
End If
If intBottomTemp < intTopTemp Then
vTemp = strArray(intBottomTemp)
strArray(intBottomTemp) = strArray(intTopTemp)
strArray(intTopTemp) = vTemp
vTemp = strArray2(intBottomTemp)
strArray2(intBottomTemp) = strArray2(intTopTemp)
strArray2(intTopTemp) = vTemp
End If
If intBottomTemp <= intTopTemp Then
intBottomTemp = intBottomTemp + 1
intTopTemp = intTopTemp - 1
End If
Wend
'the function calls itself until everything is in good order
If (intBottom < intTopTemp) Then Sort2Arrays strArray, strArray2,
intBottom, intTopTemp, bAtoZ
If (intBottomTemp < intTop) Then Sort2Arrays strArray, strArray2,
intBottomTemp, intTop, bAtoZ
End Sub

Regards,

Jan Karel Pieterse
Excel MVP
http://www.jkp-ads.com
Member of:
Professional Office Developer Association
www.proofficedev.com

  #5   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default Detect String Frequency on dynamic entries

On Thu, 5 Nov 2009 03:57:32 -0800 (PST), Ray wrote:

Hi -

I'm performing a shopping cart analysis (if customers purchase product
A what else do they also purchase?) and have run into a bit of a
snag ...

I need a procedure that will return the most frequently occurring text
string found in a contiguous block of 4784 cells (A1:AT104). There
are over 10,000 possible values that could be in these cells, so
counting the occurrence of ALL of them wouldn't be that efficient ...

Additionally, a significant number of the cells in the target block
could contain the value 'end' or '#N/A' -- these values need to be
ignored.

Ideally, the solution would be scalable (allow users to select larger
or smaller range) and also allow for choosing the Most frequent, 2nd
most frequent, 3rd most frequent, etc values. So, something like:
=UDFname(range,3) would return the 3rd most frequently occurring text
string in the specified range.

Any ideas of how to do such a thing? The only solutions I could find
required either knowing what string value to look for or returning a
list of values (sorted one way or another).

Thanks for looking!

//ray


Here's one suggestion.

The routine creates a list of unique entries, and the frequency of each entry.
It then computes the nth most frequent entry and returns all of the strings
that match. (So if different strings have the same frequency, it will return
all of them)

If there are multiple matches, it returns the results in an array and you can
get at them using an Index function.

This works similar to the LARGE function in that if the frequency of two items
is the same, they will take up "two slots". In other words, if the avocados
and peaches both are the most frequent, with 10 occurrences, the will both be
returned in the event Index = 1 or Index = 2.

If this is not what you want, we could add some sorting to the algorithm and
make it return unique values.

================================================== ============
Function LargeStrings(rg As Range, Optional Index As Long = 1) As Variant
Dim cStrings As Collection
Dim c As Range
Dim temp(), tempResults()
Dim i As Long, j As Long

'get list of unique strings
Set cStrings = New Collection
On Error Resume Next
For Each c In rg
If Not (c.Text = "end" Or c.Text = "#NA") Then
cStrings.Add Item:=c.Value, Key:=c.Value
End If
Next c
On Error GoTo 0

If Index cStrings.Count Then
LargeStrings = CVErr(xlErrNum)
Exit Function
End If

'put into array with count
ReDim temp(0 To 1, 0 To cStrings.Count - 1)
ReDim tempCount(0 To cStrings.Count - 1)
ReDim tempItems(0 To cStrings.Count - 1)
For i = 1 To cStrings.Count
temp(0, i - 1) = cStrings(i)
temp(1, i - 1) = WorksheetFunction.CountIf(rg, cStrings(i))
Next i

'get desired index
j = WorksheetFunction.Large(temp, Index)
'put into array
ReDim tempResults(0)
For i = 0 To UBound(temp, 2)
If temp(1, i) = j Then
tempResults(UBound(tempResults)) = temp(0, i)
ReDim Preserve tempResults(UBound(tempResults) + 1)
End If
Next i
ReDim Preserve tempResults(UBound(tempResults) - 1)
If UBound(tempResults) = 0 Then
LargeStrings = tempResults(0)
Else
LargeStrings = tempResults
End If
End Function
===========================================
--ron


  #6   Report Post  
Posted to microsoft.public.excel.programming
Ray Ray is offline
external usenet poster
 
Posts: 267
Default Detect String Frequency on dynamic entries

Again, thanks very much for your time to respond -- both suggestions
appear quite involved, and I appreciate your efforts!

Jan Karel - your revised solution works well, esp when pared with
CountIf (in another cell) to reflect actual frequency ...

Ron, from the description, your solution sounds intriguing and quite
useful but I'm not sure how to implement/call the procedure. The
coding looks like a UDF, but this part of the description
*******************
The routine creates a list of unique entries, and the frequency of
each entry.
It then computes the nth most frequent entry and returns all of the
strings
that match. (So if different strings have the same frequency, it will
return
all of them)
*******************
seems to indicate that the coding would return a range of items (to
multiple cells), based on the user-provided index number.

Could you please clarify the intended implementation of your coding?

Thanks again ...

ray
  #7   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 535
Default Detect String Frequency on dynamic entries

Hi Ray,

Jan Karel - your revised solution works well, esp when pared with
CountIf (in another cell) to reflect actual frequency ...


Excellent. Glad I could help.

Regards,

Jan Karel Pieterse
Excel MVP
http://www.jkp-ads.com
Member of:
Professional Office Developer Association
www.proofficedev.com

  #8   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default Detect String Frequency on dynamic entries

On Fri, 6 Nov 2009 08:27:03 -0800 (PST), Ray wrote:

Again, thanks very much for your time to respond -- both suggestions
appear quite involved, and I appreciate your efforts!

Jan Karel - your revised solution works well, esp when pared with
CountIf (in another cell) to reflect actual frequency ...

Ron, from the description, your solution sounds intriguing and quite
useful but I'm not sure how to implement/call the procedure. The
coding looks like a UDF, but this part of the description
*******************
The routine creates a list of unique entries, and the frequency of
each entry.
It then computes the nth most frequent entry and returns all of the
strings
that match. (So if different strings have the same frequency, it will
return
all of them)
*******************
seems to indicate that the coding would return a range of items (to
multiple cells), based on the user-provided index number.

Could you please clarify the intended implementation of your coding?

Thanks again ...

ray


The function returns an array of values. For example, if you look at the
LINEST worksheet function in Excel, you can see that it returns an array of
values.

In order to display the results, you would "array-enter" the formula over
multiple cells (i.e. with <ctrl-shift-enter).

Or you could use the INDEX function to return each element of the array.

For example, given these entries in various cells in the range A1:I18

================================================== ==========
rugs

shampoo

shampoo potatotes end

#N/A
string beans potatoes
potatoes
potatoes string
beans
potatoes

holes
holes

================================================== ================

If you enter the formula:

=LargeStrings(A1:I18,2)

you would see displayed "shampoo"

However, it is really returning a horizontal array:

{"shampoo","string beans","holes"}

since all three have the 2nd largest frequency.

To "see" these you could either enter this as an array formula in at least
three cells (and if you wanted to see them in a vertical array you could
transpose the formula); or you could use the INDEX function, e.g.:

So if you wanted to display the results in a vertical array, you could enter a
formula like:

=IF(COUNTA(LargeStrings($A$1:$I$18,2))=ROWS($1:1) ,INDEX(LargeStrings($A$1:$I$18,2),ROWS($1:1)),"")

in some cell and fill down until you were returning blanks.

In the above example, if that formula were entered in A20 and you filled down,
you would see:

A20: shampoo
A21: string beans
A22: holes
--ron
Reply
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules

Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Counting the frequency of a text string Michael Styles Excel Discussion (Misc queries) 2 May 19th 09 07:08 PM
String Frequency Brad Excel Programming 2 February 3rd 09 08:32 AM
Frequency of a word string in an excel database Danny_McCaslin Excel Discussion (Misc queries) 2 December 13th 06 06:11 PM
how to Detect paragraph in cell in string? dm16s Excel Programming 1 June 22nd 05 09:22 PM
frequency of characters within a string Nick Excel Programming 4 July 24th 03 07:52 PM


All times are GMT +1. The time now is 11:56 PM.

Powered by vBulletin® Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 ExcelBanter.
The comments are property of their posters.
 

About Us

"It's about Microsoft Excel"