View Single Post
  #5   Report Post  
Posted to microsoft.public.excel.misc
Ron Rosenfeld Ron Rosenfeld is offline
external usenet poster
 
Posts: 5,651
Default word frequency counting

On Sat, 08 Mar 2008 18:51:21 -0500, Ron Rosenfeld
wrote:

On Fri, 7 Mar 2008 22:56:28 -0800 (PST), LAN MIND wrote:

Hello all,

I am trying to optimize my database. I need to find the top occurring
words on my excel file.

Is there any freeware or add-ons that can perform word frequency
counting on excel files?

Thanks for any help- Lan


Here's a UDF that should get you started. There may be more efficient methods,
but I was using some "new to me" techniques here.

It returns a two-dimensional array consisting of the unique words; and the
count of each of those unique words.

By the way, a "word" is defined as a collection of word characters (bounded by
a non-word character or the beginning or end of the line). A word character is
defined as being in the class of the alphabet (A-Za-z), a digit (0-9) or the
underscore (_). If this definition of "word" gives unwanted results, it can be
changed.

To enter this, <alt-F11 opens the VBEditor. Ensure your project is
highlighted in the project explorer window, then Insert/Module and paste the
code below into the window that opens.

THEN: Select Tools/References and set a reference to Microsoft VBScript
Regular Expressions 5.5

There are several ways to display the results.

Assuming your "data" is in A1:A3, enter a formula into some cell:

First word
D1: =INDEX(uniquecount($A$1:$A$3),1,ROWS($1:1))

Count of first word
E1: =INDEX(uniquecount($A$1:$A$3),2,ROWS($1:1))

Then select D1:E1 and fill down as far as required. If you go to far, you'll
see #REF errors.

This might be better for you sorted, but I don't have time to do that right
now.

Once you have the results, you can copy/paste-special Values to some other area
of your worksheet, and then sort on the values.

If this is going to be used frequently, a sort routine can be incorporated.

===============================================
Option Explicit
Function UniqueCount(rg As Range) As Variant
'Requires reference to Microsoft VBScript Regular Expressions 5.5
'Returns a two dimensional array of unique words and count
Dim cWordList As Collection
Dim Str As String
Dim sRes() As Variant
Dim i As Long, j As Long
Dim c As Range
Dim re As RegExp
Dim mc As MatchCollection, m As Match

'Put all words into a single string
For Each c In rg
Str = Str & c.Value & " "
Next c

'get list of unique words
Set re = New RegExp
re.Global = True
re.Pattern = "\b\w+\b"
Set cWordList = New Collection
On Error Resume Next
'Add method with index=word will give error on duplicates
Set mc = re.Execute(Str)
For Each m In mc
cWordList.Add m.Value, m.Value
Next m
On Error GoTo 0
ReDim sRes(0 To 1, 1 To cWordList.Count)
For i = 1 To cWordList.Count
sRes(0, i) = cWordList(i)
Next i

'get word count for each word
re.Global = True
re.IgnoreCase = True
For i = 1 To UBound(sRes, 2)
re.Pattern = "\b" & sRes(0, i) & "\b"
Set mc = re.Execute(Str)
sRes(1, i) = mc.Count
Next i
Set re = Nothing

UniqueCount = sRes
End Function
===============================
--ron


With a little more fooling around, I modified the above to include words with
apostrophe's; and also did a double sort so the most common words would at the
top; and the subsort would be alphabetical.

If you don't want the results sorted, just comment out one or both of the two
sorting lines below.

=========================================
Option Explicit
Option Compare Text
Function UniqueCount(rg As Range) As Variant
'Requires reference to Microsoft VBScript Regular Expressions 5.5
'Returns a two dimensional array of unique words and count
Dim cWordList As Collection
Dim Str As String
Dim sRes() As Variant
Dim i As Long, j As Long
Dim c As Range
Dim re As RegExp
Dim mc As MatchCollection, m As Match

'Put all words into a single string
For Each c In rg
Str = Str & c.Value & " "
Next c

'get list of unique words
Set re = New RegExp
re.Global = True
re.Pattern = "\b[\w']+\b"
Set cWordList = New Collection
On Error Resume Next
'Add method with index=word will give error on duplicates
Set mc = re.Execute(Str)
For Each m In mc
cWordList.Add m.Value, m.Value
Next m
On Error GoTo 0
ReDim sRes(0 To 1, 1 To cWordList.Count)
For i = 1 To cWordList.Count
sRes(0, i) = cWordList(i)
Next i

'get word count for each word
re.Global = True
re.IgnoreCase = True
For i = 1 To UBound(sRes, 2)
re.Pattern = "\b" & sRes(0, i) & "\b"
Set mc = re.Execute(Str)
sRes(1, i) = mc.Count
Next i
Set re = Nothing

'you can comment out one or both of the sort lines
' depending on your requirements

'Sort words alphabetically A-Z
BubbleSort sRes, 0, True

'then sort by Count highest to lowest
BubbleSort sRes, 1, False

UniqueCount = sRes
End Function
'--------------------------------------------------------------
Private Sub BubbleSort(TempArray As Variant, d As Long, _
bSortDirection As Boolean)
'bSortDirection = True means sort ascending
'bSortDirection = False means sort descending
Dim Temp1 As Variant, Temp2
Dim i As Long
Dim NoExchanges As Boolean
Dim Exchange As Boolean

' Loop until no more "exchanges" are made.
Do
NoExchanges = True

' Loop through each element in the array.
For i = 1 To UBound(TempArray, 2) - 1

' If the element is greater/less than the element
' following it, exchange the two elements.

Exchange = IIf(TempArray(d, i) < TempArray(d, i + 1), True, False)
If bSortDirection = True Then Exchange = Not Exchange
If Exchange Then
NoExchanges = False
Temp1 = TempArray(0, i)
Temp2 = TempArray(1, i)
TempArray(0, i) = TempArray(0, i + 1)
TempArray(1, i) = TempArray(1, i + 1)
TempArray(0, i + 1) = Temp1
TempArray(1, i + 1) = Temp2
End If
Next i
Loop While Not (NoExchanges)
End Sub
============================================
--ron