View Single Post
  #4   Report Post  
Posted to microsoft.public.excel.misc
Ron Rosenfeld Ron Rosenfeld is offline
external usenet poster
 
Posts: 5,651
Default word frequency counting

On Fri, 7 Mar 2008 22:56:28 -0800 (PST), LAN MIND wrote:

Hello all,

I am trying to optimize my database. I need to find the top occurring
words on my excel file.

Is there any freeware or add-ons that can perform word frequency
counting on excel files?

Thanks for any help- Lan


Here's a UDF that should get you started. There may be more efficient methods,
but I was using some "new to me" techniques here.

It returns a two-dimensional array consisting of the unique words; and the
count of each of those unique words.

By the way, a "word" is defined as a collection of word characters (bounded by
a non-word character or the beginning or end of the line). A word character is
defined as being in the class of the alphabet (A-Za-z), a digit (0-9) or the
underscore (_). If this definition of "word" gives unwanted results, it can be
changed.

To enter this, <alt-F11 opens the VBEditor. Ensure your project is
highlighted in the project explorer window, then Insert/Module and paste the
code below into the window that opens.

THEN: Select Tools/References and set a reference to Microsoft VBScript
Regular Expressions 5.5

There are several ways to display the results.

Assuming your "data" is in A1:A3, enter a formula into some cell:

First word
D1: =INDEX(uniquecount($A$1:$A$3),1,ROWS($1:1))

Count of first word
E1: =INDEX(uniquecount($A$1:$A$3),2,ROWS($1:1))

Then select D1:E1 and fill down as far as required. If you go to far, you'll
see #REF errors.

This might be better for you sorted, but I don't have time to do that right
now.

Once you have the results, you can copy/paste-special Values to some other area
of your worksheet, and then sort on the values.

If this is going to be used frequently, a sort routine can be incorporated.

===============================================
Option Explicit
Function UniqueCount(rg As Range) As Variant
'Requires reference to Microsoft VBScript Regular Expressions 5.5
'Returns a two dimensional array of unique words and count
Dim cWordList As Collection
Dim Str As String
Dim sRes() As Variant
Dim i As Long, j As Long
Dim c As Range
Dim re As RegExp
Dim mc As MatchCollection, m As Match

'Put all words into a single string
For Each c In rg
Str = Str & c.Value & " "
Next c

'get list of unique words
Set re = New RegExp
re.Global = True
re.Pattern = "\b\w+\b"
Set cWordList = New Collection
On Error Resume Next
'Add method with index=word will give error on duplicates
Set mc = re.Execute(Str)
For Each m In mc
cWordList.Add m.Value, m.Value
Next m
On Error GoTo 0
ReDim sRes(0 To 1, 1 To cWordList.Count)
For i = 1 To cWordList.Count
sRes(0, i) = cWordList(i)
Next i

'get word count for each word
re.Global = True
re.IgnoreCase = True
For i = 1 To UBound(sRes, 2)
re.Pattern = "\b" & sRes(0, i) & "\b"
Set mc = re.Execute(Str)
sRes(1, i) = mc.Count
Next i
Set re = Nothing

UniqueCount = sRes
End Function
===============================
--ron