View Single Post
  #10   Report Post  
Posted to microsoft.public.excel.misc
Ron Rosenfeld Ron Rosenfeld is offline
external usenet poster
 
Posts: 5,651
Default How to get a numbered list of unique words in a column?

On Thu, 18 Jun 2009 08:36:02 -0700, J741
wrote:

Hi.

I have a spreadsheet with almost 5000 rows with about 15 columns of data, 2
columns of which contain sentences, phrases, or paragraphs of comments
entered by users.

What I need to have as an end result is a list of all unique words found in
those two columns, along with the number of occurrences of each word. I will
then use this to further analyze the data in the spreadsheet.

Right now, I do this manually and it takes a VERY long time for me to do so
(over 200 man-hours). So, I really need to automate this somehow.

The problem is, I have no idea how to begin, because the words will not be
by themselves in their own cell (so I cant use the auto filter functions,
pivot tables, or anything else I can think of), but will be part of groups of
words or symbols within multiple cells.

I think this will need to be done programmatically with code, scripts,
macros, or some other method with which I am not familiar, but I am just not
sure.

Can anyone help me with this? Can anyone point me in the right direction?
Is there anything already built in to excel to do this?

- James.


Here's a start.

You'll need to properly set the range to process (rSrc in the code) and the
range where you want the results (rDest in the code) to match your sheets.

For example, you might set rsrc = range("A1:B5000") to encompass 5000 lines in
two columns.

And, of course, you'll need to set rDest = some cell that is outside your data
range.

I assumed you wanted the results sorted by frequency of the word, with the most
common word being first; but this can be easily changed.

Note also that I formatted the entire first column of rDest as TEXT. Without
this, any numeric strings in the data would be changed to numbers (if we left
the format as General). So long strings might get truncated, or displayed in
scientific notation; and leading zeros would be stripped off.

Also, for this initial example, words are defined as strings containing only
letters, digits, slash or a hyphen. This is done in order to remove
punctuation. But it will also remove other substrings that might include other
characters. If this will be an issue, changes can be easily made.

To enter this Macro (Sub), <alt-F11 opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code below into the window that opens.

To use this Macro (Sub), <alt-F8 opens the macro dialog box. Select the macro
by name, and <RUN.

========================================
Option Explicit
Sub UniqueWordList()
Dim rSrc As Range, rDest As Range, c As Range
Dim cWordList As Collection
Dim res() As Variant
Dim w() As String
Dim i As Long

Set cWordList = New Collection
Set rSrc = Range("A1:B22")
Set rDest = Range("M1")
rDest.EntireColumn.NumberFormat = "@"
For Each c In rSrc
w = Split(c.Value)
For i = 0 To UBound(w)
w(i) = StripWord(w(i))
If Not w(i) = "" Then
On Error Resume Next
cWordList.Add Item:=w(i), Key:=w(i)
On Error GoTo 0
End If
Next i
Next c

'transfer words to results array
ReDim res(1 To cWordList.Count, 0 To 1)
For i = 1 To cWordList.Count
res(i, 0) = cWordList(i)
Next i

'get counts
For i = LBound(res) To UBound(res)
For Each c In rSrc
res(i, 1) = res(i, 1) + CountWord(c.Value, res(i, 0))
Next c
Next i

'sort alpha: d=0; sort numeric d=1
'there are various ways of sorting
BubbleSort res, 1
For i = LBound(res) To UBound(res)
rDest.Offset(i, 0).Value = res(i, 0)
rDest.Offset(i, 1).Value = res(i, 1)
Next i
End Sub
Private Function StripWord(s As String) As String
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
'allow only letters, digits, slashes and hyphens
re.Pattern = "[^-/A-Za-z0-9]"
StripWord = re.Replace(s, "")
Set re = Nothing
End Function
Private Function CountWord(ByVal s As String, sPat) As Long
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.Pattern = "\b" & sPat & "\b"

Set mc = re.Execute(s)
CountWord = mc.Count
End Function
Private Sub BubbleSort(TempArray As Variant, d As Long) 'd is 0 based dimension
Dim temp(0, 1) As Variant
Dim i As Integer
Dim NoExchanges As Integer

' Loop until no more "exchanges" are made.
Do
NoExchanges = True

' Loop through each element in the array.
For i = LBound(TempArray) To UBound(TempArray) - 1

' If the element is less than the element
' following it, exchange the two elements.
' change "<" to "" to sort ascending
If TempArray(i, d) < TempArray(i + 1, d) Then
NoExchanges = False
temp(0, 0) = TempArray(i, 0)
temp(0, 1) = TempArray(i, 1)
TempArray(i, 0) = TempArray(i + 1, 0)
TempArray(i, 1) = TempArray(i + 1, 1)
TempArray(i + 1, 0) = temp(0, 0)
TempArray(i + 1, 1) = temp(0, 1)

End If
Next i
Loop While Not (NoExchanges)
End Sub
======================================
--ron