View Single Post
  #9   Report Post  
Posted to microsoft.public.excel.misc
LAN MIND LAN MIND is offline
external usenet poster
 
Posts: 7
Default word frequency counting

On Mar 9, 6:08 am, Ron Rosenfeld wrote:
On Sat, 08 Mar 2008 18:51:21 -0500, Ron Rosenfeld
wrote:

Assuming your "data" is in A1:A3, enter a formula into some cell:


First word
D1: =INDEX(uniquecount($A$1:$A$3),1,ROWS($1:1))


Count of first word
E1: =INDEX(uniquecount($A$1:$A$3),2,ROWS($1:1))


I've been doing some timing test, and the above method is very inefficient with
large numbers of words.

The problem is that, in setting up formulas as above, the UDF will be
recalculated for each formula.

The UDF itself takes a while. The part that takes the longest is getting the
count on each unique word; and the sorting routines also take a while.

I examined a web page that had 21,011 words of which 2,526 were unique.

The total time to run the UDF was about 27 seconds.
Getting the count took 14 seconds; the alpha sort took 8.625 seconds, and the
numeric sort 2.8 seconds.

Given that, it will be much more efficient to enter this function as an array,
of the appropriate size and shape. With that sort of entry, the UDF only needs
to be calculated once.

For example, if you wanted to know the ten most frequent words, you could

Select a 2 column x 10 row array
In the upper left cell, enter the formula:

=transpose(uniquecount(rng))

where "rng" is the cells containing the text.

Hold down <ctrl<shift while you hit <enter.

The formula will fill the entire area, and will only need to calculate once.
--ron


Wow Ron thanks for all that code. I haven't had anytime lately to go
over much of it but I will tonight. Truthfully I'm lost on almost all
of what you've shown as I'm not a programmer and I have very little
experience with excel. A slow count doesn't bother me much as this
will be done locally for in house results.

Thanks : )