View Single Post
  #8   Report Post  
Posted to microsoft.public.excel.misc
Ron Rosenfeld Ron Rosenfeld is offline
external usenet poster
 
Posts: 5,651
Default word frequency counting

On Sat, 08 Mar 2008 18:51:21 -0500, Ron Rosenfeld
wrote:

Assuming your "data" is in A1:A3, enter a formula into some cell:

First word
D1: =INDEX(uniquecount($A$1:$A$3),1,ROWS($1:1))

Count of first word
E1: =INDEX(uniquecount($A$1:$A$3),2,ROWS($1:1))


I've been doing some timing test, and the above method is very inefficient with
large numbers of words.

The problem is that, in setting up formulas as above, the UDF will be
recalculated for each formula.

The UDF itself takes a while. The part that takes the longest is getting the
count on each unique word; and the sorting routines also take a while.

I examined a web page that had 21,011 words of which 2,526 were unique.

The total time to run the UDF was about 27 seconds.
Getting the count took 14 seconds; the alpha sort took 8.625 seconds, and the
numeric sort 2.8 seconds.

Given that, it will be much more efficient to enter this function as an array,
of the appropriate size and shape. With that sort of entry, the UDF only needs
to be calculated once.

For example, if you wanted to know the ten most frequent words, you could

Select a 2 column x 10 row array
In the upper left cell, enter the formula:

=transpose(uniquecount(rng))

where "rng" is the cells containing the text.

Hold down <ctrl<shift while you hit <enter.

The formula will fill the entire area, and will only need to calculate once.
--ron