View Single Post
  #3   Report Post  
Alan
 
Posts: n/a
Default

"Jason Morin" wrote
in message ...

Not perfect either, but...with names in column A, place
this formula in row 1 of an open column and fill down:

=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1)) ),1)))

Then sort the data based on this column. Similar strings
should be closely grouped.

HTH
Jason
Atlanta, GA


Hi Jason,

I like the simplicity of your approach - just adding the code values
together.

However, if I change my example slightly and apply the formula I get:

Alan B Chadd................978
Bob Charles...................1013
Chadd A B.....................663


I think that in a larger sample we would still find that the first and
third lines are too far apart for a human to spot them together.

The problem arises because it is very sensitive to the number of
characters. Whether someone has, for example, entered a middle name
or not, will affect the score too much. Similarly the scores for the
following two versions of a company name are too far apart:

ABC LTD...............458
abc limited...............1070


I have tried thinking about a way to modify your approach to make it
work better under this circumstance. I could apply an UPPER function
to each string first giving the improved result of:

ABC LTD...............458
ABC LIMITED........750


It can be further improved by subtracting 31 from the code values (a
space is char(32) so let's just avoid negative results):

=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1)) ),1))-31)

This gives a further improvement:

ABC LTD........................241
ABC LIMITED...............409


But this is still likely to put them a long way apart in a sort and it
*feels* to me that we are just compressing the distribution of results
rather than improving the ability to identify similarities.

Do you have any other ideas?

Thanks for your help,

Alan.