"Jason Morin" wrote
in message ...
Not perfect either, but...with names in column A, place
this formula in row 1 of an open column and fill down:
=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1)) ),1)))
Then sort the data based on this column. Similar strings
should be closely grouped.
HTH
Jason
Atlanta, GA
Hi Jason,
I like the simplicity of your approach - just adding the code values
together.
However, if I change my example slightly and apply the formula I get:
Alan B Chadd................978
Bob Charles...................1013
Chadd A B.....................663
I think that in a larger sample we would still find that the first and
third lines are too far apart for a human to spot them together.
The problem arises because it is very sensitive to the number of
characters. Whether someone has, for example, entered a middle name
or not, will affect the score too much. Similarly the scores for the
following two versions of a company name are too far apart:
ABC LTD...............458
abc limited...............1070
I have tried thinking about a way to modify your approach to make it
work better under this circumstance. I could apply an UPPER function
to each string first giving the improved result of:
ABC LTD...............458
ABC LIMITED........750
It can be further improved by subtracting 31 from the code values (a
space is char(32) so let's just avoid negative results):
=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1)) ),1))-31)
This gives a further improvement:
ABC LTD........................241
ABC LIMITED...............409
But this is still likely to put them a long way apart in a sort and it
*feels* to me that we are just compressing the distribution of results
rather than improving the ability to identify similarities.
Do you have any other ideas?
Thanks for your help,
Alan.
|