View Single Post
  #4   Report Post  
Jason Morin
 
Posts: n/a
Default

Well, with the examples you gave, my example worked
pretty well. It's obvious that you data is much more
inconsistent in format than your original post lead me to
believe.

What you're looking for is an extremely complex
algorithem to handle such wide disparities in the data.
Even then, you won't get close on some of them. I think
the solution lies upstream in the process, before
receiving the data. If you have any influence on your
sources, you would try to insert some type of consistency
in the way they report their data, so that you don't end
up in the mess you're in now.

I'd suggest applying what I've given you, enhance it as
much as you can, then suck it up and do a manual check on
the rest.

Jason

-----Original Message-----
"Jason Morin" wrote
in message ...

Not perfect either, but...with names in column A, place
this formula in row 1 of an open column and fill down:

=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN

(A1))),1)))

Then sort the data based on this column. Similar

strings
should be closely grouped.

HTH
Jason
Atlanta, GA


Hi Jason,

I like the simplicity of your approach - just adding the

code values
together.

However, if I change my example slightly and apply the

formula I get:

Alan B Chadd................978
Bob Charles...................1013
Chadd A B.....................663


I think that in a larger sample we would still find that

the first and
third lines are too far apart for a human to spot them

together.

The problem arises because it is very sensitive to the

number of
characters. Whether someone has, for example, entered a

middle name
or not, will affect the score too much. Similarly the

scores for the
following two versions of a company name are too far

apart:

ABC LTD...............458
abc limited...............1070


I have tried thinking about a way to modify your

approach to make it
work better under this circumstance. I could apply an

UPPER function
to each string first giving the improved result of:

ABC LTD...............458
ABC LIMITED........750


It can be further improved by subtracting 31 from the

code values (a
space is char(32) so let's just avoid negative results):

=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1) )),1))-

31)

This gives a further improvement:

ABC LTD........................241
ABC LIMITED...............409


But this is still likely to put them a long way apart in

a sort and it
*feels* to me that we are just compressing the

distribution of results
rather than improving the ability to identify

similarities.

Do you have any other ideas?

Thanks for your help,

Alan.






.