View Single Post
  #1   Report Post  
Alan
 
Posts: n/a
Default Evaluating similarity of text strings


Hi All,

I have a data list - roughly 2000 items in all, which are names
(people, companies, organisations etc) and an amount they have paid.

The data is collected from four separate sources and trimmed already
to
avoid extra spaces. The upshot is that there are instances where the
same name is entered more than one way. For example:

Alan B Chadd
Bob Charles
Chadd Alan B

Now a human eye can easily spot that the first and third are
(probably) the same person and it is worth investigating further.
This is fine when they are close to each other, but not practical
when they are two pages apart.

Therefore, I would like to write a formula that could give a score
that
ranks the likely similar entries.

I am thinking that something like this would be good:

1) Take each letter of the target name, and count how many times it
appears in every other of the 2000 entries ignoring capitals (it is
possible
someone will have typed in a name with or without any proper
capitalisation).

2) Add up the totals for each of the 2000 entries

3) Show any items with a score over X (to be picked by trial) or just
sort them by the score.


Example From Above:

The score for "Alan B Chadd" against "Bob Charles" would be:

A = 0
l = 1
a = 0
n = 0
Space = 1
B = 1
Space = 1
C = 1
h = 1
a = 0
d = 0
d = 0

Total = 5

The score for "Alan B Chadd" against "Chadd Alan B" would be:

A = 3
l = 1
a = 3
n = 1
Space = 2
B = 1
Space = 2
C = 1
h = 1
a = 3
d = 2
d = 2

Total = 22

Obviously it is not perfect, but it should be good enough to point the
human in the right direction!


Can anyone suggest a way to do this?

Thanks in advance,

Alan.