Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1   Report Post  
Alan
 
Posts: n/a
Default Evaluating similarity of text strings


Hi All,

I have a data list - roughly 2000 items in all, which are names
(people, companies, organisations etc) and an amount they have paid.

The data is collected from four separate sources and trimmed already
to
avoid extra spaces. The upshot is that there are instances where the
same name is entered more than one way. For example:

Alan B Chadd
Bob Charles
Chadd Alan B

Now a human eye can easily spot that the first and third are
(probably) the same person and it is worth investigating further.
This is fine when they are close to each other, but not practical
when they are two pages apart.

Therefore, I would like to write a formula that could give a score
that
ranks the likely similar entries.

I am thinking that something like this would be good:

1) Take each letter of the target name, and count how many times it
appears in every other of the 2000 entries ignoring capitals (it is
possible
someone will have typed in a name with or without any proper
capitalisation).

2) Add up the totals for each of the 2000 entries

3) Show any items with a score over X (to be picked by trial) or just
sort them by the score.


Example From Above:

The score for "Alan B Chadd" against "Bob Charles" would be:

A = 0
l = 1
a = 0
n = 0
Space = 1
B = 1
Space = 1
C = 1
h = 1
a = 0
d = 0
d = 0

Total = 5

The score for "Alan B Chadd" against "Chadd Alan B" would be:

A = 3
l = 1
a = 3
n = 1
Space = 2
B = 1
Space = 2
C = 1
h = 1
a = 3
d = 2
d = 2

Total = 22

Obviously it is not perfect, but it should be good enough to point the
human in the right direction!


Can anyone suggest a way to do this?

Thanks in advance,

Alan.




  #2   Report Post  
Jason Morin
 
Posts: n/a
Default

Not perfect either, but...with names in column A, place
this formula in row 1 of an open column and fill down:

=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1)) ),1)))

Then sort the data based on this column. Similar strings
should be closely grouped.

HTH
Jason
Atlanta, GA

-----Original Message-----

Hi All,

I have a data list - roughly 2000 items in all, which

are names
(people, companies, organisations etc) and an amount

they have paid.

The data is collected from four separate sources and

trimmed already
to
avoid extra spaces. The upshot is that there are

instances where the
same name is entered more than one way. For example:

Alan B Chadd
Bob Charles
Chadd Alan B

Now a human eye can easily spot that the first and third

are
(probably) the same person and it is worth investigating

further.
This is fine when they are close to each other, but not

practical
when they are two pages apart.

Therefore, I would like to write a formula that could

give a score
that
ranks the likely similar entries.

I am thinking that something like this would be good:

1) Take each letter of the target name, and count how

many times it
appears in every other of the 2000 entries ignoring

capitals (it is
possible
someone will have typed in a name with or without any

proper
capitalisation).

2) Add up the totals for each of the 2000 entries

3) Show any items with a score over X (to be picked by

trial) or just
sort them by the score.


Example From Above:

The score for "Alan B Chadd" against "Bob Charles" would

be:

A = 0
l = 1
a = 0
n = 0
Space = 1
B = 1
Space = 1
C = 1
h = 1
a = 0
d = 0
d = 0

Total = 5

The score for "Alan B Chadd" against "Chadd Alan B"

would be:

A = 3
l = 1
a = 3
n = 1
Space = 2
B = 1
Space = 2
C = 1
h = 1
a = 3
d = 2
d = 2

Total = 22

Obviously it is not perfect, but it should be good

enough to point the
human in the right direction!


Can anyone suggest a way to do this?

Thanks in advance,

Alan.




.

  #3   Report Post  
Alan
 
Posts: n/a
Default

"Jason Morin" wrote
in message ...

Not perfect either, but...with names in column A, place
this formula in row 1 of an open column and fill down:

=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1)) ),1)))

Then sort the data based on this column. Similar strings
should be closely grouped.

HTH
Jason
Atlanta, GA


Hi Jason,

I like the simplicity of your approach - just adding the code values
together.

However, if I change my example slightly and apply the formula I get:

Alan B Chadd................978
Bob Charles...................1013
Chadd A B.....................663


I think that in a larger sample we would still find that the first and
third lines are too far apart for a human to spot them together.

The problem arises because it is very sensitive to the number of
characters. Whether someone has, for example, entered a middle name
or not, will affect the score too much. Similarly the scores for the
following two versions of a company name are too far apart:

ABC LTD...............458
abc limited...............1070


I have tried thinking about a way to modify your approach to make it
work better under this circumstance. I could apply an UPPER function
to each string first giving the improved result of:

ABC LTD...............458
ABC LIMITED........750


It can be further improved by subtracting 31 from the code values (a
space is char(32) so let's just avoid negative results):

=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1)) ),1))-31)

This gives a further improvement:

ABC LTD........................241
ABC LIMITED...............409


But this is still likely to put them a long way apart in a sort and it
*feels* to me that we are just compressing the distribution of results
rather than improving the ability to identify similarities.

Do you have any other ideas?

Thanks for your help,

Alan.






  #4   Report Post  
Jason Morin
 
Posts: n/a
Default

Well, with the examples you gave, my example worked
pretty well. It's obvious that you data is much more
inconsistent in format than your original post lead me to
believe.

What you're looking for is an extremely complex
algorithem to handle such wide disparities in the data.
Even then, you won't get close on some of them. I think
the solution lies upstream in the process, before
receiving the data. If you have any influence on your
sources, you would try to insert some type of consistency
in the way they report their data, so that you don't end
up in the mess you're in now.

I'd suggest applying what I've given you, enhance it as
much as you can, then suck it up and do a manual check on
the rest.

Jason

-----Original Message-----
"Jason Morin" wrote
in message ...

Not perfect either, but...with names in column A, place
this formula in row 1 of an open column and fill down:

=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN

(A1))),1)))

Then sort the data based on this column. Similar

strings
should be closely grouped.

HTH
Jason
Atlanta, GA


Hi Jason,

I like the simplicity of your approach - just adding the

code values
together.

However, if I change my example slightly and apply the

formula I get:

Alan B Chadd................978
Bob Charles...................1013
Chadd A B.....................663


I think that in a larger sample we would still find that

the first and
third lines are too far apart for a human to spot them

together.

The problem arises because it is very sensitive to the

number of
characters. Whether someone has, for example, entered a

middle name
or not, will affect the score too much. Similarly the

scores for the
following two versions of a company name are too far

apart:

ABC LTD...............458
abc limited...............1070


I have tried thinking about a way to modify your

approach to make it
work better under this circumstance. I could apply an

UPPER function
to each string first giving the improved result of:

ABC LTD...............458
ABC LIMITED........750


It can be further improved by subtracting 31 from the

code values (a
space is char(32) so let's just avoid negative results):

=SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1) )),1))-

31)

This gives a further improvement:

ABC LTD........................241
ABC LIMITED...............409


But this is still likely to put them a long way apart in

a sort and it
*feels* to me that we are just compressing the

distribution of results
rather than improving the ability to identify

similarities.

Do you have any other ideas?

Thanks for your help,

Alan.






.

  #5   Report Post  
Bob Phillips
 
Posts: n/a
Default


"Jason Morin" wrote in message
...


I'd suggest applying what I've given you, enhance it as
much as you can, then suck it up and do a manual check on
the rest.


or use it as supporting material as to how difficult the problem is, and how
it is better to fix the source of the problem, not the manifestation.




  #6   Report Post  
Alan
 
Posts: n/a
Default

"Jason Morin"
wrote in message ...

Well, with the examples you gave, my example worked
pretty well. It's obvious that you data is much more
inconsistent in format than your original post lead me to
believe.


Agreed - I do appreciate your assistance.


What you're looking for is an extremely complex
algorithem to handle such wide disparities in the data.
Even then, you won't get close on some of them. I think
the solution lies upstream in the process, before
receiving the data. If you have any influence on your
sources, you would try to insert some type of consistency
in the way they report their data, so that you don't end
up in the mess you're in now.


Unfortunately the data is already in existence for many periods and
that is what we have to work with.

The systems going forwards are fully integrated so the issue does
exist in the same way.


I'd suggest applying what I've given you, enhance it as
much as you can, then suck it up and do a manual check on
the rest.

Jason


I wish we could afford the manual time, it would be easier from a
management perspective at least - I will keep on looking for an
algorithm, but thank you again for your assistance.

Alan.



  #7   Report Post  
Alan
 
Posts: n/a
Default


"Bob Phillips" wrote in message
...

"Jason Morin" wrote in message
...


I'd suggest applying what I've given you, enhance it as
much as you can, then suck it up and do a manual check on
the rest.


or use it as supporting material as to how difficult the problem is,
and how it is better to fix the source of the problem, not the
manifestation.


Hi Bob,

Unfortunately the data is already in existence for many periods and
that is what we have to work with for the statutory returns.

The systems going forwards are fully integrated so the issue does
exist in the same way - the source of the problem is already fixed,
now I just have to find a way to report the history accurately enough
for the authorities.

I will keep looking for an algorithm - if you have any other ideas,
please do post back.

Thanks,

Alan.





Reply
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules

Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
space between text strings with concatenate Jeff Excel Discussion (Misc queries) 2 March 3rd 05 06:54 PM
Compare cells/columns and highlight matching text strings luxbelle Excel Worksheet Functions 1 February 25th 05 06:34 PM
How to make a cell recognize multiple text strings? Tourcat Excel Worksheet Functions 1 February 8th 05 08:29 PM
Read Text File into Excel Using VBA Willie T Excel Discussion (Misc queries) 13 January 8th 05 12:37 AM
Filter long Text strings Cimorene Excel Worksheet Functions 0 December 1st 04 04:01 AM


All times are GMT +1. The time now is 12:45 AM.

Powered by vBulletin® Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 ExcelBanter.
The comments are property of their posters.
 

About Us

"It's about Microsoft Excel"