Karl Burrows
 

Thus, the problem. Because each builder can have a unique name with some of
those qualifiers in it, it is nearly impossible to identify the true
name. I think developing a naming convention, such as adding a hyphen or
some other character so that we can strip everything to the right of it,
is going to be the only way to go. As it is now, it is truly
"fuzzy logic!"
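
For what it's worth, here is a minimal sketch of what that convention would buy us, written in Python rather than in the workbook. The builder names, the function name strip_qualifier, and the choice of "-" as the delimiter are all made up for illustration; it only works if no real builder name contains the delimiter.

```python
def strip_qualifier(name: str, delimiter: str = "-") -> str:
    """Return the part of name to the left of the first delimiter, trimmed."""
    # partition() splits at the FIRST occurrence; if the delimiter is absent,
    # the whole string comes back as the first element.
    base, _, _ = name.partition(delimiter)
    return base.strip()


# Hypothetical sample data following the proposed convention.
names = ["Acme Homes - 60", "Acme Homes - Greenbrier", "Acme Homes"]

# dict.fromkeys keeps the first occurrence of each cleaned name, in order.
unique = list(dict.fromkeys(strip_qualifier(n) for n in names))
print(unique)
```

A builder whose real name contains a hyphen would be cut short by this, so the delimiter would probably have to be something that never appears in a name.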

Thanks for all the help!

"Harlan Grove" wrote in message
oups.com...
Karl Burrows wrote...
....
It boils down to taking a column of builder names and stripping off the data
that is not common to all values for that builder, whether that means
removing 60', - 60, - Greenbrier, or any other designation they may come up
with. I think that is where the hang-up is, unless you can compare all the
builder values and then say, well, these are pretty much alike other than the
"60" at the end or "townhome" or other, so let's remove that portion. I
haven't had a chance to try the coding, and maybe that is what it does.

....

The problem is that some of the qualifiers added to some of the builder
names could be legitimate parts of a person's or company's name. I'm
not saying that's in fact the case, just that it could be. For example,
Home and House can be surnames.

If the only added qualifiers you have to deal with involve anything
beginning with a decimal numeral or a hyphen, you could use regular
expressions to remove them. But you also have normal words appended
with no more than a space separating them from the builder name. If
*YOU* could compile an exhaustive list of such words that would always
be deleted and never erroneously truncate any builder's name, then you
could use that list as tokens to remove from your records.
Then feed what's left through a dictionary object to eliminate
duplicates.
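
The three steps above (regular expression, token list, dictionary) could be sketched roughly as follows. This is Python rather than VBA, and it is only an illustration under assumptions: the sample names and the contents of QUALIFIER_WORDS are hypothetical, and the pattern assumes a qualifier is always preceded by whitespace, so a hyphen inside a name such as "Smith-Jones" is left alone. A name with a number in the middle would still be truncated.

```python
import re

# Hypothetical token list. In practice someone who knows the data must
# compile it by hand and confirm that no builder's real name ends in one.
QUALIFIER_WORDS = {"townhome", "townhomes"}

# Whitespace followed by a hyphen or a decimal digit, through end of string.
TRAILING_QUALIFIER = re.compile(r"\s+[-\d].*$")


def clean_builder(name: str) -> str:
    """Strip numeric/hyphen qualifiers, then trailing qualifier words."""
    name = TRAILING_QUALIFIER.sub("", name.strip())
    words = name.split()
    # Never pop the last remaining word, so a name is never emptied.
    while len(words) > 1 and words[-1].lower() in QUALIFIER_WORDS:
        words.pop()
    return " ".join(words)


def unique_builders(names):
    """Deduplicate cleaned names case-insensitively, keeping first spelling."""
    seen = {}
    for raw in names:
        cleaned = clean_builder(raw)
        seen.setdefault(cleaned.lower(), cleaned)
    return list(seen.values())


# Hypothetical sample records.
records = [
    "Acme Homes 60'",
    "Acme Homes - 60",
    "Acme Homes - Greenbrier",
    "acme homes Townhome",
    "House Builders",
    "Smith-Jones Builders",
]
print(unique_builders(records))
```

Note that "House Builders" survives only because "house" is not in the token list, which is exactly the risk with words that can also be surnames.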