Harlan Grove
 

Karl Burrows wrote...
....
It boils down to taking a column of builder names and stripping off the data
that is not common to all values for that builder, whether that means
removing 60', - 60, - Greenbrier, or any other designation they may come up
with. I think that is where the hang up is. Unless you can compare all the
builder values and then say well these are pretty much alike other than the
"60" at the end or "townhome" or other, so let's remove that portion. I
haven't had a chance to try the coding and maybe that is what it does.

....

The problem is that some of the qualifiers added to some of the builder
names could be legitimate parts of a person's or company's name. I'm
not saying that's in fact the case, just that it could be. For example,
Home and House can be surnames.

If the only added qualifiers you have to deal with begin with a decimal
numeral or a hyphen, you could use regular expressions to remove them.
But you also have normal words appended with no more than a space
separating them from the builder name. Only if *YOU* could compile an
exhaustive list of such words, ones that would always be deleted and
would never erroneously truncate any builder's name, could you use that
list as tokens to remove from your records. Then feed what's left
through a dictionary object to eliminate duplicates.
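
For what it's worth, here is a rough sketch of that three-step idea
(regex for numeral/hyphen qualifiers, a hand-built word list, then a
dictionary to drop duplicates). It's written in Python rather than VBA
purely for illustration. The names QUALIFIER_WORDS, strip_qualifiers and
unique_builders are made up for this example, the builder names are
invented, and the word list is only a placeholder for whatever list you
would have to compile and vet yourself.

```python
import re

# Hypothetical word list: placeholder only. The real list must be compiled
# by hand and checked so it never truncates a legitimate builder name.
QUALIFIER_WORDS = {"townhome", "townhomes", "greenbrier"}

# Whitespace followed by a token that starts with a digit or a hyphen,
# plus everything after it to the end of the string.
NUMERIC_OR_HYPHEN_TAIL = re.compile(r"\s+[-\d].*$")


def strip_qualifiers(name):
    """Remove trailing qualifiers from one builder name."""
    # Step 1: cut off anything starting with a numeral or a hyphen.
    name = NUMERIC_OR_HYPHEN_TAIL.sub("", name.strip())
    # Step 2: drop trailing words found in the qualifier list,
    # but never remove the last remaining word.
    words = name.split()
    while len(words) > 1 and words[-1].lower() in QUALIFIER_WORDS:
        words.pop()
    return " ".join(words)


def unique_builders(names):
    """Return cleaned names with duplicates removed, in first-seen order."""
    seen = {}  # the dictionary: lowercased name -> first spelling seen
    for raw in names:
        cleaned = strip_qualifiers(raw)
        if cleaned:
            seen.setdefault(cleaned.lower(), cleaned)
    return list(seen.values())


if __name__ == "__main__":
    sample = [
        "Smith Homes 60'",
        "Smith Homes - 60",
        "Smith Homes - Greenbrier",
        "Smith Homes Townhome",
        "Jones-Baker Builders 50",
    ]
    print(unique_builders(sample))
```

Note that the regex only fires on a numeral or hyphen that follows
whitespace, so a hyphen inside a name (Jones-Baker) survives, but a
name like "A 1 Builders" would still be cut short. That's the same
risk as with the word list: it's only as safe as your knowledge of the
data.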