Thread: A Challenge
View Single Post
  #5   Report Post  
Posted to microsoft.public.excel.programming
William Benson[_2_] William Benson[_2_] is offline
external usenet poster
 
Posts: 230
Default A Challenge

I suggest you use Access, I would. I would plan a final table layout that
has every field I might possibly need:

1st Stage... Prefixes:
Dr., Drs., Rev., Mr., Mr. and Mrs., Col., Lt., Sir, Hon., etc...
Intermediate result Fields: Original Name, Prefix, Remainder1.

2nd Stage... Suffix1 (Remember, there could be more and more suffixes).
M.D., J.D., CPA, Ph. D. (or PhD.), Sr., Jr., Senior, Junior, etc...
Intermediate result Fields: Original, Prefix, Suffix1, Remainder2.

If there is no records with any values in Suffix1, move on -- otherwise
process for:
3rd Stage... Suffix2 (Remember, there could be more and more suffixes).
M.D., J.D., CPA, Ph. D. (or PhD.), Sr., Jr., Senior, Junior, etc...
Intermediate result Fields: Original, Prefix, Suffix1, Suffix2,
Remainder3.

If there is no records with any values in Suffix2, move on -- otherwise
process for:
4th Stage... Suffix3 (Remember, there could be more and more suffixes).
M.D., J.D., CPA, Ph. D. (or PhD.), Sr., Jr., Senior, Junior, etc...
Intermediate result Fields: Original, Prefix, Suffix1, Suffix2,
Suffix3, Remainder4.

If there is no records with any values in Suffix3, move on -- otherwise
process for:
Intermediate result Fields: Original, Prefix, Suffix1, Suffix2,
Suffix3, Suffix4, Remainder5. and so on.

Now comes the hard logic. Try to eliminate as many obvious, simple choices
as possible. Say, Remainder5 has no apostrophes, no hyphens, no "St." as in
"St. James", all the wierd things you can think of are not there, as well as
it is clearly in the format: "XXX* Y. ZZZ*" ... move all such records
to a "possibly complete" table, and inspect as best you can.

Then look over the remaining data, and see where you need to go next.

The point I want to hit home, having done this before, is don't go crazy
with logic until you find you have to. Another successful strategy is to put
all your known problems into one place, like you could put all the names
with hyphens in them in their own table after you get the prefixes and the
suffixes nailed down, and go back looking for additional logic you need
later. The nice part of Access is the routines are quick, it is easy to
filter on the fly (My favorite feature is "Filter Excluding Selection"), and
saving data and backing up is a breeze PLUS you never have to worry about
the number of records.

Good Luck, I know you are really smart and will conquer 99.9% of this. I
guess it's the remaining 1/10 of 1 percent that makes the miracle :-0



"


You have written very clearly, so it shows you are a very clear thinker and
probably do not need me to offer guidance but I will do anyway because you
asked. I would definitely use "stages". At each stage, I would pull out what
I can conclude about the
"mjones" wrote in message
oups.com...
Hi,

My client is expecting me to perform miracles. I thought I'd throw
this one out for those who like a challenge. Many large spreadsheets
(so large they have to be split up to get them into Excel) have a name
column. An example of the type of data might be:

Mary Jane Fox
Marg F. Smith
Matt Del Sandro
Frank George Di Marco
Paul Joseph Rental/Storage Ltd.
Kim O'Neil
Don Jones-St. James
Andrew K.G. Black
Mr. and Mrs. Albert F. Casey
Brown
Sisters of the Holly Cross of 3rd Street
M Thomas
Christine J. Main & Ken Henry

My client wants me to separate the last name into its own column so
they can run duplicate tests with other files against it and other
things. Remember, there could be 65,000 names.

As you can imagine, this is very time consuming so anything will help.
I have been using space delimited to separate the fields and then doing
things like sorting or splitting off the first two characters and
sorting what's left to separate out the single initials and then
concatenating back again.

I know this can't be an exact science, but I'm wondering if a macro can
be written to perform some of the work and perhaps stop and ask about
questionable names. For example,

- single letters with or without periods would be first names
- a database of words could show companies and take the whole word as
last name like Ltd. or Co.
- a database of words like Di, St., Del, O', or Le could be defined as
part of the last name
- one word names are last names
- take the last of two names and forget the first one
- stuff like that

When you use text to columns space delimited it puts the first word in
the first column and second in second column, etc. so depending upon
how many words are in the name, you never know which column the last
word will end up in. It would be nice to make it start in the last
column. We could use Access, too, if that would make any difference.

I don't expect to receive much help on this, but you never know. I
find some people in this newsgroup have the most amazing ideas.

Thanks,

Michele