Home |
Search |
Today's Posts |
#1
Posted to microsoft.public.excel.programming
|
|||
|
|||
Parsing Address, City ST ZIP
Someone brilliant have code to parse messy Address City State ZIP into 4
separate fields? If address has Apt number it should NOT be parsed separate but included in address fields. Here are examples we get of data downloaded daily...the main common theme is CA is consistent: 15629 FAIRFORD AVENUE, NORWALK, CA 90650 5947-5949 SOUTH SAN PEDRO, STREET, LOS ANGELES, CA 90003 Vacant Land, Long Beach, CA 4627-4627 1/2 STRANGE AVENUE, Los Angeles (Area), CA 90022 15522 SYLVAN STREET, (VAN NUYS AREA) LOS ANGELES, CA 91411 4325, 4327 & 4329 Stern Ave., North Hollywood, CA 91423 19200 LAHEY STREET, UNIT 4,, Los Angeles, CA 91326 Example 1 is typical easy parse, majority of data. Example 2 the word STREET should not have been separated by comma. Example 3 has no zip code or street numbers. Example 4 has dash and / in street numbers and (AREA) should be removed. Example 5 (VAN NUYS AREA) should be removed. Example 6 has extra comma in address at front. Example 7 should have UNIT 4 glued to address, and extra commas need removal. If someone can write something to handle all of this, you are awsome! If you can get everything except example 3 that is ok....I can strip those out ahead of time...they either say VACANT LAND or RAW LAND...but I would like to have them if possible. Thanks in advance! |
#2
Posted to microsoft.public.excel.programming
|
|||
|
|||
Parsing Address, City ST ZIP
Is this data text data before it was imported into excel. It appears this
data went through some program that created this mess. Do you have the data before this mess was created. there is no good way of getting STREET out of the 2nd example unless you search for certain key words being by themselves lis Street, Ave. Avenue. You included only a small portion of the lines so without seeing a lot of data I can't predict what types of algorithms will or will not work. I would be guessing. If this data was hand typed and we are trying to correct human errors than no algorithm will work all the time because humans are not predictable, only machines are predictable. We can only try are best to fix as many problems as possible with a macro when humans are the causes of the problems. "Carl S." wrote: Someone brilliant have code to parse messy Address City State ZIP into 4 separate fields? If address has Apt number it should NOT be parsed separate but included in address fields. Here are examples we get of data downloaded daily...the main common theme is CA is consistent: 15629 FAIRFORD AVENUE, NORWALK, CA 90650 5947-5949 SOUTH SAN PEDRO, STREET, LOS ANGELES, CA 90003 Vacant Land, Long Beach, CA 4627-4627 1/2 STRANGE AVENUE, Los Angeles (Area), CA 90022 15522 SYLVAN STREET, (VAN NUYS AREA) LOS ANGELES, CA 91411 4325, 4327 & 4329 Stern Ave., North Hollywood, CA 91423 19200 LAHEY STREET, UNIT 4,, Los Angeles, CA 91326 Example 1 is typical easy parse, majority of data. Example 2 the word STREET should not have been separated by comma. Example 3 has no zip code or street numbers. Example 4 has dash and / in street numbers and (AREA) should be removed. Example 5 (VAN NUYS AREA) should be removed. Example 6 has extra comma in address at front. Example 7 should have UNIT 4 glued to address, and extra commas need removal. If someone can write something to handle all of this, you are awsome! If you can get everything except example 3 that is ok....I can strip those out ahead of time...they either say VACANT LAND or RAW LAND...but I would like to have them if possible. Thanks in advance! |
#3
Posted to microsoft.public.excel.programming
|
|||
|
|||
Parsing Address, City ST ZIP
On Wed, 30 Sep 2009 00:24:01 -0700, Carl S.
wrote: Someone brilliant have code to parse messy Address City State ZIP into 4 separate fields? If address has Apt number it should NOT be parsed separate but included in address fields. Here are examples we get of data downloaded daily...the main common theme is CA is consistent: 15629 FAIRFORD AVENUE, NORWALK, CA 90650 5947-5949 SOUTH SAN PEDRO, STREET, LOS ANGELES, CA 90003 Vacant Land, Long Beach, CA 4627-4627 1/2 STRANGE AVENUE, Los Angeles (Area), CA 90022 15522 SYLVAN STREET, (VAN NUYS AREA) LOS ANGELES, CA 91411 4325, 4327 & 4329 Stern Ave., North Hollywood, CA 91423 19200 LAHEY STREET, UNIT 4,, Los Angeles, CA 91326 Example 1 is typical easy parse, majority of data. Example 2 the word STREET should not have been separated by comma. Example 3 has no zip code or street numbers. Example 4 has dash and / in street numbers and (AREA) should be removed. Example 5 (VAN NUYS AREA) should be removed. Example 6 has extra comma in address at front. Example 7 should have UNIT 4 glued to address, and extra commas need removal. If someone can write something to handle all of this, you are awsome! If you can get everything except example 3 that is ok....I can strip those out ahead of time...they either say VACANT LAND or RAW LAND...but I would like to have them if possible. Thanks in advance! For the examples you show, this can be done using regular expressions. In the sample code below, you can either select the range to parse and then run the macro; or you can use the Functions individually to dynamically extract the different portions. The latter would be appropriate if you will be editing the basic list, and want to see changes dynamically. Notes: 1. I used early binding, so you MUST SET A REFERENCE (see tools/references) to Microsoft VBScript Regulare Expressions 5.5. If you are going to distribute this, you might want to use late binding instead. 2. So far as the extraneous commas are concerned, the Address function will only remove them if they are at the end of the string, or if they are followed by the word Street or Ave. This may or may not be satisfactory for you. 3. Zip codes are tested for being "proper". They can be either 9 digits, 5 digits, or in the pattern of 00000-0000 4. If your patterns don't conform roughly to what you have posted, then the regex may fail, and return blanks. If this is a problem, you'll need to post more examples. ---------------------------------------------------- Option Explicit Dim re As RegExp, mc As MatchCollection Sub ParseAddr() Dim rg As Range, c As Range Dim s As String 'set up range to parse 'could be done differently Set rg = Selection Range(rg.Offset(0, 1), rg.Offset(0, 4)).Clear For Each c In rg s = c.Text c.Offset(0, 1).Value = Addr(s) c.Offset(0, 2).Value = City(s) c.Offset(0, 3).Value = State(s) 'format as text so don't drop leading "0's" c.Offset(0, 4).NumberFormat = "@" c.Offset(0, 4).Value = Zip(s) Next c End Sub '-------------------------------------- Function Addr(s As String) As String Dim sTemp As String Set re = New RegExp re.IgnoreCase = True re.Pattern = "^.*(?=,\s*[^,]+,[^,]+$)" If re.Test(s) = True Then Set mc = re.Execute(s) sTemp = mc(0) 'strip out some of the commas re.Pattern = ",(?=$|\s+(Street|Ave))" Addr = re.Replace(sTemp, "") End If Set re = Nothing End Function '----------------------------------------- Function City(s As String) As String Dim sTemp As String Set re = New RegExp re.Pattern = ",\s*([^,]+),[^,]+$" If re.Test(s) = True Then Set mc = re.Execute(s) sTemp = mc(0).SubMatches(0) 'strip out text enclosed by parentheses re.Pattern = "\s*\([^)]+\)\s*" City = re.Replace(sTemp, "") End If Set re = Nothing End Function '-------------------------------------- Function State(s As String) As String Set re = New RegExp re.Pattern = _ "([A-Z]{2})(?=(\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$)|\s*$)" If re.Test(s) = True Then Set mc = re.Execute(s) State = mc(0) End If Set re = Nothing End Function '---------------------------- Function Zip(s As String) As String Set re = New RegExp re.Pattern = "\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$" If re.Test(s) = True Then Set mc = re.Execute(s) Zip = mc(0) End If Set re = Nothing End Function ========================================== --ron |
#4
Posted to microsoft.public.excel.programming
|
|||
|
|||
Parsing Address, City ST ZIP
Hey Joel...if routine can handle the examples given that will handle 99.5% of
the data and I will throw out the rest. It does not have to be 100%. Currently I'm getting 90%. I can write something, it will take me some time....but I thought there were some experienced guys out there who could slam dunk this no problem. I can do it easy in my other language, but am new to Excel VBA. "Joel" wrote: Is this data text data before it was imported into excel. It appears this data went through some program that created this mess. Do you have the data before this mess was created. there is no good way of getting STREET out of the 2nd example unless you search for certain key words being by themselves lis Street, Ave. Avenue. You included only a small portion of the lines so without seeing a lot of data I can't predict what types of algorithms will or will not work. I would be guessing. If this data was hand typed and we are trying to correct human errors than no algorithm will work all the time because humans are not predictable, only machines are predictable. We can only try are best to fix as many problems as possible with a macro when humans are the causes of the problems. "Carl S." wrote: Someone brilliant have code to parse messy Address City State ZIP into 4 separate fields? If address has Apt number it should NOT be parsed separate but included in address fields. Here are examples we get of data downloaded daily...the main common theme is CA is consistent: 15629 FAIRFORD AVENUE, NORWALK, CA 90650 5947-5949 SOUTH SAN PEDRO, STREET, LOS ANGELES, CA 90003 Vacant Land, Long Beach, CA 4627-4627 1/2 STRANGE AVENUE, Los Angeles (Area), CA 90022 15522 SYLVAN STREET, (VAN NUYS AREA) LOS ANGELES, CA 91411 4325, 4327 & 4329 Stern Ave., North Hollywood, CA 91423 19200 LAHEY STREET, UNIT 4,, Los Angeles, CA 91326 Example 1 is typical easy parse, majority of data. Example 2 the word STREET should not have been separated by comma. Example 3 has no zip code or street numbers. Example 4 has dash and / in street numbers and (AREA) should be removed. Example 5 (VAN NUYS AREA) should be removed. Example 6 has extra comma in address at front. Example 7 should have UNIT 4 glued to address, and extra commas need removal. If someone can write something to handle all of this, you are awsome! If you can get everything except example 3 that is ok....I can strip those out ahead of time...they either say VACANT LAND or RAW LAND...but I would like to have them if possible. Thanks in advance! |
#5
Posted to microsoft.public.excel.programming
|
|||
|
|||
Parsing Address, City ST ZIP
Parse it back-to-front
Take off the zip, if it exists plus any spaces and comma then the state, if it exists, and any space/comma before it then look at the next comma-separated section. If it exists, take it and remove parenthetical text as the city. What's left is the address... This seems to meet your requirements. You can also vet the state by word, abbrev, or start-of-word. You can vet the city from a larger list, as well as lookup city by zip, given the right list. |
#6
Posted to microsoft.public.excel.programming
|
|||
|
|||
Parsing Address, City ST ZIP
Thanks! I have no idea what early or late binding is, and I may need to
export the module and send it to a few other offices, and would like it to be just a straight import to work...will binding affect this? There is a lot more to Excel VBA than I have brushed with. Bottom line...I was hoping for something like Cell A1 has address, after running macro cells A1, A2, A3, and A4 will have parsed data. Your answer goes over my head...I think I will write something using MID, LEFT, FIND, SUBSTITUTE, ISNUMBER, etc. I understand it much easier. CA will be chunk I start with, ZIP will be chunk to right, City will be chunk to left, and chunk before city will be address and I will strip all commas out...don't need them. Appriciate your effort though. "Ron Rosenfeld" wrote: On Wed, 30 Sep 2009 00:24:01 -0700, Carl S. wrote: Someone brilliant have code to parse messy Address City State ZIP into 4 separate fields? If address has Apt number it should NOT be parsed separate but included in address fields. Here are examples we get of data downloaded daily...the main common theme is CA is consistent: 15629 FAIRFORD AVENUE, NORWALK, CA 90650 5947-5949 SOUTH SAN PEDRO, STREET, LOS ANGELES, CA 90003 Vacant Land, Long Beach, CA 4627-4627 1/2 STRANGE AVENUE, Los Angeles (Area), CA 90022 15522 SYLVAN STREET, (VAN NUYS AREA) LOS ANGELES, CA 91411 4325, 4327 & 4329 Stern Ave., North Hollywood, CA 91423 19200 LAHEY STREET, UNIT 4,, Los Angeles, CA 91326 Example 1 is typical easy parse, majority of data. Example 2 the word STREET should not have been separated by comma. Example 3 has no zip code or street numbers. Example 4 has dash and / in street numbers and (AREA) should be removed. Example 5 (VAN NUYS AREA) should be removed. Example 6 has extra comma in address at front. Example 7 should have UNIT 4 glued to address, and extra commas need removal. If someone can write something to handle all of this, you are awsome! If you can get everything except example 3 that is ok....I can strip those out ahead of time...they either say VACANT LAND or RAW LAND...but I would like to have them if possible. Thanks in advance! For the examples you show, this can be done using regular expressions. In the sample code below, you can either select the range to parse and then run the macro; or you can use the Functions individually to dynamically extract the different portions. The latter would be appropriate if you will be editing the basic list, and want to see changes dynamically. Notes: 1. I used early binding, so you MUST SET A REFERENCE (see tools/references) to Microsoft VBScript Regulare Expressions 5.5. If you are going to distribute this, you might want to use late binding instead. 2. So far as the extraneous commas are concerned, the Address function will only remove them if they are at the end of the string, or if they are followed by the word Street or Ave. This may or may not be satisfactory for you. 3. Zip codes are tested for being "proper". They can be either 9 digits, 5 digits, or in the pattern of 00000-0000 4. If your patterns don't conform roughly to what you have posted, then the regex may fail, and return blanks. If this is a problem, you'll need to post more examples. ---------------------------------------------------- Option Explicit Dim re As RegExp, mc As MatchCollection Sub ParseAddr() Dim rg As Range, c As Range Dim s As String 'set up range to parse 'could be done differently Set rg = Selection Range(rg.Offset(0, 1), rg.Offset(0, 4)).Clear For Each c In rg s = c.Text c.Offset(0, 1).Value = Addr(s) c.Offset(0, 2).Value = City(s) c.Offset(0, 3).Value = State(s) 'format as text so don't drop leading "0's" c.Offset(0, 4).NumberFormat = "@" c.Offset(0, 4).Value = Zip(s) Next c End Sub '-------------------------------------- Function Addr(s As String) As String Dim sTemp As String Set re = New RegExp re.IgnoreCase = True re.Pattern = "^.*(?=,\s*[^,]+,[^,]+$)" If re.Test(s) = True Then Set mc = re.Execute(s) sTemp = mc(0) 'strip out some of the commas re.Pattern = ",(?=$|\s+(Street|Ave))" Addr = re.Replace(sTemp, "") End If Set re = Nothing End Function '----------------------------------------- Function City(s As String) As String Dim sTemp As String Set re = New RegExp re.Pattern = ",\s*([^,]+),[^,]+$" If re.Test(s) = True Then Set mc = re.Execute(s) sTemp = mc(0).SubMatches(0) 'strip out text enclosed by parentheses re.Pattern = "\s*\([^)]+\)\s*" City = re.Replace(sTemp, "") End If Set re = Nothing End Function '-------------------------------------- Function State(s As String) As String Set re = New RegExp re.Pattern = _ "([A-Z]{2})(?=(\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$)|\s*$)" If re.Test(s) = True Then Set mc = re.Execute(s) State = mc(0) End If Set re = Nothing End Function '---------------------------- Function Zip(s As String) As String Set re = New RegExp re.Pattern = "\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$" If re.Test(s) = True Then Set mc = re.Execute(s) Zip = mc(0) End If Set re = Nothing End Function ========================================== --ron |
#7
Posted to microsoft.public.excel.programming
|
|||
|
|||
Parsing Address, City ST ZIP
Thanks, I am thinking along those lines too. Problem is I don't have to the
technical know how to do it easily. Kind of like knowing how a jet engine works but unable to write a manual on it in Japanese. Oh well. Guess I have to tough it out, nothing in life is free but I have got some great snippits here in the past. Problably take me two days to write the code. Was hoping someone had a routine for this they could slam dunk to me. "ras" wrote: Parse it back-to-front Take off the zip, if it exists plus any spaces and comma then the state, if it exists, and any space/comma before it then look at the next comma-separated section. If it exists, take it and remove parenthetical text as the city. What's left is the address... This seems to meet your requirements. You can also vet the state by word, abbrev, or start-of-word. You can vet the city from a larger list, as well as lookup city by zip, given the right list. |
#8
Posted to microsoft.public.excel.programming
|
|||
|
|||
Parsing Address, City ST ZIP
On Wed, 30 Sep 2009 11:40:06 -0700, Carl S.
wrote: Thanks! I have no idea what early or late binding is, and I may need to export the module and send it to a few other offices, and would like it to be just a straight import to work...will binding affect this? There is a lot more to Excel VBA than I have brushed with. Bottom line...I was hoping for something like Cell A1 has address, after running macro cells A1, A2, A3, and A4 will have parsed data. Your answer goes over my head...I think I will write something using MID, LEFT, FIND, SUBSTITUTE, ISNUMBER, etc. I understand it much easier. CA will be chunk I start with, ZIP will be chunk to right, City will be chunk to left, and chunk before city will be address and I will strip all commas out...don't need them. Appriciate your effort though. Well, below is the code re-written so you don't have to worry about the "binding" and can merely export it as is to wherever. It seems to work fine on the data you presented; removing both the stuff within parentheses as well as the comma after UNIT 4 in #7 and the comma before Street in #2. It also handles #3 properly. It's easy (actually easier) to remove all the commas in the address field, but I wasn't sure that's what you wanted. In any event, if you get tired of working with the text worksheet functions, give this a try. It can be tweaked pretty easily once you figure out the "rules". As I wrote before, you can either Select the range of addresses, and then run the macro to Parse out the segments; or you can use the individual UDF's. Good luck! ============================================= Option Explicit Dim re As Object, mc As Object Sub ParseAddr() Dim rg As Range, c As Range Dim s As String 'set up range to parse 'could be done differently Set rg = Selection Range(rg.Offset(0, 1), rg.Offset(0, 4)).Clear For Each c In rg s = c.Text c.Offset(0, 1).Value = Addr(s) c.Offset(0, 2).Value = City(s) c.Offset(0, 3).Value = State(s) 'format as text so don't drop leading "0's" c.Offset(0, 4).NumberFormat = "@" c.Offset(0, 4).Value = Zip(s) Next c End Sub '-------------------------------------- Function Addr(s As String) As String Dim sTemp As String Set re = CreateObject("vbscript.regexp") re.IgnoreCase = True re.Pattern = "^.*(?=,\s*[^,]+,[^,]+$)" If re.Test(s) = True Then Set mc = re.Execute(s) sTemp = mc(0) 'strip out some of the commas re.Pattern = ",(?=$|\s+(Street|Ave))" Addr = re.Replace(sTemp, "") End If Set re = Nothing End Function '----------------------------------------- Function City(s As String) As String Dim sTemp As String Set re = CreateObject("vbscript.regexp") re.Pattern = ",\s*([^,]+),[^,]+$" If re.Test(s) = True Then Set mc = re.Execute(s) sTemp = mc(0).SubMatches(0) 'strip out text enclosed by parentheses re.Pattern = "\s*\([^)]+\)\s*" City = re.Replace(sTemp, "") End If Set re = Nothing End Function '-------------------------------------- Function State(s As String) As String Set re = CreateObject("vbscript.regexp") re.Pattern = _ "([A-Z]{2})(?=(\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$)|\s*$)" If re.Test(s) = True Then Set mc = re.Execute(s) State = mc(0) End If Set re = Nothing End Function '---------------------------- Function Zip(s As String) As String Set re = CreateObject("vbscript.regexp") re.Pattern = "\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$" If re.Test(s) = True Then Set mc = re.Execute(s) Zip = mc(0) End If Set re = Nothing End Function ==================================== --ron |
Reply |
Thread Tools | Search this Thread |
Display Modes | |
|
|
Similar Threads | ||||
Thread | Forum | |||
Seperate Name, Street Address & City, State Zip | Excel Worksheet Functions | |||
separate an address street/city,state,zip within a cell? | Excel Worksheet Functions | |||
parsing out an undelimited address and city | Excel Discussion (Misc queries) | |||
parsing out an undelimited address and city | Excel Discussion (Misc queries) | |||
long address list, name-address-city, listed vertically, how do y. | Excel Discussion (Misc queries) |