ExcelBanter

ExcelBanter (https://www.excelbanter.com/)
-   Excel Programming (https://www.excelbanter.com/excel-programming/)
-   -   Extracting web data (https://www.excelbanter.com/excel-programming/444604-re-extracting-web-data.html)

ron

Extracting web data
 
Peter...This will do most of the work. But separating the city from the address is problematical. If the city is more than one word, this macro won't handle it correctly. I'm out of time so see how this works for you...Ron

Sub Webpage()
' Retrieve the source code for the first page
J = 1
my_url = "http://content.usatoday.com/news/nation/environment/smokestack/search/FL/~/~/~/rank/~/" & J & "/%5B/url%5D"
Set my_obj = CreateObject("MSXML2.XMLHTTP")
my_obj.Open "GET", my_url, False
my_obj.send
my_var = my_obj.responsetext
Set my_obj = Nothing

' Determine the number of pages to examine
pos_1 = InStr(1, my_var, "search-intro", vbTextCompare)
pos_2 = InStr(pos_1, my_var, "", vbTextCompare)
pos_3 = InStr(1 + pos_2, my_var, "", vbTextCompare)
pos_4 = InStr(pos_3, my_var, "Schools", vbTextCompare)
no_pages = (Mid(my_var, 1 + pos_3, -1 + pos_4 - (1 + pos_3)))
no_pages = Replace(no_pages, ",", "", 1, -1, vbTextCompare)
no_pages = Val(no_pages)
If (no_pages Mod 10) = 0 Then
no_pages = no_pages / 10
Else
no_pages = 1 + Int(no_pages / 10)
End If

' Begin iteration
For J = 1 To no_pages
my_url = "http://content.usatoday.com/news/nation/environment/smokestack/search/FL/~/~/~/rank/~/" & J & "/%5B/url%5D"
Set my_obj = CreateObject("MSXML2.XMLHTTP")
my_obj.Open "GET", my_url, False
my_obj.send
my_var = my_obj.responsetext
Set my_obj = Nothing

' Extract data
yy = 1
Do Until yy = 0
pos_5 = InStr(yy, my_var, "smokestack/school", vbTextCompare)
pos_6 = InStr(pos_5, my_var, "", vbTextCompare)
pos_7 = InStr(pos_6, my_var, "<", vbTextCompare)
sc_name = Mid(my_var, 1 + pos_6, pos_7 - (1 + pos_6))

pos_10 = InStr(pos_7, my_var, "p", vbTextCompare)
pos_11 = InStr(pos_10, my_var, "</p", vbTextCompare)
addr_city_state = Mid(my_var, 2 + pos_10, pos_11 - (2 + pos_10))
' extract the two letter state designation
sc_state = Right(addr_city_state, 2)
' separate the address and the city
jj = InStrRev(addr_city_state, ",", -1, vbTextCompare)
kk = InStrRev(addr_city_state, " ", jj, vbTextCompare)
sc_address = Left(addr_city_state, kk - 1)
sc_city = Trim(Mid(addr_city_state, kk, jj - kk))

pos_15 = InStrRev(my_var, "air is worse", pos_5, vbTextCompare)
pos_16 = InStr(pos_15, my_var, "at", vbTextCompare)
pos_17 = InStr(pos_16, my_var, "school", vbTextCompare)
sc_rank = Mid(my_var, 3 + pos_16, -1 + pos_17 - (3 + pos_16))

yy = InStr(pos_7, my_var, "smokestack/school", vbTextCompare)

' Put the current school data into the workbook
ActiveCell = sc_name
ActiveCell.Offset(0, 1) = sc_address
ActiveCell.Offset(0, 2) = sc_city
ActiveCell.Offset(0, 3) = sc_state
ActiveCell.Offset(0, 4) = sc_rank

ActiveCell.Offset(1, 0).Select
Loop
Next

Range("A:E").Select
Selection.Columns.AutoFit
Range("A1").Select
End Sub

PeterSo

Extracting web data
 
On May 19, 5:44*pm, ron wrote:
Peter...This will do most of the work. *But separating the city from the address is problematical. *If the city is more than one word, this macro won't handle it correctly. *I'm out of time so see how this works for you...Ron

Sub Webpage()
' Retrieve the source code for the first page
* * J = 1
* * my_url = "http://content.usatoday.com/news/nation/environment/smokestack/search..." & J & "/%5B/url%5D"
* * Set my_obj = CreateObject("MSXML2.XMLHTTP")
* * my_obj.Open "GET", my_url, False
* * my_obj.send
* * my_var = my_obj.responsetext
* * Set my_obj = Nothing

' Determine the number of pages to examine
* * pos_1 = InStr(1, my_var, "search-intro", vbTextCompare)
* * pos_2 = InStr(pos_1, my_var, "", vbTextCompare)
* * pos_3 = InStr(1 + pos_2, my_var, "", vbTextCompare)
* * pos_4 = InStr(pos_3, my_var, "Schools", vbTextCompare)
* * no_pages = (Mid(my_var, 1 + pos_3, -1 + pos_4 - (1 + pos_3)))
* * no_pages = Replace(no_pages, ",", "", 1, -1, vbTextCompare)
* * no_pages = Val(no_pages)
* * If (no_pages Mod 10) = 0 Then
* * * * no_pages = no_pages / 10
* * Else
* * * * no_pages = 1 + Int(no_pages / 10)
* * End If

' Begin iteration
* * For J = 1 To no_pages
* * * * my_url = "http://content.usatoday.com/news/nation/environment/smokestack/search..." & J & "/%5B/url%5D"
* * * * Set my_obj = CreateObject("MSXML2.XMLHTTP")
* * * * my_obj.Open "GET", my_url, False
* * * * my_obj.send
* * * * my_var = my_obj.responsetext
* * * * Set my_obj = Nothing

' Extractdata
* * * * yy = 1
* * * * Do Until yy = 0
* * * * * * pos_5 = InStr(yy, my_var, "smokestack/school", vbTextCompare)
* * * * * * pos_6 = InStr(pos_5, my_var, "", vbTextCompare)
* * * * * * pos_7 = InStr(pos_6, my_var, "<", vbTextCompare)
* * * * * * sc_name = Mid(my_var, 1 + pos_6, pos_7 - (1 + pos_6))

* * * * * * pos_10 = InStr(pos_7, my_var, "p", vbTextCompare)
* * * * * * pos_11 = InStr(pos_10, my_var, "</p", vbTextCompare)
* * * * * * addr_city_state = Mid(my_var, 2 + pos_10, pos_11 - (2 + pos_10))
* * * * * * ' extract the two letter state designation
* * * * * * sc_state = Right(addr_city_state, 2)
* * * * * * ' separate the address and the city
* * * * * * jj = InStrRev(addr_city_state, ",", -1, vbTextCompare)
* * * * * * kk = InStrRev(addr_city_state, " ", jj, vbTextCompare)
* * * * * * sc_address = Left(addr_city_state, kk - 1)
* * * * * * sc_city = Trim(Mid(addr_city_state, kk, jj - kk))

* * * * * * pos_15 = InStrRev(my_var, "air is worse", pos_5, vbTextCompare)
* * * * * * pos_16 = InStr(pos_15, my_var, "at", vbTextCompare)
* * * * * * pos_17 = InStr(pos_16, my_var, "school", vbTextCompare)
* * * * * * sc_rank = Mid(my_var, 3 + pos_16, -1 + pos_17 - (3 + pos_16))

* * * * * * yy = InStr(pos_7, my_var, "smokestack/school", vbTextCompare)

' Put the current schooldatainto the workbook
* * * * * * ActiveCell = sc_name
* * * * * * ActiveCell.Offset(0, 1) = sc_address
* * * * * * ActiveCell.Offset(0, 2) = sc_city
* * * * * * ActiveCell.Offset(0, 3) = sc_state
* * * * * * ActiveCell.Offset(0, 4) = sc_rank

* * * * * * ActiveCell.Offset(1, 0).Select
* * * * Loop
* * Next

* * Range("A:E").Select
* * Selection.Columns.AutoFit
* * Range("A1").Select
End Sub


Ron:
Great. I will look into if there is a way to separate the cities with
more than one word from the address. Again, thank you very much for
your help. The weekend looks better already :)

Peter


All times are GMT +1. The time now is 04:55 PM.

Powered by vBulletin® Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
ExcelBanter.com