Subject: Scraping web data

In message of Sun, 9 Nov 2014 22:04:48 in microsoft.public.excel.programming, davidman writes

>I want to download data which is spread out over several web addresses
>(www.website.com/id=1 etc), which cannot be exported to CSV - is there a
>way to download this data over several thousand URLs into Excel
>automatically overnight? I can use a regular web query to download the
>data from one page but not from more than one because it will time out.
>
>Can I use the InternetExplorer.Application method?


I have been doing this for some time.
I find that I need to take a snapshot of a page to work out what I need to do with
it. I tried doing 10 interactions in parallel, but my old laptop and Internet
connection did not have enough power.

The following may be helpful.
1) Code to write significant data from the current page to a file.
Public Sub Crawl(ByVal o As Object, Optional ByVal t As String = "") ' TypeName(o) varies
' When t is empty, this recursive subroutine writes a description of
' an open web page passed in o to snapshot.txt in the default folder.
' t is not empty for a recursive call.
    Dim c As Object  ' DispDOMChildrenCollection results in error 13 on assignment
    Static f As Long ' File number used to reference snapshot.txt
    Dim i As Long    ' Loop counter
    Dim m As Object  ' Synonym of c.Item(i)
    Dim s As String  ' Synonym of c.Item(i).innerHTML

    ChDir ActiveWorkbook.Path ' Put output in the active workbook's folder
    Set c = o.childNodes
    If t = "" Then ' Top-level call: open the file, recurse, then close
        f = FreeFile
        Open "snapshot.txt" For Output As #f
        On Error Resume Next ' o has no URL property unless it is a document
        Print #f, IIf(o.URL <> "", o.URL, "Not a document")
        On Error GoTo 0 ' Restore normal error handling
        i = 0
        For Each m In c
            Crawl m, t & " " & i: i = i + 1
        Next
        Close #f
        Exit Sub
    End If

    For Each m In c
        If m.nodeName = "#text" Then
            Print #f, t & ", " & i & " value=""" & m.nodeValue & """"
        Else
            s = m.innerHTML
            Print #f, t & ", " & i; " id=" & m.ID & " innerHTML=""" & s & """"
            If s <> "" Then Crawl m, t & ", " & i ' Recurse into non-empty elements
        End If
        i = i + 1
    Next m
End Sub
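
A single call then writes the whole tree of the current page, e.g. (assuming the public Doc declared below has been set after navigation):

Crawl Doc ' Writes snapshot.txt in the active workbook's folder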

2) It is useful to be able to navigate the document tree when there is no
obvious named access to the relevant data; the chain of child indices at the
start of each snapshot.txt line gives the path to pass to GetBranch.
Public Function GetBranch(ByVal o As Object, ParamArray Path() As Variant) As Object
' Descends o's childNodes along the indices in Path
' e.g. Doc is HTMLDocument, Doc.Body is HTMLBody
    Dim t As Object
    Dim v As Variant

    Set t = o.childNodes
    For Each v In Path
        Set t = t.Item(v).childNodes
    Next v
    Set GetBranch = t
End Function

Typical uses are
Dim SearchResult As String
SearchResult = GetBranch(Doc, 1, 1, 1, 0, 0, 4, 0, 1, 0, 1, 0, 13, 3, 0, 0).Item(54).href

and
Click GetBranch(Doc, 1, 1, 2, 5).Item(1) ' Grab Text Version

I use the following public variables:
Public Doc As HTMLDocument ' IE.Document; needs Tools/References/HTML Object Library
Public IE As SHDocVw.InternetExplorer ' Needs Tools/References/Microsoft Internet Controls
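
They might be initialised along these lines (a sketch, with error handling omitted):

Set IE = New SHDocVw.InternetExplorer
IE.Visible = False ' No window is needed for an unattended overnight run

Doc needs setting again after each navigation, as the TxURL sketch below does.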

I also find a navigation routine with the following signature useful:
Public Sub TxURL(ByVal URL1 As String)
I am sure the OP can Google for appropriate code and report any problems.
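
That said, a minimal sketch of the idea, together with a hypothetical driver
for the overnight run (ScrapeAll and the id range 1 to 5000 are invented;
the URL pattern comes from the question):

Public Sub TxURL(ByVal URL1 As String)
    IE.navigate URL1
    Do While IE.Busy Or IE.readyState <> 4 ' 4 = READYSTATE_COMPLETE
        DoEvents
    Loop
    Set Doc = IE.Document ' Refresh the public Doc for Crawl/GetBranch
End Sub

Public Sub ScrapeAll() ' Assumes IE was created as above
    Dim id As Long
    For id = 1 To 5000 ' Adjust to the real range of ids
        TxURL "http://www.website.com/id=" & id
        ' Extract the wanted values with GetBranch and write them to a sheet
    Next id
    IE.Quit
End Sub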

HTH. ;)
--
Walter Briscoe