#1
Scraping web data
I want to download data which is spread out over several web addresses (www.website.com/id=1 etc) and which cannot be exported to CSV. Is there a way to download this data from several thousand URLs into Excel automatically overnight? I can use a regular web query to download the data from one page, but not from more than one, because it will time out.
Can I use the InternetExplorer.Application method?
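One common alternative to a web query is to fetch each page directly with MSXML2.XMLHTTP in a loop, which avoids the web-query timeout. The following is only a sketch under assumptions not in the thread: the URL pattern, the id range, and the sheet layout are all hypothetical, and it retrieves raw HTML without running any script on the page.

Sub DownloadAllPages()
' Hedged sketch: loops over hypothetical ids 1 to 1000 and stores each
' page's raw HTML in column B of the first worksheet.
    Dim http As Object
    Dim ws As Worksheet
    Dim id As Long

    Set http = CreateObject("MSXML2.XMLHTTP")
    Set ws = ThisWorkbook.Worksheets(1)

    For id = 1 To 1000 ' hypothetical range
        http.Open "GET", "http://www.website.com/id=" & id, False
        http.send
        If http.Status = 200 Then
            ws.Cells(id, 1).Value = id
            ws.Cells(id, 2).Value = http.responseText ' raw HTML; parse as needed
        End If
        DoEvents ' keep Excel responsive during a long overnight run
    Next id
End Sub

Note that this approach only sees the HTML the server sends; if the data is filled in afterwards by JavaScript, the response will not contain it and browser automation is needed instead.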
#2
Posted to microsoft.public.excel.programming
Scraping web data
In message of Sun, 9 Nov 2014 22:04:48 in microsoft.public.excel.programming, davidman writes:

> I want to download data which is spread out over several web addresses
> (www.website.com/id=1 etc), which cannot be exported to CSV - is there
> a way to download this data over several thousand URLs into Excel
> automatically overnight? I can use a regular web query to download the
> data from one page but not from more than one because it will time out.
> Can I use the InternetExplorer.Application method?

I have been doing this for some time. I find that I need to take a snapshot of a page to work out what I need to do with it. I tried doing 10 interactions in parallel, but my old laptop and Internet connection did not have enough power. The following may be helpful.

1) Code to write significant data from the current page to a file.

Public Sub Crawl(ByVal o As Object, Optional ByVal t As String = "")
' When t is empty, this recursive subroutine writes a description of
' an open web page passed in o to snapshot.txt in the default folder.
' t is not empty for a recursive call.
    Dim c As Object  ' DispDOMChildrenCollection results in error 13 on assignment
    Static f As Long ' File number used to reference snapshot.txt
    Dim i As Long    ' Loop counter
    Dim m As Object  ' Synonym of c.Item(i)
    Dim s As String  ' Synonym of c.Item(i).innerHTML

    ChDir ActiveWorkbook.Path ' Put output in current directory
    Set c = o.childNodes
    If t = "" Then
        f = FreeFile
        Open "snapshot.txt" For Output As #f
        On Error Resume Next ' Ignore errors
        Print #f, IIf(o.URL <> "", o.URL, "Not a document")
        On Error GoTo 0
        i = 0
        For Each m In c
            Crawl m, t & " " & i: i = i + 1
        Next
        Close #f
        Exit Sub
    End If
    For Each m In c
        If m.nodeName = "#text" Then
            Print #f, t & ", " & i & (" value=""" & m.nodeValue & """")
        Else
            s = m.innerHTML
            Print #f, t & ", " & i; " id=" & m.ID & " innerHTML=""" & s & """"
            If s <> "" Then Crawl m, t & ", " & i
        End If
        i = i + 1
    Next m
End Sub

2) It is useful to navigate a tree, when there is no obvious named access to the relevant data.
Public Function GetBranch(ByVal o As Object, ParamArray Path() As Variant) As Object
' Walks down the DOM tree along the child indexes in Path.
' Doc is HTMLDocument, Doc.Body is HTMLBody.
    Dim i As Integer
    Dim t As Object
    Dim v As Variant

    Set t = o.childNodes
    For Each v In Path
        Set t = t.Item(v).childNodes
        i = i + 1
    Next v
    Set GetBranch = t
End Function

Typical uses are

Dim SearchResult As String
SearchResult = GetBranch(Doc, 1, 1, 1, 0, 0, 4, 0, 1, 0, 1, 0, 13, 3, 0, 0).Item(54).href

and

Click GetBranch(Doc, 1, 1, 2, 5).Item(1) ' Grab Text Version

I use the following public variables:

Public Doc As HTMLDocument            ' IE.Document; needs Tools/References/HTML Object Library
Public IE As SHDocVw.InternetExplorer ' Needs Tools/References/Microsoft Internet Controls

I find the following useful:

Public Sub TxURL(ByVal URL1 As String)

I am sure the OP can Google for appropriate code and report any problems. HTH. ;)
--
Walter Briscoe
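The body of TxURL is not shown in the post above. A minimal sketch of what such a navigate-and-wait helper might look like, assuming it simply drives the public IE variable to a URL and refreshes Doc (this is a guess at the intent, not Walter's actual code), is:

Public Sub TxURL(ByVal URL1 As String)
' Hypothetical sketch: navigates IE to URL1, waits for the page to
' finish loading, then refreshes the public Doc reference.
    IE.navigate URL1
    Do While IE.Busy Or IE.readyState <> 4 ' 4 = READYSTATE_COMPLETE
        DoEvents
    Loop
    Set Doc = IE.Document
End Sub

With a helper like this, each page can be loaded with a single call (TxURL "http://www.website.com/id=1") before Crawl or GetBranch is used on Doc.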
#3
Sorry Walter, I didn't understand any of that (I struggle to get my head around IE automation at the best of times).
I've looked at the source code, and it turns out I need to go to each URL (for ID numbers 1 to 1000) and copy everything into an Excel sheet, because it's a JavaScript table which looks up the data on the internal server.