  #1
Junior Member
Posts: 8
Scraping web data

I want to download data which is spread out over several web addresses (www.website.com/id=1 etc), which cannot be exported to CSV - is there a way to download this data over several thousand URLs into Excel automatically overnight? I can use a regular web query to download the data from one page but not from more than one because it will time out.

Can I use the InternetExplorer.Application method?
  #2
Posted to microsoft.public.excel.programming
Posts: 279
Scraping web data

In message of Sun, 9 Nov 2014 22:04:48 in microsoft.public.excel.programming, davidman writes:

I want to download data which is spread out over several web addresses
(www.website.com/id=1 etc), which cannot be exported to CSV - is there a
way to download this data over several thousand URLs into Excel
automatically overnight? I can use a regular web query to download the
data from one page but not from more than one because it will time out.

Can I use the InternetExplorer.Application method?


I have been doing this for some time. I find that I need to take a snapshot of a page to work out what I need to do with it. I tried doing 10 interactions in parallel, but my old laptop and Internet connection did not have enough power.

The following may be helpful.
1) Code to write significant data from the current page to a file.
Public Sub Crawl(ByVal o As Object, Optional ByVal t As String = "") ' TypeName(o) varies
' Sub Crawl(ByVal T As String, ByVal O As Object)
' When T is empty, this recursive subroutine writes a description of
' an open web page passed in O to snapshot.txt in the default folder.
' T is not empty for a recursive call
    Dim c As Object ' DispDOMChildrenCollection results in error 13 on assignment
    Static f As Long ' File number used to reference snapshot.txt
    Dim i As Long ' Loop counter
    Dim m As Object ' Synonym of c.Item(i)
    Dim s As String ' Synonym of c.Item(i).innerHTML

    ChDir ActiveWorkbook.Path ' Put output in current directory
    Set c = o.childNodes
    If t = "" Then
        f = FreeFile
        Open "snapshot.txt" For Output As #f
        On Error Resume Next ' Ignore errors: o.URL fails when o is not a document
        Print #f, IIf(o.URL <> "", o.URL, "Not a document")
        On Error GoTo 0
        i = 0
        For Each m In c
            ' Debug.Print "Iouter = " & i
            Crawl m, t & " " & i: i = i + 1
        Next
        Close #f
        Exit Sub
    End If

    For Each m In c
        If m.nodeName = "#text" Then
            Print #f, t & ", " & i & (" value=""" & m.nodeValue & """")
        Else
            s = m.innerHTML
            ' Debug.Assert InStr(s, "LineCloseInfo") = 0
            If InStr(s, "LineCloseInfo") <> 0 Then
                ' Debug.Print "T = " & t & "; I = " & i
                ' Stop
            End If
            Print #f, t & ", " & i; " id=" & m.ID & " innerHTML=""" & s & """"
            ' Print #f, t & ", " & i; " id=" & m.ID & " innerText=""" & s & """"
            ' s = m.innerText
            If s <> "" Then Crawl m, t & ", " & i
        End If
        i = i + 1
    Next m
End Sub
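
To show how Crawl is driven, here is a minimal sketch (IE and Doc are the public variables declared below; the URL is only a placeholder):

Sub SnapshotOnePage()
    Set IE = New SHDocVw.InternetExplorer
    IE.Visible = True
    IE.navigate "http://www.example.com/id=1"
    Do While IE.Busy Or IE.readyState <> READYSTATE_COMPLETE
        DoEvents ' wait for the page to finish loading
    Loop
    Set Doc = IE.Document
    Crawl Doc ' writes snapshot.txt next to the active workbook
End Sub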

2) It is useful to navigate a tree, when there is no obvious named
access to the relevant data.
Public Function GetBranch(ByVal o As Object, ParamArray Path() As Variant) As Object
' Doc is HTMLDocument, Doc.Body is HTMLBody
    Dim i As Integer
    Dim t As Object
    Dim v As Variant

    Set t = o.childNodes
    ' Stop
    For Each v In Path
        ' Debug.Print "i = " & i & ", v = " & v & ", t.Length = " & t.Length & ", TypeName(t.Item(v)) = " & TypeName(t.Item(v)): Stop
        ' On Error Resume Next: Debug.Print "t.Item(" & v & ").innerText = """ & t.Item(v).innerText & """": On Error GoTo 0
        Set t = t.Item(v).childNodes
        i = i + 1
    Next v
    ' Stop
    Set GetBranch = t
End Function

Typical uses are
Dim SearchResult As String
SearchResult = GetBranch(Doc, 1, 1, 1, 0, 0, 4, 0, 1, 0, 1, 0, 13, 3, 0, 0).Item(54).href

and
Click GetBranch(Doc, 1, 1, 2, 5).Item(1) ' Grab Text Version
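
Click is a small helper of my own; a minimal sketch (assuming the target element exposes the standard DOM click method) is:

Public Sub Click(ByVal e As Object)
    e.Click ' fire the element's DOM click, as for anchors and buttons
End Sub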

I use the following public variables:
Public Doc As HTMLDocument ' IE.Document; needs Tools/References/HTML Object Library
Public IE As SHDocVw.InternetExplorer ' Needs Tools/References/Microsoft Internet Controls

I find the following useful:
Public Sub TxURL(ByVal URL1 As String)
I am sure the OP can Google for appropriate code and report any problems.
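
For what it is worth, a minimal sketch of such a routine (the usual navigate-and-wait pattern; not necessarily what I run) is:

Public Sub TxURL(ByVal URL1 As String)
    IE.navigate URL1
    Do While IE.Busy Or IE.readyState <> READYSTATE_COMPLETE
        DoEvents ' yield so Excel stays responsive while the page loads
    Loop
    Set Doc = IE.Document ' refresh the public document reference
End Sub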

HTH. ;)
--
Walter Briscoe
  #3
Junior Member
Posts: 8

Sorry Walter, I didn't understand any of that (I struggle to get my head around IE automation at the best of times).

I've looked at the source code, and it turns out I need to go to each URL and copy everything into an Excel sheet (for ID numbers 1 to 1000), because it's a JavaScript table which looks up the data on the internal server.
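
Something like this loop is what I'm after (a sketch only; the URL is a placeholder, and the fixed two-second wait for the JavaScript table to populate is a guess):

Sub ScrapeAllIDs()
    Dim IE As Object, id As Long
    Set IE = CreateObject("InternetExplorer.Application")
    For id = 1 To 1000
        IE.navigate "http://www.website.com/id=" & id
        Do While IE.Busy Or IE.readyState <> 4 ' 4 = READYSTATE_COMPLETE
            DoEvents
        Loop
        Application.Wait Now + TimeSerial(0, 0, 2) ' let the JavaScript table fill in
        Sheets(1).Cells(id, 1).Value = IE.Document.body.innerText ' dump rendered text
    Next id
    IE.Quit
End Sub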
Copyright ©2004-2024 ExcelBanter.