Posted to microsoft.public.excel.programming
From: Tim Williams
Subject: Recursively scraping web pages for embedded links and files

Starting from the main page you could identify all of the "folder" links by
looking at their URLs: each can be followed to drill down into its
subfolders, each of those listed and followed in the same way, and so on
recursively - see the sketch below.
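
Something along these lines might get you started - a minimal sketch,
late-bound so no extra references are needed. The "rtsa-bin" and
"javascript:openDocument" tests are guesses based on your sample URLs, so
adjust them to whatever reliably distinguishes folder pages from document
links on your site; note there's no guard against visiting the same folder
twice.

Sub ScrapeTree()
    Dim results As New Collection
    'starting URL taken from your "top of tree" sample below
    ScrapeFolder "http://ourserver.com/rtsa-bin/PermaS...", results
    'results now holds "href|linktext" pairs - write them out as needed
End Sub

Sub ScrapeFolder(url As String, results As Collection)
    Dim http As Object, doc As Object, a As Object

    'fetch the page without automating a browser
    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", url, False
    http.send

    'parse the response with an offline HTML document
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = http.responseText

    For Each a In doc.links
        If InStr(a.href, "javascript:openDocument") > 0 Then
            'a document link: record it for later processing
            results.Add a.href & "|" & a.innerText
        ElseIf InStr(a.href, "rtsa-bin") > 0 Then
            'looks like another folder page: drill down into it
            '(relative hrefs may not resolve against the right base
            'in a detached document - prepend the server if needed)
            ScrapeFolder a.href, results
        End If
    Next a
End Sub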

Grabbing the URLs of the files themselves will be more difficult: you'll
have to deconstruct the "openDocument()" javascript code to see how it
determines which URL to open. You can't use the javascript href directly in
Excel: it depends on having the js function available on the page.
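
If it turns out the function just embeds that id in a fixed URL pattern,
you could pull the id out of the href yourself and rebuild the address. A
sketch of the extraction step only - the id-to-URL mapping is whatever
openDocument() actually does, which you'll have to read from the page
source:

Function ExtractDocId(jsHref As String) As String
    'pulls the quoted argument out of e.g.
    '  javascript:openDocument('0900043d802b3528');
    Dim p1 As Long, p2 As Long
    p1 = InStr(jsHref, "('")
    p2 = InStr(jsHref, "')")
    If p1 > 0 And p2 > p1 Then
        ExtractDocId = Mid$(jsHref, p1 + 2, p2 - p1 - 2)
    End If
End Function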

If you're new to working with HTML documents from Excel then it may be a
long haul. I can help you with specific points, but I can't provide a
complete solution. If you prefer, you can follow up via email (tim j
williams at gmail dot com: no spaces, etc.).

Tim


"Ker_01" wrote in message
...
This is a follow-up to a post from yesterday (thanks to Tim Williams for
responding). I have more information now, and felt it warranted a second
try to see if there is a way to do this now that we've gotten the documents
exposed via the web interface. Using XL2003 on WinXP.

We have a corporate web application that exposes various documents in
multiple levels of subdirectories. I believe these are stored in a
database, but they are now directly accessible via web links through the
application, so where they come from hopefully doesn't affect what I am
trying to accomplish.

Starting from the main page of the web application, I need to scrape the
entire directory tree and capture some of the details (javascript links to
.doc and .pdf files that can be opened through IE6 via 'dedicated' URLs
for each document). I'm sure I'll have more questions once I start
dissecting the HTML, but for starters I need to understand how to scrape
multiple levels of a website's directory tree at all. I've copied in some
of the URLs (changed slightly for corporate security) to give a sense of
what I'm working with.

Top of tree:
http://ourserver.com/rtsa-bin/PermaS...=M%20S%20-%20L

I can click a link to go to the next level of subfolder:
http://ourserver.com/rtsa-bin/PermaS...ne&pagetitle=M

Third level of folder:
http://ourserver.com/rtsa-bin/PermaS...ec&pagetitle=M

and so on.

A sample link for a single document within one of the pages in the web
tree/directory is:
javascript:openDocument('0900043d802b3528');

where clicking that link ultimately opens:
http://ourserver.com/Documentation/03451TRs142.pdf

Ultimately I need to recreate all the links in an Excel workbook so that
users can click a hyperlink and access the relevant document. An Excel
hyperlink that uses the javascript:openDocument command would be totally
fine with me, but first I need to collect them all. Alternatively, I'll
have to figure out how to cycle through each javascript command anyway and
then identify the URL it opened (which sounds harder).
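
For the final step - writing the collected links out as clickable
hyperlinks - a minimal sketch, assuming each javascript id has already been
resolved to a direct http URL as Tim describes above, and that the links
are held as "url|linktext" strings like those built in the earlier sketch:

Sub DumpLinks(results As Collection)
    Dim i As Long, parts() As String
    For i = 1 To results.Count
        parts = Split(results(i), "|")
        ActiveSheet.Hyperlinks.Add _
            Anchor:=ActiveSheet.Cells(i, 1), _
            Address:=parts(0), _
            TextToDisplay:=parts(1)
    Next i
End Sub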

Any advice or code snippets would be greatly appreciated - I haven't done
anything with HTML at all.

Thanks,
Keith