Posted to microsoft.public.excel.programming
Recursively scraping web pages for embedded links and files

This is a follow-up to a post from yesterday (thanks to Tim Williams for
responding). I have more information now, and felt it warranted a second try
to see if there is a way to do this now that we've gotten the documents
exposed via the web interface. I'm using XL2003 on WinXP.

We have a corporate web application that exposes various documents in
multiple levels of subdirectories. My belief is that these are stored in a
database, but now they are directly accessible via web links through this
web application, so where they come from hopefully doesn't affect what I am
trying to accomplish.

Starting from the main page of the web application, I need to scrape the
entire directory tree and capture some of the details (javascript links to
.doc and .pdf files that can be opened through IE6 via 'dedicated' URLs for
each document). I'm sure I'll have more questions once I start dissecting
the HTML, but for starters I need to understand how to even scrape multiple
levels within the directory tree of a website. I've copied in some of the
URLs (changed slightly for corporate security) to give a sense of what I'm
working with.

Top of tree:
http://ourserver.com/rtsa-bin/PermaS...=M%20S%20-%20L

I can click a link to go to the next level of subfolder:
http://ourserver.com/rtsa-bin/PermaS...ne&pagetitle=M

Third level of folder:
http://ourserver.com/rtsa-bin/PermaS...ec&pagetitle=M

and so on.
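
To make the "multiple levels" part concrete, here is a minimal sketch of the kind of walk I have in mind. It is written in Python rather than Excel VBA purely for illustration, and it rests on assumptions: that folder links are ordinary href links on the same server, and that document links are the javascript: ones. The names LinkCollector, crawl, fetch and max_depth are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on one page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch, max_depth=5):
    """Walk folder pages depth-first; return (page_url, javascript_href) pairs.

    `fetch` is any callable that takes a URL and returns that page's HTML
    (hypothetical hook, so the walk can be tried without touching a server).
    """
    host = urlparse(start_url).netloc
    seen = set()
    documents = []

    def visit(url, depth):
        if url in seen or depth > max_depth:
            return
        seen.add(url)
        parser = LinkCollector()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            if href.lower().startswith("javascript:"):
                # Assumed to be a document link; record it with its page.
                documents.append((url, href))
                continue
            target = urljoin(url, href)
            # Only follow links that stay on the starting server.
            if urlparse(target).netloc == host:
                visit(target, depth + 1)

    visit(start_url, 0)
    return documents
```

Against the real site, fetch could be something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")`, assuming the application does not need a login session. If the folder links themselves turn out to be javascript calls, this sketch would not follow them.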

A sample link for a single document within one of the pages in the web
tree/directory is:
javascript:openDocument('0900043d802b3528');

where clicking that link ultimately opens:
http://ourserver.com/Documentation/03451TRs142.pdf
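
As a rough illustration of pulling the identifier out of such a link, here is a small Python sketch (Python only for illustration, not VBA). The pattern assumes every document link has the same openDocument('...') shape as the sample above; the names OPEN_DOC and document_id are made up. How the id maps to the final .pdf URL is not something I know yet, so the sketch does not attempt it.

```python
import re

# Matches javascript:openDocument('<id>') and captures the id.
OPEN_DOC = re.compile(
    r"""javascript:\s*openDocument\(\s*['"]([^'"]+)['"]\s*\)""",
    re.IGNORECASE,
)


def document_id(href):
    """Return the id passed to openDocument, or None for any other link."""
    match = OPEN_DOC.search(href)
    return match.group(1) if match else None


print(document_id("javascript:openDocument('0900043d802b3528');"))
```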

Ultimately I need to recreate all the links in an Excel workbook so users
can click on a hyperlink and access the relevant document. An Excel
hyperlink that uses the javascript:openDocument command is totally fine with
me, but first I need to collect them all. Alternatively I'll have to figure
out how to cycle through each javascript command anyway, then identify the
URL it opened (which sounds harder).
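
For the workbook end of this, one possible shape (again a Python sketch for illustration, not VBA) is to turn each collected link into an Excel HYPERLINK formula and write the formulas to a CSV that Excel can open. It assumes the final document URL is already known for each link; the helper names hyperlink_formula and to_csv are made up.

```python
import csv
import io


def hyperlink_formula(url, label):
    """Build an Excel =HYPERLINK("url","label") formula as text."""
    def esc(text):
        # Inside an Excel string literal, a double quote is doubled.
        return text.replace('"', '""')
    return '=HYPERLINK("{}","{}")'.format(esc(url), esc(label))


def to_csv(links):
    """Return CSV text with one formula per row for (url, label) pairs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for url, label in links:
        writer.writerow([hyperlink_formula(url, label)])
    return buffer.getvalue()


print(to_csv([("http://ourserver.com/Documentation/03451TRs142.pdf",
               "03451TRs142.pdf")]))
```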

Any advice or code snippets would be greatly appreciated; I haven't done
anything with HTML at all.

Thanks,
Keith


 