Home |
Search |
Today's Posts |
#1
Posted to microsoft.public.excel.programming
|
|||
|
|||
HTML Document.Links Issues
Thanks to dicks-blog.com I've just recently managed to find out how to access
web pages and bring the data back to Excel (2000) VBA even when it's not in a table. The possibilities are very exciting. But I have run into two vexing problems. 1) The same code that retrieves the collection of links on a page -- essentially, Document.Links -- works with two homepages and fails on a third. The links on the third look the same to me -- static links with plain text anchors. 2) One of my tasks is to list out the links in a sort of tree structure. It would make sense to do this recursively, up to a user-specified number of levels deep. But I started with a non-recursive model, figuring it would be easier to debug. Moving to the second level, I load the page that the first link points to, and grab ITS link collection, storing that in a different variable. I process the page and come back to the first level. At that point, trying to access the links collection so I can process the next link gives me an Error 70, Permission Denied. This happens even if I reload the first-level page and then try to re-retrieve the collection. Since I've already stored the links collection in a variable, it shouldn't matter that I've left the page where I got it from, should it? And why can't I re-retrieve it even if I reload the page? Reloading is an abuse of bandwidth and would make the macro take much longer to run, so I really want to preserve the loaded links collection and just keep referring to it. TIA, Gregg Roberts |
#2
Posted to microsoft.public.excel.programming
|
|||
|
|||
HTML Document.Links Issues
Gergg,
Hard to tell exactly what the problem is without seeing any code. Are you automating IE to do your work ? Tim "Gregg Roberts" wrote in message ... Thanks to dicks-blog.com I've just recently managed to find out how to access web pages and bring the data back to Excel (2000) VBA even when it's not in a table. The possibilities are very exciting. But I have run into two vexing problems. 1) The same code that retrieves the collection of links on a page -- essentially, Document.Links -- works with two homepages and fails on a third. The links on the third look the same to me -- static links with plain text anchors. 2) One of my tasks is to list out the links in a sort of tree structure. It would make sense to do this recursively, up to a user-specified number of levels deep. But I started with a non-recursive model, figuring it would be easier to debug. Moving to the second level, I load the page that the first link points to, and grab ITS link collection, storing that in a different variable. I process the page and come back to the first level. At that point, trying to access the links collection so I can process the next link gives me an Error 70, Permission Denied. This happens even if I reload the first-level page and then try to re-retrieve the collection. Since I've already stored the links collection in a variable, it shouldn't matter that I've left the page where I got it from, should it? And why can't I re-retrieve it even if I reload the page? Reloading is an abuse of bandwidth and would make the macro take much longer to run, so I really want to preserve the loaded links collection and just keep referring to it. TIA, Gregg Roberts |
#3
Posted to microsoft.public.excel.programming
|
|||
|
|||
HTML Document.Links Issues
Hard to tell exactly what the problem is without seeing any code.
Are you automating IE to do your work ? Yes. I solved the first problem, which was occurring because it was a frames site. Here's the relevant code for the second problem: IeApp.Navigate LinkURL Do Loop Until IeApp.ReadyState = READYSTATE_COMPLETE Set IEDoc = IeApp.Document <snip Set LinkTags1 = IEDoc.links <snip CurRow = 3 CurCol = 1 MsgBox IsEmpty(LinkTags1) ' <-- This comes back False MsgBox LinkTags1.Length ' <-- Error 70 occurs here on the SECOND iteration... For LinkIndex1 = 0 To LinkTags1.Length - 1 ' <-- ...and here. CurRow = CurRow + 1 LinkURL = LinkTags1(LinkIndex1).href If LinkURL < "" Then ' An A tag with blank href is a bookmark, so ignore Range(Cells(CurRow, CurCol), Cells(CurRow, CurCol)).Value = _ LinkTags1(LinkIndex1).innerText Range(Cells(CurRow, CurCol + 1), Cells(CurRow, CurCol + 1)).Value = _ LinkURL ' if unique and internal, then get links at next level down If URL_Is_Unique(LinkURL) And URL_Is_Internal(LinkURL) Then IeApp.Navigate LinkURL Do Loop Until IeApp.ReadyState = READYSTATE_COMPLETE Set IEDoc = IeApp.Document Set LinkTags2 = IEDoc.links <snip ... process second-level page and return to top |
#4
Posted to microsoft.public.excel.programming
|
|||
|
|||
HTML Document.Links Issues
I'm not certain you can still access the links collection if the page has
already been unloaded. You might have to copy the links into a collection, dictionary or array while the page is still loaded, and only then loop through them to get the sub-links. Tim "Gregg Roberts" wrote in message ... Hard to tell exactly what the problem is without seeing any code. Are you automating IE to do your work ? Yes. I solved the first problem, which was occurring because it was a frames site. Here's the relevant code for the second problem: IeApp.Navigate LinkURL Do Loop Until IeApp.ReadyState = READYSTATE_COMPLETE Set IEDoc = IeApp.Document <snip Set LinkTags1 = IEDoc.links <snip CurRow = 3 CurCol = 1 MsgBox IsEmpty(LinkTags1) ' <-- This comes back False MsgBox LinkTags1.Length ' <-- Error 70 occurs here on the SECOND iteration... For LinkIndex1 = 0 To LinkTags1.Length - 1 ' <-- ...and here. CurRow = CurRow + 1 LinkURL = LinkTags1(LinkIndex1).href If LinkURL < "" Then ' An A tag with blank href is a bookmark, so ignore Range(Cells(CurRow, CurCol), Cells(CurRow, CurCol)).Value = _ LinkTags1(LinkIndex1).innerText Range(Cells(CurRow, CurCol + 1), Cells(CurRow, CurCol + 1)).Value = _ LinkURL ' if unique and internal, then get links at next level down If URL_Is_Unique(LinkURL) And URL_Is_Internal(LinkURL) Then IeApp.Navigate LinkURL Do Loop Until IeApp.ReadyState = READYSTATE_COMPLETE Set IEDoc = IeApp.Document Set LinkTags2 = IEDoc.links <snip ... process second-level page and return to top |
#5
Posted to microsoft.public.excel.programming
|
|||
|
|||
HTML Document.Links Issues
Tim,
This is awesome! Thank you so much. Since you've been so helpful, I'm going to ask you a few more questions. Obviously you're free to say "enough" at any point. I think these additional questions still fit my subject line... 1. Are there other ReadyStates that I should handle besides READYSTATE_COMPLETE? What if an OK/Cancel prompt appears upon navigation, such as "You are about to visit a page that has unsecure items..." Will that give ReadyState = READYSTATE_INTERACTIVE? What about a 404 error -- will that still give READYSTATE_COMPLETE? 2. What if a link href is to a non-HTML file such as a .WMV, .DOC, or .PDF? I don't want the macro to try to visit such links, even though it should not cause an error on my machine. Short of building and maintaining a list of what file types to visit or not, how do I determine whether to navigate to the URL? Parsing the file extension is not straightforward because of active pages, which can contain spurious "file extension" matches. I really want this code to be robust and there's a lot to think about. Gregg Roberts P.S. I was already saving the links collection to a variable, LinkTags1 -- not just accessing it via IEDoc.Links. However, if I navigate to a new page without setting IEDoc = IEApp.Document, why would IEDoc.Links be affected at all -- much less be made *inaccessible*? Remember, the error message was Permission Denied, not Object Required, Object Does Not Exist, or some more generic error. P.P.S. Where can I read more about how all this works? The MSDN site seems to refer to a lot of environments I am not using, hence the examples don't work. |
#6
Posted to microsoft.public.excel.programming
|
|||
|
|||
HTML Document.Links Issues
Gregg,
I'm at work right now so can't give you any comprehensive advice on how to proceed. It's enough to say that what you're trying to do isn't very simple if you want to be able to handle all possible outcomes. You can easily check what happens with a 404 just by starting with a non-existent URL. I haven't tried with pages which pop up a prompt so couln't say what will happen other than your routine will hang at this point. File extensions: very tricky given the number of possible values. You could decide only to follow htm, html, shtml, etc and ignore any others, but as you note it's not always possible to tell from looking at the URL what will be returned. As you note, it shouldn't cause an error, but not much point in following refs which will not have links. Re your saving of the links collection: you have to actually copy the links to another object: by just setting a variable to point to the links collection you are just creating a *reference* to the collection, not making a copy of it. Since the collection only exists in the context of the loaded page, you will lose your connection when you reload the page. "Access denied" is the typical response to cross-domain access violations so IE may have interpreted your actions in this context. Tim -- Tim Williams Palo Alto, CA "Gregg Roberts" wrote in message ... Tim, This is awesome! Thank you so much. Since you've been so helpful, I'm going to ask you a few more questions. Obviously you're free to say "enough" at any point. I think these additional questions still fit my subject line... 1. Are there other ReadyStates that I should handle besides READYSTATE_COMPLETE? What if an OK/Cancel prompt appears upon navigation, such as "You are about to visit a page that has unsecure items..." Will that give ReadyState = READYSTATE_INTERACTIVE? What about a 404 error -- will that still give READYSTATE_COMPLETE? 2. What if a link href is to a non-HTML file such as a .WMV, .DOC, or ..PDF? I don't want the macro to try to visit such links, even though it should not cause an error on my machine. Short of building and maintaining a list of what file types to visit or not, how do I determine whether to navigate to the URL? Parsing the file extension is not straightforward because of active pages, which can contain spurious "file extension" matches. I really want this code to be robust and there's a lot to think about. Gregg Roberts P.S. I was already saving the links collection to a variable, LinkTags1 -- not just accessing it via IEDoc.Links. However, if I navigate to a new page without setting IEDoc = IEApp.Document, why would IEDoc.Links be affected at all -- much less be made *inaccessible*? Remember, the error message was Permission Denied, not Object Required, Object Does Not Exist, or some more generic error. P.P.S. Where can I read more about how all this works? The MSDN site seems to refer to a lot of environments I am not using, hence the examples don't work. |
#7
Posted to microsoft.public.excel.programming
|
|||
|
|||
HTML Document.Links Issues
You can easily check what happens with a 404 just by starting with a
non-existent URL Duh. Good idea. I could do the same with a page that always results in a prompt, if I can find one. by just setting a variable to point to the links collection you are just creating a *reference* to the collection, not making a copy of it. Since the collection only exists in the context of the loaded page, you will lose your connection when you reload the page. I thought that might be it. Thanks for all the help. Gregg Roberts |
Reply |
Thread Tools | Search this Thread |
Display Modes | |
|
|
Similar Threads | ||||
Thread | Forum | |||
links issues | Excel Discussion (Misc queries) | |||
Opening Excel Document in ASP.NET and having issues...please help | Excel Programming | |||
Update links - hyperlink issues | Excel Discussion (Misc queries) | |||
Links in html-file | Excel Discussion (Misc queries) | |||
Get a value from a table in an HTML document | Excel Programming |