#1
Posted to microsoft.public.excel.programming
I'm parsing an HTML file. Originally I thought I only needed to capture all the links, and the following worked well in my particular application (sample HTML snippet pasted at the bottom of this post):

^<A HREF=.*

However, I've now found that I only need to capture and process certain links. The information that determines whether a link needs to be processed is buried between that link and the next link (or EOF), so I need to capture a larger (multiline) section of text and test each one to see if it contains my identifier. It appears that I'm safe using the </TR tag as something that always comes after my identifier and before the next link (or EOF).

So I'm trying to edit my regex to grab this larger (multiline) section of text; then, if the identifier is the correct one, I'll use my first regex, or a slightly modified version, to grab just the URL from within the match.

I've been using http://www.aivosto.com/vbtips/regex.html as a helpful source on regex expressions, but when I test my expression on http://regexlib.com/RETester.aspx I get no results (my first expression worked fine). I think I'm pretty close, but the following isn't working:

^<A HREF=.*/TR

Any advice? The only difference from my first expression is the added '/TR'. I suspect it has to do with spaces or line breaks, but I don't know for certain.

I'm posting a sample of my much larger HTML below. I'm trying to capture the ^<A HREF=.* URL match only for items where the class td includes "Land Spread Vector". I prefer using multiple simple regex expressions over one donated expression that does it all, so I can understand my own code and at least attempt to troubleshoot if I need to change anything.

Any assistance would be greatly appreciated. Thanks!
Keith

<A Href=javascript:openDocument('0900043d802b3528'); <img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16 101998 </a </td <td class='classtd' Green-tipped Martin </td <td class='classtd' CURRENT,3.2 </td </TR <TR <TD</TD <TD <A Href=javascript:openDocument('0900043d803a1ce4'); <img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16 101998 - APRRE - Assert.doc </a </td <td class='classtd' Land Spread Vector </td <td class='classtd' CURRENT,3.0 </td </TR <TR <TD</TD <TD <A Href=javascript:openDocument('0900043d802b635e'); <img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16 101998-R </a </td <td class='classtd' Reevaluation </td <td class='classtd' CURRENT,1.0 </td </TR </TD</TR</TABLE<BR<BR <CENTER <A Href='javascript:history.back();'<img src='/OurDir/images/back_down.jpg' border=0 align='center' alt='Back'</A <A Href='javascript:goHome();'<img src='/OurDir/images/home_down.jpg' border=0 align='center' alt='Home'</A </CENTER </BODY </HTML
#2
On Mon, 18 Feb 2008 10:26:16 -0400, "Ker_01" wrote:
> I think I'm pretty close, but the following isn't working: ^<A HREF=.*/TR Any advice? [...]
> I'm trying to only capture the ^<A HREF=.* URL match for items where the class td includes "Land Spread Vector".

Your description and the data confuse me a bit. It might be clearer to me if you posted exactly which links you expect to extract.

However, two suggestions:

1. In VBA, dot (".") never matches newline. So if you want to devise an expression that will match across multiple lines, you need to use something like "[\s\S]*"

2. If you want to match only those HREF matches that are followed by your tag, you could use a look-ahead assertion:

<A\sHREF=.*(?=[\S\s]*/TR)

Note that the use of the dot in the URL will restrict matches to only those URLs that are on a single line. If your URLs might extend across more than one line, then:

<A\sHREF=[\s\S]*?(?=[\S\s]*/TR)

--ron
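
For anyone who wants to see the two suggestions side by side, here is a minimal sketch in Python rather than VBA. Python's re module is a different engine from the RegExp object normally used from VBA, but for the features involved here (dot not crossing line breaks by default, [\s\S], lazy quantifiers, look-ahead) it should behave the same way. The two-row sample is an assumption: it reuses two document IDs from the original post, but the line breaks and the closing ">" characters are guessed, since the posted sample does not show them.

```python
import re

# Hypothetical two-row sample modelled on the post. The line breaks and the
# closing ">" characters are assumptions, not taken from the real file.
html = (
    "<A Href=javascript:openDocument('0900043d802b3528');>\n"
    "<td class='classtd'>Green-tipped Martin</td>\n"
    "</TR>\n"
    "<A Href=javascript:openDocument('0900043d803a1ce4');>\n"
    "<td class='classtd'>Land Spread Vector</td>\n"
    "</TR>\n"
)

# IGNORECASE because the pattern says HREF and the data says Href;
# MULTILINE so that ^ can match at the start of every line.
flags = re.IGNORECASE | re.MULTILINE

# Dot stops at the end of the line, so "/TR" two lines down is never reached.
dot_matches = re.findall(r"^<A HREF=.*/TR", html, flags)

# [\s\S] matches any character, newlines included; the lazy *? stops at the
# first "/TR" after each link, so each match is one link-to-/TR block.
block_matches = re.findall(r"^<A HREF=[\s\S]*?/TR", html, flags)

# The look-ahead only checks that "/TR" occurs somewhere later, so the match
# itself is just the link line, and every link followed by a "/TR" qualifies.
lookahead_matches = re.findall(r"<A\sHREF=.*(?=[\S\s]*/TR)", html, flags)

print(len(dot_matches))        # 0
print(len(block_matches))      # 2
print(len(lookahead_matches))  # 2
```

Note that the look-ahead version says nothing about how far away the "/TR" is, so it cannot by itself tell which row a link belongs to.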
#3
Ron, thank you for your reply. In the sample HTML in the original post, the only URL I ultimately need is

<A Href=javascript:openDocument('0900043d803a1ce4');

because it is the only one where the text between that URL and the next includes the text:

<td class='classtd' Land Spread Vector '<- what I really need to know
</td
.....
</TR

Your last suggested regex was very helpful; I changed it to look only for "Land Spread Vector" (LSV), as follows:

<A\sHREF=[\s\S]*?(?=[\S\s]*Land Spread Vector)

It returned the target URL, but also returned the URL above it, presumably because both are followed by the LSV (oops!). I like the idea of using a regex to return only the URLs that are followed by the LSV (saves me two steps!), but I'd need to learn how to have the regex not return the URL if it hits another URL before the LSV.

The alternative would be to return everything between the URL and the /TR (multiple lines of text), which would not cut across multiple URLs, and I could then check whether there is an LSV within that returned block of text. The expression above returns only the URL line itself, not the multiple lines of text that end in </TR

Thanks for any advice!
Keith
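
Both routes described in this post can be sketched as follows, again in Python as an illustration rather than VBA, and again on a hypothetical sample (the three document IDs come from the original post; the line breaks and closing ">" characters are assumptions). The helper name links_for is made up for this example. The first route is the "multiple simple expressions" approach: grab each link-to-/TR block, test it for the identifier, then pull the link line out of the block. The second is a single pattern whose look-ahead refuses to step over another link on its way to the identifier; whether the VBA RegExp object accepts it unchanged is untested here.

```python
import re

# Hypothetical three-row sample modelled on the post. The line breaks and the
# closing ">" characters are assumptions, not taken from the real file.
html = (
    "<TR>\n"
    "<A Href=javascript:openDocument('0900043d802b3528');>\n"
    "<td class='classtd'>Green-tipped Martin</td>\n"
    "</TR>\n"
    "<TR>\n"
    "<A Href=javascript:openDocument('0900043d803a1ce4');>\n"
    "<td class='classtd'>Land Spread Vector</td>\n"
    "</TR>\n"
    "<TR>\n"
    "<A Href=javascript:openDocument('0900043d802b635e');>\n"
    "<td class='classtd'>Reevaluation</td>\n"
    "</TR>\n"
)


def links_for(text, identifier):
    """Return the link lines whose block (link through the next /TR) contains identifier."""
    wanted = []
    # Step 1: one block per link, ending at the first "/TR" after it.
    for block in re.findall(r"<A\sHREF=[\s\S]*?/TR", text, re.IGNORECASE):
        # Step 2: plain substring test for the identifier.
        if identifier in block:
            # Step 3: the original single-line pattern extracts the link line.
            wanted.append(re.search(r"<A\sHREF=.*", block, re.IGNORECASE).group(0))
    return wanted


two_step = links_for(html, "Land Spread Vector")

# Single-pattern alternative: (?!<A\sHREF=) is checked before every character
# the look-ahead consumes, so it cannot cross into the next link's row.
one_pattern = re.findall(
    r"<A\sHREF=.*(?=(?:(?!<A\sHREF=)[\s\S])*?Land Spread Vector)",
    html,
    re.IGNORECASE,
)

print(two_step)
print(one_pattern)
```

On this sample both routes return only the link for document 0900043d803a1ce4. The two-step version is longer but each expression stays simple, which is the trade-off asked for in the original post.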