ExcelBanter

Ker_01

Regex syntax request for help
 
I'm parsing an HTML file, and originally I thought I only needed to capture
all the links. The following worked well in my particular application
(sample HTML snippet pasted at bottom of post):
^<A HREF=.*>

However, now I've found that I only need to capture and process certain
links. The information that will determine whether a link needs to be
processed is buried between the original link and the next link (or EOF), so
I need to capture a larger (multiline) section of text and test each one to
see if it contains my identifier. It appears that I'm safe using the </TR>
tag as something that always comes after my new identifier and before the
next link (or EOF). So, I'm trying to edit my regex so I can grab this
larger (multiline) section of text, then if the identifier is the correct
one, I'll use my first regex expression or a slightly modified version to
grab just the URL from within the match.

I've been using http://www.aivosto.com/vbtips/regex.html as a helpful source
on regex expressions, but when I test my code on
http://regexlib.com/RETester.aspx I'm getting no results (my first
expression worked fine). Any assistance would be greatly appreciated. I
think I'm pretty close, but the following isn't working:
^<A HREF=.*/TR>

Any advice? The only difference is replacing the single '>' with '/TR>'. I
suspect it may have to do with spaces or linebreaks, but I don't know for
certain.

I'm posting a sample of my much larger HTML below; I'm trying to only
capture the ^<A HREF=.*> URL match for items where the 'classtd' cell
contains "Land Spread Vector".

I prefer using multiple simple Regex expressions versus one donated
expression that does it all, so I can understand my own code and at least
attempt to troubleshoot if I need to change anything.


Thanks!
Keith


<A Href=javascript:openDocument('0900043d802b3528');>
<img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16>
&nbsp;101998
</a>
</td>
<td class='classtd'>
Green-tipped Martin
</td>
<td class='classtd'>
CURRENT,3.2
</td>

</TR>

<TR>
<TD></TD>
<TD>
<A Href=javascript:openDocument('0900043d803a1ce4');>
<img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16>
&nbsp;101998 - APRRE - Assert.doc
</a>
</td>
<td class='classtd'>
Land Spread Vector
</td>
<td class='classtd'>
CURRENT,3.0
</td>

</TR>

<TR>
<TD></TD>
<TD>
<A Href=javascript:openDocument('0900043d802b635e');>
<img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16>
&nbsp;101998-R
</a>
</td>
<td class='classtd'>
Reevaluation
</td>
<td class='classtd'>
CURRENT,1.0
</td>

</TR>
</TD></TR></TABLE><BR><BR>
<CENTER>
<A Href='javascript:history.back();'><img
src='/OurDir/images/back_down.jpg' border=0 align='center'
alt='Back'></A>&nbsp;
<A Href='javascript:goHome();'><img
src='/OurDir/images/home_down.jpg' border=0 align='center' alt='Home'></A>
</CENTER>
</BODY>
</HTML>




Ron Rosenfeld

Regex syntax request for help
 
On Mon, 18 Feb 2008 10:26:16 -0400, "Ker_01" wrote:

[full quote of the original post snipped]

Your description and the data confuse me a bit. It might be clearer to me if
you posted exactly which links you expect to extract.

However, two suggestions:

1. In VBA, dot (".") never matches newline. So if you want to devise an
expression that will match across multiple lines, you need to use something
like "[\s\S]*"

2. If you want to match only those HREF matches that are followed by your
tag, you could use a look-ahead assertion:

<A\sHREF=.*(?=[\S\s]*/TR)

Note that the use of the dot in the URL will restrict matches to only those
URLs that are on a single line. If your URLs might extend across more than
one line, then:

<A\sHREF=[\s\S]*?(?=[\S\s]*/TR)


--ron
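
The first suggestion above (dot does not cross line breaks, while "[\s\S]" does) can be checked with a small sketch. This is written in Python rather than VBA, on the assumption that Python's re module behaves the same way as the VBScript RegExp engine for these particular constructs; the two-row SAMPLE string and its 'AAA'/'BBB' document ids are made up for illustration, and the re.IGNORECASE and re.MULTILINE flags stand in for the IgnoreCase and MultiLine properties of the VBScript RegExp object.

```python
import re

# Hypothetical two-row sample modeled on the HTML in the original post.
SAMPLE = (
    "<A Href=javascript:openDocument('AAA');>\n"
    "<td class='classtd'>\n"
    "Green-tipped Martin\n"
    "</td>\n"
    "</TR>\n"
    "<A Href=javascript:openDocument('BBB');>\n"
    "<td class='classtd'>\n"
    "Land Spread Vector\n"
    "</td>\n"
    "</TR>\n"
)

FLAGS = re.IGNORECASE | re.MULTILINE

# "." stops at the end of the line, so "/TR>" on a later line is never reached.
dot_matches = re.findall(r"^<A HREF=.*/TR>", SAMPLE, FLAGS)

# "[\s\S]" matches any character, newlines included; the lazy "*?" stops at
# the first "/TR>" instead of running on to the last one in the file.
block_matches = re.findall(r"^<A HREF=[\s\S]*?/TR>", SAMPLE, FLAGS)

print(len(dot_matches))    # 0
print(len(block_matches))  # 2
```

Each element of block_matches is one multiline block that starts at a link and ends at the nearest "/TR>", which is the unit the original post wanted to test for an identifier.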

Ker_01

Regex syntax request for help
 
Ron, thank you for your reply. In the sample HTML in the original post, the
only URL I ultimately need is
<A Href=javascript:openDocument('0900043d803a1ce4');>

because it is the only one where the text between that URL and the next
includes the text:
<td class='classtd'>
Land Spread Vector '<- what I really need to know
</td>
.....
</TR>

Your last suggested regex was very helpful; I changed it to only look for
the LSV as follows:
<A\sHREF=[\s\S]*?(?=[\S\s]*Land Spread Vector)

It returned the target URL, but also returned the URL above it, presumably
because they are both followed by the LSV (oops!). I like the idea of using
regex to only return the URLs that are followed by LSV (saves me two steps!)
but I'd need to learn how to have the regex not return the URL if it hits
another URL before the LSV.
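
One possible way to stop the look-ahead from running past the next link is to guard every character it consumes with a negative look-ahead, "(?:(?!<A\sHREF=)[\s\S])*?". The sketch below is in Python, not VBA, and assumes the VBScript engine treats "(?!...)" and "(?:...)" the same way; the three-row SAMPLE and its ids are hypothetical, and the link itself is matched with the single-line ".*" form.

```python
import re

# Hypothetical three-row sample; only the 'BBB' row carries the identifier.
SAMPLE = (
    "<A Href=javascript:openDocument('AAA');>\n"
    "<td class='classtd'>\nGreen-tipped Martin\n</td>\n</TR>\n"
    "<A Href=javascript:openDocument('BBB');>\n"
    "<td class='classtd'>\nLand Spread Vector\n</td>\n</TR>\n"
    "<A Href=javascript:openDocument('CCC');>\n"
    "<td class='classtd'>\nReevaluation\n</td>\n</TR>\n"
)

# Unguarded look-ahead: any link with the identifier anywhere later matches.
naive = re.findall(
    r"<A\sHREF=.*(?=[\s\S]*Land Spread Vector)", SAMPLE, re.IGNORECASE
)

# Guarded look-ahead: each character consumed on the way to the identifier
# must not be the start of another "<A HREF=", so the search cannot cross
# into the next link's block.
guarded = re.findall(
    r"<A\sHREF=.*(?=(?:(?!<A\sHREF=)[\s\S])*?Land Spread Vector)",
    SAMPLE,
    re.IGNORECASE,
)

print(len(naive))  # 2
print(guarded)     # ["<A Href=javascript:openDocument('BBB');>"]
```

The guarded pattern is harder to read than two simple expressions, which may matter given the stated preference for expressions that are easy to troubleshoot.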

The alternative would be to return everything between the URL and the /TR
(multiple lines of text) which would not cut across multiple URLs, and I
could look to see if there was an LSV within that returned text block. The
expression above is only returning the URL line itself, not the multiple
lines of text that end in </TR>.

Thanks for any advice!
Keith
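
The alternative described above (grab each block up to "/TR>", test it for the identifier, then pull the link out with the original simple expression) might look like the following. It is a Python sketch under stated assumptions, not VBA: links_for and SAMPLE are hypothetical names and data, and the same patterns would still need to be tried against the VBScript RegExp object with Global and IgnoreCase set to True.

```python
import re

# Hypothetical three-row sample; only the 'BBB' row carries the identifier.
SAMPLE = (
    "<A Href=javascript:openDocument('AAA');>\n"
    "<td class='classtd'>\nGreen-tipped Martin\n</td>\n</TR>\n"
    "<A Href=javascript:openDocument('BBB');>\n"
    "<td class='classtd'>\nLand Spread Vector\n</td>\n</TR>\n"
    "<A Href=javascript:openDocument('CCC');>\n"
    "<td class='classtd'>\nReevaluation\n</td>\n</TR>\n"
)


def links_for(html, identifier):
    """Return the <A HREF=...> line of every block that contains identifier."""
    # Step 1: one block per link, from "<A HREF=" to the nearest "/TR>".
    blocks = re.findall(r"<A\sHREF=[\s\S]*?/TR>", html, re.IGNORECASE)
    links = []
    for block in blocks:
        # Step 2: plain substring test for the identifier inside the block.
        if identifier in block:
            # Step 3: the single-line expression extracts just the link line.
            link = re.match(r"<A\sHREF=.*", block, re.IGNORECASE)
            links.append(link.group(0))
    return links


print(links_for(SAMPLE, "Land Spread Vector"))
# ["<A Href=javascript:openDocument('BBB');>"]
```

Because the lazy "[\s\S]*?" stops at the first "/TR>", a block never spans two links as long as every link is followed by its own "</TR>", as it is in the posted sample.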
