  #1   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 100
Default Regex syntax request for help

I'm parsing an HTML file, and originally I thought I only needed to capture
all the links. The following worked well in my particular application
(sample HTML snippet pasted at bottom of post):
^<A HREF=.*>

However, now I've found that I only need to capture and process certain
links. The information that will determine whether a link needs to be
processed is buried between the original link and the next link (or EOF), so
I need to capture a larger (multiline) section of text and test each one to
see if it contains my identifier. It appears that I'm safe using the </TR>
tag as something that always comes after my new identifier and before the
next link (or EOF). So, I'm trying to edit my regex so I can grab this
larger (multiline) section of text, then if the identifier is the correct
one, I'll use my first regex expression or a slightly modified version to
grab just the URL from within the match.

I've been using http://www.aivosto.com/vbtips/regex.html as a helpful source
on regex expressions, but when I test my code on
http://regexlib.com/RETester.aspx I'm getting no results (my first
expression worked fine). Any assistance would be greatly appreciated. I
think I'm pretty close, but the following isn't working:
^<A HREF=.*/TR>

Any advice? The only difference is replacing the single '>' with '/TR>'. I
suspect it may have to do with spaces or linebreaks, but I don't know for
certain.

I'm posting a sample of my much larger HTML below; I'm trying to capture
only the ^<A HREF=.*> URL match for rows where a 'classtd' cell contains
"Land Spread Vector".

I prefer using multiple simple Regex expressions versus one donated
expression that does it all, so I can understand my own code and at least
attempt to troubleshoot if I need to change anything.


Thanks!
Keith


<A Href=javascript:openDocument('0900043d802b3528');>
<img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16>
&nbsp;101998
</a>
</td>
<td class='classtd'>
Green-tipped Martin
</td>
<td class='classtd'>
CURRENT,3.2
</td>
</TR>

<TR>
<TD></TD>
<TD>
<A Href=javascript:openDocument('0900043d803a1ce4');>
<img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16>
&nbsp;101998 - APRRE - Assert.doc
</a>
</td>
<td class='classtd'>
Land Spread Vector
</td>
<td class='classtd'>
CURRENT,3.0
</td>
</TR>

<TR>
<TD></TD>
<TD>
<A Href=javascript:openDocument('0900043d802b635e');>
<img src=/OurDir/images/formats/f_msw8_16.gif border=0 align=left width=16>
&nbsp;101998-R
</a>
</td>
<td class='classtd'>
Reevaluation
</td>
<td class='classtd'>
CURRENT,1.0
</td>
</TR>

</TD></TR></TABLE><BR><BR>
<CENTER>
<A Href='javascript:history.back();'><img
src='/OurDir/images/back_down.jpg' border=0 align='center'
alt='Back'></A>&nbsp;
<A Href='javascript:goHome();'><img
src='/OurDir/images/home_down.jpg' border=0 align='center' alt='Home'></A>
</CENTER>
</BODY>
</HTML>



  #2   Report Post  

On Mon, 18 Feb 2008 10:26:16 -0400, "Ker_01" wrote:

[snip]


Your description and the data confuse me a bit. It might be clearer to me if
you posted exactly which links you expect to extract.

However, two suggestions:

1. In VBA, dot (".") never matches newline. So if you want to devise an
expression that will match across multiple lines, you need to use something
like "[\s\S]*"

2. If you want to match only those HREF matches that are followed by your
tag, you could use a look-ahead assertion:

<A\sHREF=.*(?=[\S\s]*/TR)

Note that the use of the dot in the URL will restrict matches to only those
URLs that are on a single line. If your URLs might extend across more than
one line, then:

<A\sHREF=[\s\S]*?(?=[\S\s]*/TR)
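
The two suggestions above can be tried out in a small sketch. It uses Python's re module as a stand-in for the VBA regex engine, on the assumption that both treat ".", "[\s\S]" and look-ahead the same way for these patterns. The two rows are abbreviated from the sample in the original post, and the closing ">" on each tag is an assumption.

```python
import re

# Two abbreviated rows modelled on the posted sample (closing ">" assumed).
html = (
    "<A Href=javascript:openDocument('0900043d802b3528');>\n"
    "<td class='classtd'>\n"
    "Green-tipped Martin\n"
    "</td>\n"
    "</TR>\n"
    "<A Href=javascript:openDocument('0900043d803a1ce4');>\n"
    "<td class='classtd'>\n"
    "Land Spread Vector\n"
    "</td>\n"
    "</TR>\n"
)

flags = re.IGNORECASE  # the sample writes "Href", the patterns write "HREF"

# "." stops at the end of the line, so a "/TR" several lines down is never reached.
dot_matches = re.findall(r"<A\sHREF=.*/TR", html, flags)

# "[\s\S]" matches any character, newlines included; the lazy "*?" ends each
# match at the first "/TR" it meets instead of the last one in the file.
block_matches = re.findall(r"<A\sHREF=[\s\S]*?/TR", html, flags)

# The look-ahead only checks that "/TR" occurs somewhere later. It consumes
# nothing, so each match is just the link line.
link_matches = re.findall(r"<A\sHREF=.*(?=[\S\s]*/TR)", html, flags)

print(len(dot_matches), len(block_matches), len(link_matches))
```

With this input the dot-only pattern finds nothing, while the other two find one match per row: a whole block for the lazy pattern, a single link line for the look-ahead pattern.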


--ron
  #3   Report Post  

Ron, thank you for your reply. In the sample HTML in the original post, the
only URL I ultimately need is
<A Href=javascript:openDocument('0900043d803a1ce4');>

because it is the only one where the text between that URL and the next
includes the text:
<td class='classtd'>
Land Spread Vector '<- what I really need to know
</td>
.....
</TR>

Your last suggested regex was very helpful; I changed it to only look for
the LSV as follows:
<A\sHREF=[\s\S]*?(?=[\S\s]*Land Spread Vector)

It returned the target URL, but also returned the URL above it, presumably
because they are both followed by the LSV (oops!). I like the idea of using
regex to only return the URLs that are followed by LSV (saves me two steps!)
but I'd need to learn how to have the regex not return the URL if it hits
another URL before the LSV.
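
One possible way to stop the look-ahead from running past the next link is to let it advance one character at a time and refuse any position where another link starts. This is only a sketch and was not suggested in the thread. It uses Python's re module as a stand-in, assuming the VBA engine handles negative look-ahead "(?!...)" and non-capturing groups "(?:...)" the same way. The rows are abbreviated from the posted sample with the closing ">" assumed.

```python
import re

# Three abbreviated rows modelled on the posted sample (closing ">" assumed).
html = (
    "<A Href=javascript:openDocument('0900043d802b3528');>\n"
    "<td class='classtd'>\nGreen-tipped Martin\n</td>\n</TR>\n"
    "<A Href=javascript:openDocument('0900043d803a1ce4');>\n"
    "<td class='classtd'>\nLand Spread Vector\n</td>\n</TR>\n"
    "<A Href=javascript:openDocument('0900043d802b635e');>\n"
    "<td class='classtd'>\nReevaluation\n</td>\n</TR>\n"
)

pattern = re.compile(
    r"<A\sHREF=.*"                # the link line itself
    r"(?="                        # look ahead without consuming anything
    r"(?:(?!<A\sHREF=)[\s\S])*?"  # any characters, but never the start of another link
    r"Land Spread Vector"         # the identifier that must follow
    r")",
    re.IGNORECASE,
)

matches = pattern.findall(html)
print(matches)
```

Here the first link is rejected because the look-ahead would have to cross the second link to reach the identifier, and the third is rejected because no identifier follows it.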

The alternative would be to return everything between the URL and the /TR
(multiple lines of text) which would not cut across multiple URLs, and I
could look to see if there was an LSV within that returned text block. The
expression above is only returning the URL line itself, not the multiple
lines of text that end in </TR>.
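
The two-step alternative described above might look roughly like this. It is a sketch in Python's re module, used as a stand-in for the VBA regex engine. The helper name links_for is hypothetical, and the rows are abbreviated from the posted sample with the closing ">" assumed.

```python
import re

# Three abbreviated rows modelled on the posted sample (closing ">" assumed).
html = (
    "<A Href=javascript:openDocument('0900043d802b3528');>\n"
    "<td class='classtd'>\nGreen-tipped Martin\n</td>\n</TR>\n"
    "<A Href=javascript:openDocument('0900043d803a1ce4');>\n"
    "<td class='classtd'>\nLand Spread Vector\n</td>\n</TR>\n"
    "<A Href=javascript:openDocument('0900043d802b635e');>\n"
    "<td class='classtd'>\nReevaluation\n</td>\n</TR>\n"
)

# Step 1: everything from a link up to the first "/TR" after it (lazy, multiline).
BLOCK = re.compile(r"<A\sHREF=[\s\S]*?/TR", re.IGNORECASE)
# Step 2: the simple single-line pattern, applied inside one block.
LINK = re.compile(r"<A\sHREF=.*", re.IGNORECASE)


def links_for(identifier, text):
    """Return the link line of every block that contains the identifier."""
    found = []
    for block in BLOCK.findall(text):
        if identifier in block:
            # Each block starts with the link, so match() at position 0 suffices.
            found.append(LINK.match(block).group(0))
    return found


result = links_for("Land Spread Vector", html)
print(result)
```

Keeping the block capture and the link extraction as two simple patterns fits the stated preference for several small expressions over one that does everything.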

Thanks for any advice!
Keith

"Ron Rosenfeld" wrote in message
...
[snip]
