Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 10
Default VBA Regular Expressions & URL Extraction

Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
/site/olspage.jsp?skuId=8936896&st=Transformers+Wide screen&type=product&id=1754542


However, if I do get a non-null result back, it is usually:
http://www.bestbuy.com/site/olspage....ry&id=cat00000



---------------------- Sample Patterns Tested:
----------------------
..Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? "
..Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
..Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo"
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"<img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/</a
</div

<td class="skucontent"

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink"
Transformers - Widescreen Dubbed Subtitle AC3</a<br/

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!

  #2   Report Post  
Posted to microsoft.public.excel.programming
jaf jaf is offline
external usenet poster
 
Posts: 300
Default VBA Regular Expressions & URL Extraction

Hi,
If.
Let me stress that.
IF this were a hyperlink on a webpage "/site/olspage.jsp" would be pointing to the subfolder "/site" and the file "olspage.jsp" on
the WEBSERVER and would function.

As a hyperlink in Excel this will not work without the http://www. unless you have a folder on your hard drive named "/site" and
Excel can find it without the rest of the path.

John


"Akrobrat" wrote in message ...
Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
/site/olspage.jsp?skuId=8936896&amp;st=Transformers+Wide screen&amp;type=product&amp;id=1754542


However, if I do get a non-null result back, it is usually:
http://www.bestbuy.com/site/olspage....ry&id=cat00000



---------------------- Sample Patterns Tested:
----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? "
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo"
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"<img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/</a
</div

<td class="skucontent"

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink"
Transformers - Widescreen Dubbed Subtitle AC3</a<br/

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!


  #3   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Fri, 6 Feb 2009 13:20:31 -0800 (PST), Akrobrat
wrote:

Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
/site/olspage.jsp?skuId=8936896&amp;st=Transformers+Wide screen&amp;type=product&amp;id=1754542


However, if I do get a non-null result back, it is usually:
http://www.bestbuy.com/site/olspage....ry&id=cat00000



---------------------- Sample Patterns Tested:
----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? "
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo"
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"<img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/</a
</div

<td class="skucontent"

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink"
Transformers - Widescreen Dubbed Subtitle AC3</a<br/

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!



Try this:

===================
Function MovieURL(str As String) As String
Dim myRegExp, myMatches, ResultString
Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Pattern = "href=""([^""]+)""\s*class=""prodlink"""
Set myMatches = myRegExp.Execute(str)
If myMatches.Count = 1 Then
MovieURL = myMatches(0).SubMatches(0)
End If
End Function
=====================
--ron
  #4   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Fri, 6 Feb 2009 13:20:31 -0800 (PST), Akrobrat
wrote:

---------------------- Sample Patterns Tested:
----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? "
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"


I did not tease this apart, but:

To include " in VBA, you need to "double" them.

The angle bracket is not a meta-character and the initial "\" should be
removed. (Unless you are trying to match a beginning of word boundary, in
which case you could use "\b(?=\w)" )

Dot will not match newline in VBScript or JavaScript flavors. Since your data
extends over several lines, it is likely that your "." is not doing what you
expect.
--ron
  #5   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 10
Default VBA Regular Expressions & URL Extraction

Jaf,

Good point. I do indeed concatenate "http://..." with the partial URL
and that works provided that I extract the correct URL. Thanks.

- Dilan


  #6   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Fri, 06 Feb 2009 20:55:05 -0500, Ron Rosenfeld
wrote:

===================
Function MovieURL(str As String) As String
Dim myRegExp, myMatches, ResultString
Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Pattern = "href=""([^""]+)""\s*class=""prodlink"""
Set myMatches = myRegExp.Execute(str)
If myMatches.Count = 1 Then
MovieURL = myMatches(0).SubMatches(0)
End If
End Function
=====================


A bit simpler, but works on your example:

===================
Option Explicit
Function MovieURL(str As String) As String
Dim myRegExp As Object, myMatches As Object
Set myRegExp = CreateObject("vbscript.regexp")
myRegExp.Pattern = "[^""]+(?=""\s*class=""prodlink"")"
If myRegExp.Test(str) = True Then
Set myMatches = myRegExp.Execute(str)
MovieURL = myMatches(0).Value
End If
End Function
=====================

By the way, both of my variants depend on there not being quotes (") within the
extracted portion.
--ron
  #7   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 10
Default VBA Regular Expressions & URL Extraction

Ron,

Thanks for the info. As you suggested, I tried:
..Pattern = "href=""([^""]+)""\s*class=""prodlink"""
.... but it didn't work. I too suspected the dot might be a source of
trouble, but I've implemented it successfully on other movie sites.

So that I'm not providing false assumptions, here is my code in
context. Note that I typically ignore the case and avoid a global
search because I only want the first accurate match - which is my core
problem. How do I get the regex pattern to match the HREF tag when it
contains the CLASS="prodlink" tag? I realize the code could be neater
(and I appreciate any advice), but all I want to do right now is
develop an accurate pattern.


--------------------------------------

Private Sub getBestBuy(strItem As String, lngRow As Long, IE As
InternetExplorer)

Dim RegexURL As RegExp
Dim RegexMatch As MatchCollection
Dim strURL As String

Set RegexURL = New RegExp

With RegexURL
.MultiLine = True
.IgnoreCase = True
.Global = False
'The following pattern was recommended by Ron from Google
Groups
.Pattern = "<a\s*href=""([^""]+)""\s*class=""prodlink"""
End With

With IE
.Visible = True
'Replace spaces with plus signs in URL as per BB's approach
.Navigate "http://www.bestbuy.com/site/olspage.jsp?
_dyncharset=ISO-8859-1&id=pcat17071&type=page&st=" & Replace(strItem,
" ", "+") & "&sc=Global&cp=1&nrp=15&sp=&qp=&list=n&iht=y&usc=A ll
+Categories&ks=960"
End With

'Load the page completely
Do Until IE.ReadyState = READYSTATE_COMPLETE And IE.Busy = False
Loop

'Get the URL of the first listed item, which is usually the
most accurate based on the keyword search
strPageContent = IE.Document.body.innerHTML

If RegexURL.Test(strPageContent) Then
Set RegexMatch = RegexURL.Execute(strPageContent)
Range("K" & lngRow).Value = "http://www.bestbuy.com" &
RegexMatch(0).SubMatches(0)
End If

'//--------------- Rest of code continues here ---------------//

End Sub

  #8   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Fri, 6 Feb 2009 13:20:31 -0800 (PST), Akrobrat
wrote:

Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
/site/olspage.jsp?skuId=8936896&amp;st=Transformers+Wide screen&amp;type=product&amp;id=1754542


However, if I do get a non-null result back, it is usually:
http://www.bestbuy.com/site/olspage....ry&id=cat00000



---------------------- Sample Patterns Tested:
----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? "
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo"
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"<img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/</a
</div

<td class="skucontent"

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink"
Transformers - Widescreen Dubbed Subtitle AC3</a<br/

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!



And here's another version that might work a bit better, depending on your
specific requirements. It has no problem with embedded quotes in the URL. This
uses the Replace method to get rid of everything else.

==============================
Option Explicit
Function MovieURL(str As String) As String
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.IgnoreCase = True
re.Pattern = _
"[\s\S]*<a\shref=""([\s\S]+)""\s*class=""prodlink""[\s\S]*"
MovieURL = re.Replace(str, "$1")
End Function
==============================
--ron
  #9   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Sat, 7 Feb 2009 06:30:19 -0800 (PST), Akrobrat
wrote:

Thanks for the info. As you suggested, I tried:
.Pattern = "href=""([^""]+)""\s*class=""prodlink"""
... but it didn't work. I too suspected the dot might be a source of
trouble, but I've implemented it successfully on other movie sites.


Hmmm.

What does "didn't work" mean? No match or an error message?

That pattern works fine here using the data you provided, as well as a minor
variation I included in the data.

I assume you have a reference set to "Microsoft VBScript Regular Expressions"
5.5 or else you'd be getting some error message from VBA.

In this context, setting the Multiline and Global parameters is irrelevant, but
harmless.

Perhaps if you posted the full contents of strPageContent I could make some
further suggestions.

Or perhaps you could try the Replace method I posted in a different context.
--ron
  #10   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 10
Default VBA Regular Expressions & URL Extraction


Hmmm.

What does "didn't work" mean? *No match or an error message?

That pattern works fine here using the data you provided, as well as a minor
variation I included in the data.

I assume you have a reference set to "Microsoft VBScriptRegularExpressions"
5.5 or else you'd be getting some error message fromVBA.

In this context, setting the Multiline and Global parameters is irrelevant, but
harmless.

Perhaps if you posted the full contents of strPageContent I could make some
further suggestions.

Or perhaps you could try the Replace method I posted in a different context.



  #11   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Sat, 7 Feb 2009 07:34:41 -0800 (PST), Akrobrat
wrote:

Ron,

Sorry, by error I meant that there wasn't a match for either of the
examples. In fact, the last Pattern suggestion stalled Excel / VBA.

I do have a reference set to MS VBScript Regex 5.5, and I don't know
if this matters much, but I use a Windows XP SP2 machine.
Instead of the posting the contents of strPageContent (very long),
here is the URL to the webpage that it contains:
http://www.bestbuy.com/site/olspage....egories&ks=960

I'll be AWOL for the next 36 hours but I appreciate your help!


I don't believe the problem is in your regex pattern.

I obtained the data by going to the above URL, executing the View Source
command in FireFox; copying that data, and then processing it within VBA. The
expected URL was returned.

(I had to do the extraction entirely within VBA as the page source is too long
to fit into an Excel cell.

The following worked fine, after copying the data to the clipboard and also
setting a reference to Microsoft Forms 2.0 Object Library:

=============================
Option Explicit
Sub GetClipboardData()
Dim PageSource As String
Dim myData As DataObject
Set myData = New DataObject
myData.GetFromClipboard
PageSource = myData.GetText
Debug.Print MovieURL(PageSource)
End Sub
'---------------------------------------
Function MovieURL(str As String) As String
Dim myRegExp, myMatches, ResultString
Set myRegExp = New RegExp
myRegExp.MultiLine = True
myRegExp.IgnoreCase = True
myRegExp.Global = False
myRegExp.Pattern = "<a\s*href=""([^""]+)""\s*class=""prodlink"""
Set myMatches = myRegExp.Execute(str)
If myMatches.Count = 1 Then
MovieURL = myMatches(0).SubMatches(0)
End If
End Function
==============================

Of interest, in the particular page you sent me to, there are 11 different URL
snippets that meet your criteria of ending with classid="prodlink"

However, unlike your example, all of them include this string in the middle:

;jsessionid=BGJY2UTT13ORTKC4D3IFAGA

Also, they differ by the skuid.

I'm guessing you don't want the jsessionid information.

A slight change will exclude that from the result:

==============================================
Option Explicit
Sub GetClipboardData()
Dim PageSource As String
Dim myData As DataObject
Set myData = New DataObject
myData.GetFromClipboard

PageSource = myData.GetText

Debug.Print MovieURL(PageSource)
End Sub
'-----------------------------------
Function MovieURL(str As String) As String
Dim myRegExp As RegExp
Dim myMatches As MatchCollection

Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Pattern = _
"""([^""]+)(;jsessionid[^""]+)(\?[^""]+)(?=""\s*class=""prodlink"")"

If myRegExp.Test(str) = True Then
Set myMatches = myRegExp.Execute(str)
MovieURL = myMatches(0).SubMatches(0) & _
myMatches(0).SubMatches(2)
End If
End Function
====================================
--ron
  #12   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Sat, 07 Feb 2009 09:48:29 -0500, Ron Rosenfeld
wrote:

On Fri, 6 Feb 2009 13:20:31 -0800 (PST), Akrobrat
wrote:

Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
/site/olspage.jsp?skuId=8936896&amp;st=Transformers+Wide screen&amp;type=product&amp;id=1754542


However, if I do get a non-null result back, it is usually:
http://www.bestbuy.com/site/olspage....ry&id=cat00000



---------------------- Sample Patterns Tested:
----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? "
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo"
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"<img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/</a
</div

<td class="skucontent"

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink"
Transformers - Widescreen Dubbed Subtitle AC3</a<br/

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!



And here's another version that might work a bit better, depending on your
specific requirements. It has no problem with embedded quotes in the URL. This
uses the Replace method to get rid of everything else.

==============================
Option Explicit
Function MovieURL(str As String) As String
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.IgnoreCase = True
re.Pattern = _
"[\s\S]*<a\shref=""([\s\S]+)""\s*class=""prodlink""[\s\S]*"
MovieURL = re.Replace(str, "$1")
End Function
==============================
--ron


Well, given your data and noting that there are multiple matches, I probably
wouldn't use this method.
--ron
  #13   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,934
Default VBA Regular Expressions & URL Extraction

The expected result is:
/site/olspage.jsp?skuId=8936896&amp;st=Transformers+Wide screen&amp;type=product&amp;id=1754542


Does it have to be a Regular Expression solution? This function will return
the first prodlink site (your expected result above from the Partial Source
Data text you posted)...

Function GetURL(S As String) As String
GetURL = Split(S, "class=""prodlink""", , vbTextCompare)(0)
GetURL = Mid(GetURL, InStrRev(GetURL, "href=") + 6)
GetURL = Replace(Replace(GetURL, vbLf, ""), vbCr, "")
GetURL = Trim(Replace(GetURL, """", ""))
End Function

--
Rick (MVP - Excel)


"Akrobrat" wrote in message
...
Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
/site/olspage.jsp?skuId=8936896&amp;st=Transformers+Wide screen&amp;type=product&amp;id=1754542


However, if I do get a non-null result back, it is usually:
http://www.bestbuy.com/site/olspage....ry&id=cat00000



---------------------- Sample Patterns Tested:
----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? "
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo"
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"<img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/</a
</div

<td class="skucontent"

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink"
Transformers - Widescreen Dubbed Subtitle AC3</a<br/

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!


  #14   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 10
Default VBA Regular Expressions & URL Extraction

Ron, my approach with Regular Expression was to capture the *first*
occurence of the HREF tag with the CLASS="prodlink" tag. However, you
might be right in that there's more going on here than expected - no
matter how much I tweak the regex, the returned value is either
incorrect or null. Your pattern suggestions will be useful for other
sites I hope to cover soon.

Rick, I don't have to use regex for this task, but I found that it was
a good solution early on in my project. Your function (with a couple
of modifications to handle the full HTML source) helped me solve the
problem.

Thank you immensely to both of you for helping with this challenge!
  #15   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Mon, 9 Feb 2009 08:04:33 -0800 (PST), Akrobrat
wrote:

Ron, my approach with Regular Expression was to capture the *first*
occurence of the HREF tag with the CLASS="prodlink" tag. However, you
might be right in that there's more going on here than expected - no
matter how much I tweak the regex, the returned value is either
incorrect or null. Your pattern suggestions will be useful for other
sites I hope to cover soon.

Rick, I don't have to use regex for this task, but I found that it was
a good solution early on in my project. Your function (with a couple
of modifications to handle the full HTML source) helped me solve the
problem.

Thank you immensely to both of you for helping with this challenge!


Glad you got it working. But it would be interesting to figure out why the
regex approach isn't working, especially since I was using the same data
source.
--ron


  #16   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 10
Default VBA Regular Expressions & URL Extraction

Glad you got it working. *But it would be interesting to figure out why the
regex approach isn't working, especially since I was using the same data
source.
--ron


Ron, I still don't know why the regex worked for you and not for me.
I sit behind a large corporation's firewall and use IE 6.0 that's been
slightly customized for the company's purposes. Could that have
something to do with it? Outside of that, I can't imagine what could
be wrong. Everything else is pretty standard - I use Win XP, Excel
2003 and the Microsoft VBScript Regular Expressions 5.5 library.

Thanks again.
  #17   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Thu, 12 Feb 2009 07:07:07 -0800 (PST), Akrobrat
wrote:

Glad you got it working. *But it would be interesting to figure out why the
regex approach isn't working, especially since I was using the same data
source.
--ron


Ron, I still don't know why the regex worked for you and not for me.
I sit behind a large corporation's firewall and use IE 6.0 that's been
slightly customized for the company's purposes. Could that have
something to do with it? Outside of that, I can't imagine what could
be wrong. Everything else is pretty standard - I use Win XP, Excel
2003 and the Microsoft VBScript Regular Expressions 5.5 library.

Thanks again.


I suppose there could be something about the data, but don't know what it is.

I tried again, using mostly the code you had previously posted, with some
variation to use data that had been copied to the clipboard, instead of
obtained via IE, and also to print the result to the immediate window, instead
of the worksheet, and it continues to work OK here.
--ron
  #18   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 10
Default VBA Regular Expressions & URL Extraction

Ron (and all),

Here's something else I found this morning that might be of interest:
In the following string from an HTML doc:

<p id="height10" class="price-strike2"$2,599.00/EA Each</p

.... I am unable to match it (aiming for the currency bit) using the
following reduced regular expressions within VBA:

..Pattern = "price-strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)"
..Pattern = "price-strike2\W?(\$[0-9]*,?[0-9]*\.?[0-9]*)"
..Pattern = "strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)"
..Pattern = "2\W?(\$[0-9]*,?[0-9]*\.?[0-9]*)"
..Pattern = "2""(\$[0-9]*,?[0-9]*\.?[0-9]*)"


Any Regexp tester tells me that all of the above would match the
string. Also, there is only one instance of a tag including the
phrase "strike2" in the entire HTML source, so there can be, at most,
only one match. The currency regexp is fine (I know this from
previous tests) and the pattern preceding it seems to be the root of
the problem. It is not essential that I get a solution to this, but
you can imagine that this is frustrating! =)

Thanks,
Dilan
  #19   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Fri, 13 Feb 2009 07:14:37 -0800 (PST), Akrobrat
wrote:

Ron (and all),

Here's something else I found this morning that might be of interest:
In the following string from an HTML doc:

<p id="height10" class="price-strike2"$2,599.00/EA Each</p

... I am unable to match it (aiming for the currency bit) using the
following reduced regular expressions within VBA:

.Pattern = "price-strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)"
.Pattern = "price-strike2\W?(\$[0-9]*,?[0-9]*\.?[0-9]*)"
.Pattern = "strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)"
.Pattern = "2\W?(\$[0-9]*,?[0-9]*\.?[0-9]*)"
.Pattern = "2""(\$[0-9]*,?[0-9]*\.?[0-9]*)"


Any Regexp tester tells me that all of the above would match the
string. Also, there is only one instance of a tag including the
phrase "strike2" in the entire HTML source, so there can be, at most,
only one match. The currency regexp is fine (I know this from
previous tests) and the pattern preceding it seems to be the root of
the problem. It is not essential that I get a solution to this, but
you can imagine that this is frustrating! =)

Thanks,
Dilan


Dilan,

I wonder if you are running into the same problem you did with the previous
Regex you could not get to work. (What I suspect is a problem in your VBA
routines).

Your first regex works fine here, returning the "currency bit" into the first
submatch.

Here is the routine I used with your regex and sample:

============================
Option Explicit
Function GetStrike2(sStr As String)
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Pattern = """price-strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)"
If re.test(sStr) = True Then
Set mc = re.Execute(sStr)
GetStrike2 = mc(0).submatches(0)
End If
End Function
================================

Against your sample, this returns $2,599.00
--ron
  #20   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 10
Default VBA Regular Expressions & URL Extraction

So do you think there's a difference between:
Dim re as Object
Set re = CreateObject("vbscript.regexp")
.... and ...
Dim re as RegExp
Set re = New RegExp
?

I'll test it but I wondered if you have some insight into this. The
rest of my code (the actual regexp pattern test) is the same as yours.

Thanks!


  #21   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Fri, 13 Feb 2009 11:50:21 -0800 (PST), Akrobrat
wrote:

So do you think there's a difference between:
Dim re as Object
Set re = CreateObject("vbscript.regexp")
... and ...
Dim re as RegExp
Set re = New RegExp
?

I'll test it but I wondered if you have some insight into this. The
rest of my code (the actual regexp pattern test) is the same as yours.

Thanks!


No, there should not be any difference so long as you have a reference set to
Microsoft VBScript Regular Expressions 5.5. (That is not required with my
code). Also, if you have the reference set, I would DIM mc as match collection
rather than as object.

For trouble shooting, I would suggest you first try my code exactly as written,
all by itself (i.e. not embedded within your code).

In other words, enter the code I posted all by itself in a module.

Then --
A1: (please copy/paste from here)

<p id="height10" class="price-strike2"$2,599.00/EA Each</p

B1:
=GetStrike2($A$1)

And see if that works. (returns $2,599.00)

If that works, and you want to use early-binding, set the Reference and use
this code instead:

====================================
Option Explicit
Function GetStrike2(sStr As String)
Dim re As RegExp, mc As MatchCollection
Set re = New RegExp
re.IgnoreCase = True
re.Pattern = """price-strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)"
If re.test(sStr) = True Then
Set mc = re.Execute(sStr)
GetStrike2 = mc(0).submatches(0)
End If
Set re = Nothing
End Function
===============================

Again, use it "stand-alone"

If both work, then your problem is elsewhere.
--ron
  #22   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 10
Default VBA Regular Expressions & URL Extraction

Ron and all,

I used Ron's function on a "stand-alone" basis and used a subroutine
(TestGetStrike) to pass the HTML source (as string), but it didn't
return anything. No errors, but no result.

I haven't got a clue as to why this doesn't work. Do you think I have
VB references set that are clashing? I've currently got:
- Visual Basic for Applications
- Microsoft Excel 11.0 Object Library
- Microsoft Internet Controls
- Microsoft VBScript Regular Expressions 5.5
- OLE Automation
- Microsoft Office 11.0 Object Library
- Microsoft Forms 2.0 Object Library

I moved the RegExp and Internet Controls up in priority too. I also
don't have any items in Excel that have been disabled (listed under
Help About MS Excel). Any thoughts?


====================================

Option Explicit

Sub TestGetStrike()

Dim IE As InternetExplorer

Set IE = New InternetExplorer
IE.Visible = True
IE.Navigate "http://www.homedepot.com/webapp/wcs/stores/servlet/
ProductDisplay?
jspStoreDir=hdus&catalogId=10053&productId=1004014 83&navFlow=3&keyword=jvm1790sk"

'Load the page completely
Do Until IE.ReadyState = READYSTATE_COMPLETE And IE.Busy = False
Loop

Range("A1").Value = GetStrike2(IE.Document.body.innerHTML)

End Sub


Function GetStrike2(sStr As String)
Dim re As RegExp, mc As MatchCollection
Set re = New RegExp
re.IgnoreCase = True
re.Pattern = """price-strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)"
If re.Test(sStr) = True Then
Set mc = re.Execute(sStr)
GetStrike2 = mc(0).SubMatches(0)
End If
Set re = Nothing
End Function

====================================
  #23   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 10
Default VBA Regular Expressions & URL Extraction

Ron,

Sorry, in my frustration with this whole thing, I forgot to note that
your code as a Worksheet function worked perfectly, returning the
correct $ result. Therefore, this problem MUST be somewhere else, and
that is what I'm trying to figure out. Thanks, Ron!

- Dilan
  #24   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Mon, 16 Feb 2009 08:35:25 -0800 (PST), Akrobrat
wrote:

Ron,

Sorry, in my frustration with this whole thing, I forgot to note that
your code as a Worksheet function worked perfectly, returning the
correct $ result. Therefore, this problem MUST be somewhere else, and
that is what I'm trying to figure out. Thanks, Ron!

- Dilan


We're making progress.

Where I would suggest you check next is to ensure that the HTML derived string
being passed to the VBA function is, in fact, what it should be.

One way to do that would be to use the code I provided as a function. So you
might have something like:

===========================
Sub Dilan1()
....get the HTML string and assign it to strPageContent

strURL = getStrike2(strPageContent)

debug.print "result = " & strURL

end sub
====================

If the immediate window only prints result = , then it is likely the case
that strPageContent does not contain the relevant HTML substring.

If the immediate window prints the correct result, then your problem is
downstream.
--ron
  #25   Report Post  
Posted to microsoft.public.excel.programming
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Mon, 16 Feb 2009 08:30:38 -0800 (PST), Akrobrat
wrote:

- Microsoft Internet Controls


By the way, where did you find this?

I cannot locate a reference by that name in my Tools/References dialog box.
--ron
Reply
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules

Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Get rid of with regular expressions Howdy Excel Discussion (Misc queries) 1 January 18th 10 07:42 PM
Regular expressions Dave Excel Programming 5 September 20th 07 02:46 PM
Using Regular Expressions with VBA Andrew Hall NZ Excel Programming 5 November 21st 06 09:30 PM
Regular expressions JeffMelton Excel Programming 2 March 1st 06 12:52 AM
VBA and Regular expressions Friedrich Muecke Excel Programming 3 October 3rd 03 01:46 AM


All times are GMT +1. The time now is 02:58 PM.

Powered by vBulletin® Copyright ©2000 - 2025, Jelsoft Enterprises Ltd.
Copyright ©2004-2025 ExcelBanter.
The comments are property of their posters.
 

About Us

"It's about Microsoft Excel"