On Fri, 6 Feb 2009 13:20:31 -0800 (PST), Akrobrat
wrote:
Greetings all,
I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.
The expected result is:
/site/olspage.jsp?skuId=8936896&st=Transformers+Widescreen&type=product&id=1754542
However, if I do get a non-null result back, it is usually:
http://www.bestbuy.com/site/olspage....ry&id=cat00000
---------------------- Sample Patterns Tested: ----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? "
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"
---------------------- Partial Source Data (from website): ----------------------
<div class="logo">
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"><img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/></a>
</div>
<td class="skucontent">
<a href="/site/olspage.jsp?skuId=8936896&st=Transformers
+Widescreen&type=product&id=1754542" class="prodlink">
Transformers - Widescreen Dubbed Subtitle AC3</a><br/>
---------------------- ---------------------- ----------------------
I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!
And here's another version that might work a bit better, depending on your
specific requirements. It has no problem with embedded quotes in the URL. It
uses the Replace method to discard everything except the captured URL.
==============================
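Option Explicit

Function MovieURL(str As String) As String
    Dim re As Object
    ' Late-bound VBScript regex engine; no project reference needed
    Set re = CreateObject("vbscript.regexp")
    re.Global = True
    re.IgnoreCase = True
    ' The pattern spans the whole input, so Replace leaves only the
    ' captured href value ($1) of the class="prodlink" anchor
    re.Pattern = _
        "[\s\S]*<a\shref=""([\s\S]+)""\s*class=""prodlink""[\s\S]*"
    ' If nothing matches, Replace returns str unchanged
    MovieURL = re.Replace(str, "$1")
End Function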
==============================
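If you end up needing every prodlink URL on the page rather than just one, a
sketch along these lines might do it. It is untested against the live site and
makes a few assumptions: AllMovieURLs and DumpMovieURLs are names I made up,
the page source is already in a string, and the URLs contain no embedded
double quotes (the [^""] class stops at the first quote, unlike the Replace
version above). It uses Execute to loop over the matches and writes each
captured URL down column A of the active sheet.
==============================
Function AllMovieURLs(html As String) As Collection
    ' Hypothetical helper: collects the href of every class="prodlink" anchor
    Dim re As Object, m As Object
    Dim urls As New Collection
    Set re = CreateObject("vbscript.regexp")
    re.Global = True          ' find all matches, not just the first
    re.IgnoreCase = True
    ' Assumes href comes first and is immediately followed by class="prodlink"
    re.Pattern = "<a\s+href=""([^""]+)""\s*class=""prodlink"""
    For Each m In re.Execute(html)
        urls.Add m.SubMatches(0)   ' first capture group = the URL
    Next m
    Set AllMovieURLs = urls
End Function

Sub DumpMovieURLs(html As String)
    ' Hypothetical driver: one URL per row in column A of the active sheet
    Dim urls As Collection, i As Long
    Set urls = AllMovieURLs(html)
    For i = 1 To urls.Count
        ActiveSheet.Cells(i, 1).Value = urls(i)
    Next i
End Sub
==============================
Call it as DumpMovieURLs pageSource, where pageSource holds the HTML. If the
source wraps a URL across lines, as in the sample, the captured text will
contain the line break, so you may want to strip vbCr and vbLf before writing
it to the sheet.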
--ron