View Single Post
  #10   Report Post  
Posted to microsoft.public.excel.programming
Ron Rosenfeld Ron Rosenfeld is offline
external usenet poster
 
Posts: 5,651
Default VBA Regular Expressions & URL Extraction

On Sat, 7 Feb 2009 07:34:41 -0800 (PST), Akrobrat
wrote:

Ron,

Sorry, by error I meant that there wasn't a match for either of the
examples. In fact, the last Pattern suggestion stalled Excel / VBA.

I do have a reference set to MS VBScript Regex 5.5, and I don't know
if this matters much, but I use a Windows XP SP2 machine.
Instead of the posting the contents of strPageContent (very long),
here is the URL to the webpage that it contains:
http://www.bestbuy.com/site/olspage....egories&ks=960

I'll be AWOL for the next 36 hours but I appreciate your help!


I don't believe the problem is in your regex pattern.

I obtained the data by going to the above URL, executing the View Source
command in FireFox; copying that data, and then processing it within VBA. The
expected URL was returned.

(I had to do the extraction entirely within VBA as the page source is too long
to fit into an Excel cell.

The following worked fine, after copying the data to the clipboard and also
setting a reference to Microsoft Forms 2.0 Object Library:

=============================
Option Explicit
Sub GetClipboardData()
Dim PageSource As String
Dim myData As DataObject
Set myData = New DataObject
myData.GetFromClipboard
PageSource = myData.GetText
Debug.Print MovieURL(PageSource)
End Sub
'---------------------------------------
Function MovieURL(str As String) As String
Dim myRegExp, myMatches, ResultString
Set myRegExp = New RegExp
myRegExp.MultiLine = True
myRegExp.IgnoreCase = True
myRegExp.Global = False
myRegExp.Pattern = "<a\s*href=""([^""]+)""\s*class=""prodlink"""
Set myMatches = myRegExp.Execute(str)
If myMatches.Count = 1 Then
MovieURL = myMatches(0).SubMatches(0)
End If
End Function
==============================

Of interest, in the particular page you sent me to, there are 11 different URL
snippets that meet your criteria of ending with classid="prodlink"

However, unlike your example, all of them include this string in the middle:

;jsessionid=BGJY2UTT13ORTKC4D3IFAGA

Also, they differ by the skuid.

I'm guessing you don't want the jsessionid information.

A slight change will exclude that from the result:

==============================================
Option Explicit
Sub GetClipboardData()
Dim PageSource As String
Dim myData As DataObject
Set myData = New DataObject
myData.GetFromClipboard

PageSource = myData.GetText

Debug.Print MovieURL(PageSource)
End Sub
'-----------------------------------
Function MovieURL(str As String) As String
Dim myRegExp As RegExp
Dim myMatches As MatchCollection

Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Pattern = _
"""([^""]+)(;jsessionid[^""]+)(\?[^""]+)(?=""\s*class=""prodlink"")"

If myRegExp.Test(str) = True Then
Set myMatches = myRegExp.Execute(str)
MovieURL = myMatches(0).SubMatches(0) & _
myMatches(0).SubMatches(2)
End If
End Function
====================================
--ron