Home |
Search |
Today's Posts |
#1
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Greetings all,
I am trying to extract the URLs of a set of animated movies off various sites using regular expressions and then dump those URLs into an Excel document (via VBA). I have a decent grasp of regex but I have hit a brick wall lately with a particular site. I have experimented with a number of patterns but cannot yet get the correct result. The expected result is: /site/olspage.jsp?skuId=8936896&st=Transformers+Wide screen&type=product&id=1754542 However, if I do get a non-null result back, it is usually: http://www.bestbuy.com/site/olspage....ry&id=cat00000 ---------------------- Sample Patterns Tested: ---------------------- ..Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? " ..Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class" ..Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?" ---------------------- Partial Source Data (from website): ---------------------- <div class="logo" <a href="http://www.bestbuy.com/site/olspage.jsp? type=category&id=cat00000" name="&lid=hdr_logo"<img src="http:// images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo"/</a </div <td class="skucontent" <a href="/site/olspage.jsp?skuId=8936896&st=Transformers +Widescreen&type=product&id=1754542" class="prodlink" Transformers - Widescreen Dubbed Subtitle AC3</a<br/ ---------------------- ---------------------- ---------------------- I'm most interested in utilizing the [class="prodlink"] string as this is the tag that labels a movie URL. I know that regex in VBA can be a bit tricky owing to the use of double quotes and other non-alpha characters, but can any of you guys spot what I'm doing wrong? Thanks for your help! |
#2
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Hi,
If. Let me stress that. IF this were a hyperlink on a webpage "/site/olspage.jsp" would be pointing to the subfolder "/site" and the file "olspage.jsp" on the WEBSERVER and would function. As a hyperlink in Excel this will not work without the http://www. unless you have a folder on your hard drive named "/site" and Excel can find it without the rest of the path. John "Akrobrat" wrote in message ... Greetings all, I am trying to extract the URLs of a set of animated movies off various sites using regular expressions and then dump those URLs into an Excel document (via VBA). I have a decent grasp of regex but I have hit a brick wall lately with a particular site. I have experimented with a number of patterns but cannot yet get the correct result. The expected result is: /site/olspage.jsp?skuId=8936896&st=Transformers+Wide screen&type=product&id=1754542 However, if I do get a non-null result back, it is usually: http://www.bestbuy.com/site/olspage....ry&id=cat00000 ---------------------- Sample Patterns Tested: ---------------------- .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? " .Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class" .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?" ---------------------- Partial Source Data (from website): ---------------------- <div class="logo" <a href="http://www.bestbuy.com/site/olspage.jsp? type=category&id=cat00000" name="&lid=hdr_logo"<img src="http:// images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo"/</a </div <td class="skucontent" <a href="/site/olspage.jsp?skuId=8936896&st=Transformers +Widescreen&type=product&id=1754542" class="prodlink" Transformers - Widescreen Dubbed Subtitle AC3</a<br/ ---------------------- ---------------------- ---------------------- I'm most interested in utilizing the [class="prodlink"] string as this is the tag that labels a movie URL. I know that regex in VBA can be a bit tricky owing to the use of double quotes and other non-alpha characters, but can any of you guys spot what I'm doing wrong? Thanks for your help! |
#3
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Fri, 6 Feb 2009 13:20:31 -0800 (PST), Akrobrat
wrote: Greetings all, I am trying to extract the URLs of a set of animated movies off various sites using regular expressions and then dump those URLs into an Excel document (via VBA). I have a decent grasp of regex but I have hit a brick wall lately with a particular site. I have experimented with a number of patterns but cannot yet get the correct result. The expected result is: /site/olspage.jsp?skuId=8936896&st=Transformers+Wide screen&type=product&id=1754542 However, if I do get a non-null result back, it is usually: http://www.bestbuy.com/site/olspage....ry&id=cat00000 ---------------------- Sample Patterns Tested: ---------------------- .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? " .Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class" .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?" ---------------------- Partial Source Data (from website): ---------------------- <div class="logo" <a href="http://www.bestbuy.com/site/olspage.jsp? type=category&id=cat00000" name="&lid=hdr_logo"<img src="http:// images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo"/</a </div <td class="skucontent" <a href="/site/olspage.jsp?skuId=8936896&st=Transformers +Widescreen&type=product&id=1754542" class="prodlink" Transformers - Widescreen Dubbed Subtitle AC3</a<br/ ---------------------- ---------------------- ---------------------- I'm most interested in utilizing the [class="prodlink"] string as this is the tag that labels a movie URL. I know that regex in VBA can be a bit tricky owing to the use of double quotes and other non-alpha characters, but can any of you guys spot what I'm doing wrong? Thanks for your help! Try this: =================== Function MovieURL(str As String) As String Dim myRegExp, myMatches, ResultString Set myRegExp = New RegExp myRegExp.IgnoreCase = True myRegExp.Pattern = "href=""([^""]+)""\s*class=""prodlink""" Set myMatches = myRegExp.Execute(str) If myMatches.Count = 1 Then MovieURL = myMatches(0).SubMatches(0) End If End Function ===================== --ron |
#4
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Fri, 6 Feb 2009 13:20:31 -0800 (PST), Akrobrat
wrote: ---------------------- Sample Patterns Tested: ---------------------- .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? " .Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class" .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?" I did not tease this apart, but: To include " in VBA, you need to "double" them. The angle bracket is not a meta-character and the initial "\" should be removed. (Unless you are trying to match a beginning of word boundary, in which case you could use "\b(?=\w)" ) Dot will not match newline in VBScript or JavaScript flavors. Since your data extends over several lines, it is likely that your "." is not doing what you expect. --ron |
#5
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Jaf,
Good point. I do indeed concatenate "http://..." with the partial URL and that works provided that I extract the correct URL. Thanks. - Dilan |
#6
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Fri, 06 Feb 2009 20:55:05 -0500, Ron Rosenfeld
wrote: =================== Function MovieURL(str As String) As String Dim myRegExp, myMatches, ResultString Set myRegExp = New RegExp myRegExp.IgnoreCase = True myRegExp.Pattern = "href=""([^""]+)""\s*class=""prodlink""" Set myMatches = myRegExp.Execute(str) If myMatches.Count = 1 Then MovieURL = myMatches(0).SubMatches(0) End If End Function ===================== A bit simpler, but works on your example: =================== Option Explicit Function MovieURL(str As String) As String Dim myRegExp As Object, myMatches As Object Set myRegExp = CreateObject("vbscript.regexp") myRegExp.Pattern = "[^""]+(?=""\s*class=""prodlink"")" If myRegExp.Test(str) = True Then Set myMatches = myRegExp.Execute(str) MovieURL = myMatches(0).Value End If End Function ===================== By the way, both of my variants depend on there not being quotes (") within the extracted portion. --ron |
#7
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Ron,
Thanks for the info. As you suggested, I tried: ..Pattern = "href=""([^""]+)""\s*class=""prodlink""" .... but it didn't work. I too suspected the dot might be a source of trouble, but I've implemented it successfully on other movie sites. So that I'm not providing false assumptions, here is my code in context. Note that I typically ignore the case and avoid a global search because I only want the first accurate match - which is my core problem. How do I get the regex pattern to match the HREF tag when it contains the CLASS="prodlink" tag? I realize the code could be neater (and I appreciate any advice), but all I want to do right now is develop an accurate pattern. -------------------------------------- Private Sub getBestBuy(strItem As String, lngRow As Long, IE As InternetExplorer) Dim RegexURL As RegExp Dim RegexMatch As MatchCollection Dim strURL As String Set RegexURL = New RegExp With RegexURL .MultiLine = True .IgnoreCase = True .Global = False 'The following pattern was recommended by Ron from Google Groups .Pattern = "<a\s*href=""([^""]+)""\s*class=""prodlink""" End With With IE .Visible = True 'Replace spaces with plus signs in URL as per BB's approach .Navigate "http://www.bestbuy.com/site/olspage.jsp? _dyncharset=ISO-8859-1&id=pcat17071&type=page&st=" & Replace(strItem, " ", "+") & "&sc=Global&cp=1&nrp=15&sp=&qp=&list=n&iht=y&usc=A ll +Categories&ks=960" End With 'Load the page completely Do Until IE.ReadyState = READYSTATE_COMPLETE And IE.Busy = False Loop 'Get the URL of the first listed item, which is usually the most accurate based on the keyword search strPageContent = IE.Document.body.innerHTML If RegexURL.Test(strPageContent) Then Set RegexMatch = RegexURL.Execute(strPageContent) Range("K" & lngRow).Value = "http://www.bestbuy.com" & RegexMatch(0).SubMatches(0) End If '//--------------- Rest of code continues here ---------------// End Sub |
#8
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Fri, 6 Feb 2009 13:20:31 -0800 (PST), Akrobrat
wrote: Greetings all, I am trying to extract the URLs of a set of animated movies off various sites using regular expressions and then dump those URLs into an Excel document (via VBA). I have a decent grasp of regex but I have hit a brick wall lately with a particular site. I have experimented with a number of patterns but cannot yet get the correct result. The expected result is: /site/olspage.jsp?skuId=8936896&st=Transformers+Wide screen&type=product&id=1754542 However, if I do get a non-null result back, it is usually: http://www.bestbuy.com/site/olspage....ry&id=cat00000 ---------------------- Sample Patterns Tested: ---------------------- .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? " .Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class" .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?" ---------------------- Partial Source Data (from website): ---------------------- <div class="logo" <a href="http://www.bestbuy.com/site/olspage.jsp? type=category&id=cat00000" name="&lid=hdr_logo"<img src="http:// images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo"/</a </div <td class="skucontent" <a href="/site/olspage.jsp?skuId=8936896&st=Transformers +Widescreen&type=product&id=1754542" class="prodlink" Transformers - Widescreen Dubbed Subtitle AC3</a<br/ ---------------------- ---------------------- ---------------------- I'm most interested in utilizing the [class="prodlink"] string as this is the tag that labels a movie URL. I know that regex in VBA can be a bit tricky owing to the use of double quotes and other non-alpha characters, but can any of you guys spot what I'm doing wrong? Thanks for your help! And here's another version that might work a bit better, depending on your specific requirements. It has no problem with embedded quotes in the URL. This uses the Replace method to get rid of everything else. ============================== Option Explicit Function MovieURL(str As String) As String Dim re As Object Set re = CreateObject("vbscript.regexp") re.Global = True re.IgnoreCase = True re.Pattern = _ "[\s\S]*<a\shref=""([\s\S]+)""\s*class=""prodlink""[\s\S]*" MovieURL = re.Replace(str, "$1") End Function ============================== --ron |
#9
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Sat, 7 Feb 2009 06:30:19 -0800 (PST), Akrobrat
wrote: Thanks for the info. As you suggested, I tried: .Pattern = "href=""([^""]+)""\s*class=""prodlink""" ... but it didn't work. I too suspected the dot might be a source of trouble, but I've implemented it successfully on other movie sites. Hmmm. What does "didn't work" mean? No match or an error message? That pattern works fine here using the data you provided, as well as a minor variation I included in the data. I assume you have a reference set to "Microsoft VBScript Regular Expressions" 5.5 or else you'd be getting some error message from VBA. In this context, setting the Multiline and Global parameters is irrelevant, but harmless. Perhaps if you posted the full contents of strPageContent I could make some further suggestions. Or perhaps you could try the Replace method I posted in a different context. --ron |
#10
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]() Hmmm. What does "didn't work" mean? *No match or an error message? That pattern works fine here using the data you provided, as well as a minor variation I included in the data. I assume you have a reference set to "Microsoft VBScriptRegularExpressions" 5.5 or else you'd be getting some error message fromVBA. In this context, setting the Multiline and Global parameters is irrelevant, but harmless. Perhaps if you posted the full contents of strPageContent I could make some further suggestions. Or perhaps you could try the Replace method I posted in a different context. |
#11
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Sat, 7 Feb 2009 07:34:41 -0800 (PST), Akrobrat
wrote: Ron, Sorry, by error I meant that there wasn't a match for either of the examples. In fact, the last Pattern suggestion stalled Excel / VBA. I do have a reference set to MS VBScript Regex 5.5, and I don't know if this matters much, but I use a Windows XP SP2 machine. Instead of the posting the contents of strPageContent (very long), here is the URL to the webpage that it contains: http://www.bestbuy.com/site/olspage....egories&ks=960 I'll be AWOL for the next 36 hours but I appreciate your help! I don't believe the problem is in your regex pattern. I obtained the data by going to the above URL, executing the View Source command in FireFox; copying that data, and then processing it within VBA. The expected URL was returned. (I had to do the extraction entirely within VBA as the page source is too long to fit into an Excel cell. The following worked fine, after copying the data to the clipboard and also setting a reference to Microsoft Forms 2.0 Object Library: ============================= Option Explicit Sub GetClipboardData() Dim PageSource As String Dim myData As DataObject Set myData = New DataObject myData.GetFromClipboard PageSource = myData.GetText Debug.Print MovieURL(PageSource) End Sub '--------------------------------------- Function MovieURL(str As String) As String Dim myRegExp, myMatches, ResultString Set myRegExp = New RegExp myRegExp.MultiLine = True myRegExp.IgnoreCase = True myRegExp.Global = False myRegExp.Pattern = "<a\s*href=""([^""]+)""\s*class=""prodlink""" Set myMatches = myRegExp.Execute(str) If myMatches.Count = 1 Then MovieURL = myMatches(0).SubMatches(0) End If End Function ============================== Of interest, in the particular page you sent me to, there are 11 different URL snippets that meet your criteria of ending with classid="prodlink" However, unlike your example, all of them include this string in the middle: ;jsessionid=BGJY2UTT13ORTKC4D3IFAGA Also, they differ by the skuid. I'm guessing you don't want the jsessionid information. A slight change will exclude that from the result: ============================================== Option Explicit Sub GetClipboardData() Dim PageSource As String Dim myData As DataObject Set myData = New DataObject myData.GetFromClipboard PageSource = myData.GetText Debug.Print MovieURL(PageSource) End Sub '----------------------------------- Function MovieURL(str As String) As String Dim myRegExp As RegExp Dim myMatches As MatchCollection Set myRegExp = New RegExp myRegExp.IgnoreCase = True myRegExp.Pattern = _ """([^""]+)(;jsessionid[^""]+)(\?[^""]+)(?=""\s*class=""prodlink"")" If myRegExp.Test(str) = True Then Set myMatches = myRegExp.Execute(str) MovieURL = myMatches(0).SubMatches(0) & _ myMatches(0).SubMatches(2) End If End Function ==================================== --ron |
#12
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Sat, 07 Feb 2009 09:48:29 -0500, Ron Rosenfeld
wrote: On Fri, 6 Feb 2009 13:20:31 -0800 (PST), Akrobrat wrote: Greetings all, I am trying to extract the URLs of a set of animated movies off various sites using regular expressions and then dump those URLs into an Excel document (via VBA). I have a decent grasp of regex but I have hit a brick wall lately with a particular site. I have experimented with a number of patterns but cannot yet get the correct result. The expected result is: /site/olspage.jsp?skuId=8936896&st=Transformers+Wide screen&type=product&id=1754542 However, if I do get a non-null result back, it is usually: http://www.bestbuy.com/site/olspage....ry&id=cat00000 ---------------------- Sample Patterns Tested: ---------------------- .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? " .Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class" .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?" ---------------------- Partial Source Data (from website): ---------------------- <div class="logo" <a href="http://www.bestbuy.com/site/olspage.jsp? type=category&id=cat00000" name="&lid=hdr_logo"<img src="http:// images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo"/</a </div <td class="skucontent" <a href="/site/olspage.jsp?skuId=8936896&st=Transformers +Widescreen&type=product&id=1754542" class="prodlink" Transformers - Widescreen Dubbed Subtitle AC3</a<br/ ---------------------- ---------------------- ---------------------- I'm most interested in utilizing the [class="prodlink"] string as this is the tag that labels a movie URL. I know that regex in VBA can be a bit tricky owing to the use of double quotes and other non-alpha characters, but can any of you guys spot what I'm doing wrong? Thanks for your help! And here's another version that might work a bit better, depending on your specific requirements. It has no problem with embedded quotes in the URL. This uses the Replace method to get rid of everything else. ============================== Option Explicit Function MovieURL(str As String) As String Dim re As Object Set re = CreateObject("vbscript.regexp") re.Global = True re.IgnoreCase = True re.Pattern = _ "[\s\S]*<a\shref=""([\s\S]+)""\s*class=""prodlink""[\s\S]*" MovieURL = re.Replace(str, "$1") End Function ============================== --ron Well, given your data and noting that there are multiple matches, I probably wouldn't use this method. --ron |
#13
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
The expected result is:
/site/olspage.jsp?skuId=8936896&st=Transformers+Wide screen&type=product&id=1754542 Does it have to be a Regular Expression solution? This function will return the first prodlink site (your expected result above from the Partial Source Data text you posted)... Function GetURL(S As String) As String GetURL = Split(S, "class=""prodlink""", , vbTextCompare)(0) GetURL = Mid(GetURL, InStrRev(GetURL, "href=") + 6) GetURL = Replace(Replace(GetURL, vbLf, ""), vbCr, "") GetURL = Trim(Replace(GetURL, """", "")) End Function -- Rick (MVP - Excel) "Akrobrat" wrote in message ... Greetings all, I am trying to extract the URLs of a set of animated movies off various sites using regular expressions and then dump those URLs into an Excel document (via VBA). I have a decent grasp of regex but I have hit a brick wall lately with a particular site. I have experimented with a number of patterns but cannot yet get the correct result. The expected result is: /site/olspage.jsp?skuId=8936896&st=Transformers+Wide screen&type=product&id=1754542 However, if I do get a non-null result back, it is usually: http://www.bestbuy.com/site/olspage....ry&id=cat00000 ---------------------- Sample Patterns Tested: ---------------------- .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W? " .Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class" .Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?" ---------------------- Partial Source Data (from website): ---------------------- <div class="logo" <a href="http://www.bestbuy.com/site/olspage.jsp? type=category&id=cat00000" name="&lid=hdr_logo"<img src="http:// images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo"/</a </div <td class="skucontent" <a href="/site/olspage.jsp?skuId=8936896&st=Transformers +Widescreen&type=product&id=1754542" class="prodlink" Transformers - Widescreen Dubbed Subtitle AC3</a<br/ ---------------------- ---------------------- ---------------------- I'm most interested in utilizing the [class="prodlink"] string as this is the tag that labels a movie URL. I know that regex in VBA can be a bit tricky owing to the use of double quotes and other non-alpha characters, but can any of you guys spot what I'm doing wrong? Thanks for your help! |
#14
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Ron, my approach with Regular Expression was to capture the *first*
occurence of the HREF tag with the CLASS="prodlink" tag. However, you might be right in that there's more going on here than expected - no matter how much I tweak the regex, the returned value is either incorrect or null. Your pattern suggestions will be useful for other sites I hope to cover soon. Rick, I don't have to use regex for this task, but I found that it was a good solution early on in my project. Your function (with a couple of modifications to handle the full HTML source) helped me solve the problem. Thank you immensely to both of you for helping with this challenge! |
#15
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Mon, 9 Feb 2009 08:04:33 -0800 (PST), Akrobrat
wrote: Ron, my approach with Regular Expression was to capture the *first* occurence of the HREF tag with the CLASS="prodlink" tag. However, you might be right in that there's more going on here than expected - no matter how much I tweak the regex, the returned value is either incorrect or null. Your pattern suggestions will be useful for other sites I hope to cover soon. Rick, I don't have to use regex for this task, but I found that it was a good solution early on in my project. Your function (with a couple of modifications to handle the full HTML source) helped me solve the problem. Thank you immensely to both of you for helping with this challenge! Glad you got it working. But it would be interesting to figure out why the regex approach isn't working, especially since I was using the same data source. --ron |
#16
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Glad you got it working. *But it would be interesting to figure out why the
regex approach isn't working, especially since I was using the same data source. --ron Ron, I still don't know why the regex worked for you and not for me. I sit behind a large corporation's firewall and use IE 6.0 that's been slightly customized for the company's purposes. Could that have something to do with it? Outside of that, I can't imagine what could be wrong. Everything else is pretty standard - I use Win XP, Excel 2003 and the Microsoft VBScript Regular Expressions 5.5 library. Thanks again. |
#17
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Thu, 12 Feb 2009 07:07:07 -0800 (PST), Akrobrat
wrote: Glad you got it working. *But it would be interesting to figure out why the regex approach isn't working, especially since I was using the same data source. --ron Ron, I still don't know why the regex worked for you and not for me. I sit behind a large corporation's firewall and use IE 6.0 that's been slightly customized for the company's purposes. Could that have something to do with it? Outside of that, I can't imagine what could be wrong. Everything else is pretty standard - I use Win XP, Excel 2003 and the Microsoft VBScript Regular Expressions 5.5 library. Thanks again. I suppose there could be something about the data, but don't know what it is. I tried again, using mostly the code you had previously posted, with some variation to use data that had been copied to the clipboard, instead of obtained via IE, and also to print the result to the immediate window, instead of the worksheet, and it continues to work OK here. --ron |
#18
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Ron (and all),
Here's something else I found this morning that might be of interest: In the following string from an HTML doc: <p id="height10" class="price-strike2"$2,599.00/EA Each</p .... I am unable to match it (aiming for the currency bit) using the following reduced regular expressions within VBA: ..Pattern = "price-strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)" ..Pattern = "price-strike2\W?(\$[0-9]*,?[0-9]*\.?[0-9]*)" ..Pattern = "strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)" ..Pattern = "2\W?(\$[0-9]*,?[0-9]*\.?[0-9]*)" ..Pattern = "2""(\$[0-9]*,?[0-9]*\.?[0-9]*)" Any Regexp tester tells me that all of the above would match the string. Also, there is only one instance of a tag including the phrase "strike2" in the entire HTML source, so there can be, at most, only one match. The currency regexp is fine (I know this from previous tests) and the pattern preceding it seems to be the root of the problem. It is not essential that I get a solution to this, but you can imagine that this is frustrating! =) Thanks, Dilan |
#19
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Fri, 13 Feb 2009 07:14:37 -0800 (PST), Akrobrat
wrote: Ron (and all), Here's something else I found this morning that might be of interest: In the following string from an HTML doc: <p id="height10" class="price-strike2"$2,599.00/EA Each</p ... I am unable to match it (aiming for the currency bit) using the following reduced regular expressions within VBA: .Pattern = "price-strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)" .Pattern = "price-strike2\W?(\$[0-9]*,?[0-9]*\.?[0-9]*)" .Pattern = "strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)" .Pattern = "2\W?(\$[0-9]*,?[0-9]*\.?[0-9]*)" .Pattern = "2""(\$[0-9]*,?[0-9]*\.?[0-9]*)" Any Regexp tester tells me that all of the above would match the string. Also, there is only one instance of a tag including the phrase "strike2" in the entire HTML source, so there can be, at most, only one match. The currency regexp is fine (I know this from previous tests) and the pattern preceding it seems to be the root of the problem. It is not essential that I get a solution to this, but you can imagine that this is frustrating! =) Thanks, Dilan Dilan, I wonder if you are running into the same problem you did with the previous Regex you could not get to work. (What I suspect is a problem in your VBA routines). Your first regex works fine here, returning the "currency bit" into the first submatch. Here is the routine I used with your regex and sample: ============================ Option Explicit Function GetStrike2(sStr As String) Dim re As Object, mc As Object Set re = CreateObject("vbscript.regexp") re.IgnoreCase = True re.Pattern = """price-strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)" If re.test(sStr) = True Then Set mc = re.Execute(sStr) GetStrike2 = mc(0).submatches(0) End If End Function ================================ Against your sample, this returns $2,599.00 --ron |
#20
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
So do you think there's a difference between:
Dim re as Object Set re = CreateObject("vbscript.regexp") .... and ... Dim re as RegExp Set re = New RegExp ? I'll test it but I wondered if you have some insight into this. The rest of my code (the actual regexp pattern test) is the same as yours. Thanks! |
#21
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Fri, 13 Feb 2009 11:50:21 -0800 (PST), Akrobrat
wrote: So do you think there's a difference between: Dim re as Object Set re = CreateObject("vbscript.regexp") ... and ... Dim re as RegExp Set re = New RegExp ? I'll test it but I wondered if you have some insight into this. The rest of my code (the actual regexp pattern test) is the same as yours. Thanks! No, there should not be any difference so long as you have a reference set to Microsoft VBScript Regular Expressions 5.5. (That is not required with my code). Also, if you have the reference set, I would DIM mc as match collection rather than as object. For trouble shooting, I would suggest you first try my code exactly as written, all by itself (i.e. not embedded within your code). In other words, enter the code I posted all by itself in a module. Then -- A1: (please copy/paste from here) <p id="height10" class="price-strike2"$2,599.00/EA Each</p B1: =GetStrike2($A$1) And see if that works. (returns $2,599.00) If that works, and you want to use early-binding, set the Reference and use this code instead: ==================================== Option Explicit Function GetStrike2(sStr As String) Dim re As RegExp, mc As MatchCollection Set re = New RegExp re.IgnoreCase = True re.Pattern = """price-strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)" If re.test(sStr) = True Then Set mc = re.Execute(sStr) GetStrike2 = mc(0).submatches(0) End If Set re = Nothing End Function =============================== Again, use it "stand-alone" If both work, then your problem is elsewhere. --ron |
#22
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Ron and all,
I used Ron's function on a "stand-alone" basis and used a subroutine (TestGetStrike) to pass the HTML source (as string), but it didn't return anything. No errors, but no result. I haven't got a clue as to why this doesn't work. Do you think I have VB references set that are clashing? I've currently got: - Visual Basic for Applications - Microsoft Excel 11.0 Object Library - Microsoft Internet Controls - Microsoft VBScript Regular Expressions 5.5 - OLE Automation - Microsoft Office 11.0 Object Library - Microsoft Forms 2.0 Object Library I moved the RegExp and Internet Controls up in priority too. I also don't have any items in Excel that have been disabled (listed under Help About MS Excel). Any thoughts? ==================================== Option Explicit Sub TestGetStrike() Dim IE As InternetExplorer Set IE = New InternetExplorer IE.Visible = True IE.Navigate "http://www.homedepot.com/webapp/wcs/stores/servlet/ ProductDisplay? jspStoreDir=hdus&catalogId=10053&productId=1004014 83&navFlow=3&keyword=jvm1790sk" 'Load the page completely Do Until IE.ReadyState = READYSTATE_COMPLETE And IE.Busy = False Loop Range("A1").Value = GetStrike2(IE.Document.body.innerHTML) End Sub Function GetStrike2(sStr As String) Dim re As RegExp, mc As MatchCollection Set re = New RegExp re.IgnoreCase = True re.Pattern = """price-strike2""(\$[0-9]*,?[0-9]*\.?[0-9]*)" If re.Test(sStr) = True Then Set mc = re.Execute(sStr) GetStrike2 = mc(0).SubMatches(0) End If Set re = Nothing End Function ==================================== |
#23
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
Ron,
Sorry, in my frustration with this whole thing, I forgot to note that your code as a Worksheet function worked perfectly, returning the correct $ result. Therefore, this problem MUST be somewhere else, and that is what I'm trying to figure out. Thanks, Ron! - Dilan |
#24
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Mon, 16 Feb 2009 08:35:25 -0800 (PST), Akrobrat
wrote: Ron, Sorry, in my frustration with this whole thing, I forgot to note that your code as a Worksheet function worked perfectly, returning the correct $ result. Therefore, this problem MUST be somewhere else, and that is what I'm trying to figure out. Thanks, Ron! - Dilan We're making progress. Where I would suggest you check next is to ensure that the HTML derived string being passed to the VBA function is, in fact, what it should be. One way to do that would be to use the code I provided as a function. So you might have something like: =========================== Sub Dilan1() ....get the HTML string and assign it to strPageContent strURL = getStrike2(strPageContent) debug.print "result = " & strURL end sub ==================== If the immediate window only prints result = , then it is likely the case that strPageContent does not contain the relevant HTML substring. If the immediate window prints the correct result, then your problem is downstream. --ron |
#25
![]()
Posted to microsoft.public.excel.programming
|
|||
|
|||
![]()
On Mon, 16 Feb 2009 08:30:38 -0800 (PST), Akrobrat
wrote: - Microsoft Internet Controls By the way, where did you find this? I cannot locate a reference by that name in my Tools/References dialog box. --ron |
Reply |
Thread Tools | Search this Thread |
Display Modes | |
|
|
![]() |
||||
Thread | Forum | |||
Get rid of with regular expressions | Excel Discussion (Misc queries) | |||
Regular expressions | Excel Programming | |||
Using Regular Expressions with VBA | Excel Programming | |||
Regular expressions | Excel Programming | |||
VBA and Regular expressions | Excel Programming |