#1
Posted to microsoft.public.excel.programming,microsoft.public.excel
So what you also want is the linked file (web page) the image or part# links to!

Here's what I got from https://www.nsncenter.com/NSN/5960-00-831-8683 (pg4):

1st occurrence of <a href="/NSN/5960 is at line 7878;
1st occurrence of (MCRL) is at line 7931;
1st occurrence after that of <a href="/PartNumber" is this at line 7951:

    <td align="center" style="vertical-align: middle;"><a href="/PartNumber/GV4S1400">GV4S1400</a></td>

and the next 3 lines are:

    <td style="width: 125px; height: 60px; vertical-align: middle;" align="center" nowrap> <a href="/CAGE/63060">63060</a> </td>
    <td align="center" style="vertical-align: middle;"> <a href="/CAGE/63060"><img class="img-thumbnail" src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45 width=90 /></a> </td>
    <td text-align="center" style="vertical-align: middle;"><a title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</a></td>

So you want to go to the next page linked to and repeat the process?

At this point my Excel sheet has been modified as follows:

    Source | NSN Item# | Description | Part# | MCRL#
    Tektronix | 5960-00-831-8683 | ELECTRON TUBE | GV4S1400 | 4932653
    <a href="/CAGE/63060">63060</a>
    <a href="/CAGE/63060"><img class="img-thumbnail" src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45 width=90 /></a>
    <a title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</a>
    General Dynamics | 5960-00-853-8207 | ELECTRON TUBE | 295-29434 | 5074477
    line1
    line2
    line3

...and so on.

So far, I'm working with text files and so I'm inclined to append each item to a file named "ElectronTube_NSN5960.txt". File contents for the 2 items above would be structured so the 1st line contains headings (data fields) so it works properly with ADODB. (Note that I use a comma as delimiter, and the file does not contain any blank lines)...

    Source,NSN Item#,Description,Part#,MCRL#
    Tektronix,5960-00-831-8683,ELECTRON TUBE,GV4S1400,4932653
    <a href="/CAGE/63060">63060</a>
    <a href="/CAGE/63060"><img class="img-thumbnail" src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45 width=90 /></a>
    <a title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</a>
    General Dynamics,5960-00-853-8207,ELECTRON TUBE,295-29434,5074477
    <a href="/CAGE/1VPW8">1VPW8</a>
    <a href="/CAGE/1VPW8"><img class="img-thumbnail" src="https://az774353.vo.msecnd.net/cage/90/1vpw8.jpg" alt="CAGE 1VPW8" height=45 width=90 /></a>
    <a title="CAGE 1VPW8" href="/CAGE/1VPW8">GENERAL DYNAMICS C4 SYSTEMS, INC.</a>

...where I have parsed off the CSS formatting text and html tags outside <a...</a> from the 3 lines. I'd likely convert the UCase to proper case as well.

The file size is 653 bytes, meaning a full page would be about 4kb; 1000 pages being about 4mb. That's 44 lines per page after the fields line. A file this size can be easily handled via ADO recordset or std VB file I/O functions/methods. Loading into an array (vData) puts fields in vData(0) and records starting at vData(1), and looping would Step 4.

I really don't have the time/energy (I have Lou Gehrig's) to get any more involved with your project due to current commitments. I just felt it might be worth explaining how I'd handle your task in the hopes it would be helpful to you reaching a viable solution.

I bid you good wishes going forward...

--
Garry
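The "fields in vData(0), records from vData(1), Step 4" loop described in the post can be sketched roughly as follows. This is only an illustration, not the poster's actual code: `ReadTextFile`, `ListRecords` and the `C:\Temp` path are made-up names, and it assumes the file uses CRLF line breaks and the four-lines-per-record layout shown in the post.

```vba
Option Explicit

' Hypothetical helper: read a whole text file into one string
' using standard VB file I/O.
Function ReadTextFile(ByVal sFile As String) As String
    Dim iNum As Integer
    iNum = FreeFile
    Open sFile For Binary Access Read As #iNum
    ReadTextFile = Space$(LOF(iNum))   ' size the buffer to the file length
    Get #iNum, , ReadTextFile          ' fill the buffer in one read
    Close #iNum
End Function

Sub ListRecords()
    Dim vData As Variant, n As Long

    ' Assumed path; vData(0) holds the field names,
    ' each record occupies the next 4 lines.
    vData = Split(ReadTextFile("C:\Temp\ElectronTube_NSN5960.txt"), vbCrLf)

    ' Stop 3 short of the end so n + 3 never runs past the array,
    ' even if the file ends with a trailing line break.
    For n = 1 To UBound(vData) - 3 Step 4
        Debug.Print vData(n)        ' Source, NSN Item#, Description, Part#, MCRL# values
        Debug.Print vData(n + 1)    ' CAGE code link
        Debug.Print vData(n + 2)    ' CAGE image link
        Debug.Print vData(n + 3)    ' CAGE company link
    Next n
End Sub
```

Reading the file in one shot and splitting it keeps the loop simple; at the roughly 4mb estimated in the post this should fit comfortably in memory.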
#2
Thanks for the guide.

You are getting all of the right stuff from what I would call the second file. The first file is "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" & PageNum where PageNum (in ASCII) goes from "1" to "999". Note the (implied?) space in the URL. I think that by now you have it all figured out.

In snooping around, I have just stumbled on the ADODB scheme, and from what little I have found so far it looks very promising. Only one example, which does not work (examples NEVER work), and zero explanations so far. It seems that with the proper code, ADODB would allow me to copy those first files to a HD.

Would you be so kind as to share your working ADODB code? Or did you hand-copy the source like I did?

Thanks again.
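Building the "first file" URL for each page can be sketched as below. The function and sub names are hypothetical. Note that `%20` is already the URL-encoded form of a space, so the string itself contains no literal space, and that `CStr` (unlike `Str$`, or `STR$` in QBASIC) does not put a leading space in front of a positive number.

```vba
Option Explicit

' Build the search URL for one result page (1 to 999).
Function SearchPageUrl(ByVal PageNum As Long) As String
    Const BASE_URL As String = _
        "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber="
    SearchPageUrl = BASE_URL & CStr(PageNum)   ' CStr adds no leading space
End Function

Sub DemoSearchPageUrl()
    Dim PageNum As Long
    ' Print the first three page URLs to the Immediate window
    For PageNum = 1 To 3
        Debug.Print SearchPageUrl(PageNum)
    Next PageNum
End Sub
```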
#3
> You are getting all of the right stuff from what i would call the second file. The first file is "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" & PageNum where PageNum (in ASCII) goes from "1" to "999". Note the (implied?) space in the URL.

I got Source, NSN Part#, Description from the 1st file. The NSN Item# links to the 2nd file.

<snip>

> Would you be so kind as to share your working ADODB code? Or did you hand-copy the source like i did?

I use std VB file I/O not ADODB. Initial procedure was to copy/paste page source into Textpad and save as Tmp.txt. Then load the file into an array and parse from there.

I thought I'd take a look at going with a userform and MS Web Browser control for more flexible programming opts, but haven't had the time. I assume this would definitely give you an advantage over trying to automate IE, but I need to research using it. I do have URL functions built into my fpSpread.ocx for doing this stuff, but that's an expensive 3rd party AX component. Otherwise, doing this from Excel isn't something I'm familiar with.

--
Garry
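Once the page source is in an array, the "1st occurrence of ... after that" searches can be done with `InStr` over the lines. The sketch below is an assumption about how that might look, not the poster's code: `FindLine` is a hypothetical helper, and a three-line stand-in replaces the real Tmp.txt contents so the demo runs without any file.

```vba
Option Explicit

' Return the index of the first element at or after lStart that
' contains sFind, or -1 if there is none.
Function FindLine(vLines As Variant, ByVal sFind As String, _
                  Optional ByVal lStart As Long = 0) As Long
    Dim n As Long
    FindLine = -1
    For n = lStart To UBound(vLines)
        If InStr(1, vLines(n), sFind, vbTextCompare) > 0 Then
            FindLine = n
            Exit Function
        End If
    Next n
End Function

Sub DemoFindLine()
    Dim vPgSrc As Variant, lMcrl As Long, lPart As Long

    ' Stand-in for page source loaded from Tmp.txt and Split() into lines
    vPgSrc = Split("<a href=""/NSN/5960-00-831-8683"">|(MCRL)|" & _
                   "<a href=""/PartNumber/GV4S1400"">GV4S1400</a>", "|")

    lMcrl = FindLine(vPgSrc, "(MCRL)")
    ' Search for the part number link only after the (MCRL) line
    lPart = FindLine(vPgSrc, "<a href=""/PartNumber", lMcrl + 1)

    Debug.Print lMcrl, lPart   ' prints 1 and 2 (zero-based indexes)
End Sub
```

The indexes are zero-based because `Split` returns a zero-based array, so they are one less than the line numbers a text editor would show.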
#4
> I thought I'd take a look at going with a userform and MS Web Browser control for more flexible programming opts

While I'm on pause waiting to consult with client on my current project...

This is doable. I have a userform with web browser, a textbox, and some buttons. The Web Browser doesn't display AddressBar/StatusBar for some reason, even though these props are set 'True'. (Initial URL (pg1) is hard-coded as a result.)

You navigate to parent pages using Next/Last buttons, starting with pg1 on load. Optionally, you can enter a page# in a GoTo box.

The browser lets you select links, and you load its current document page source into txtViewSrc via btnViewSrc. This action also Splits() page source into vPgSrc for locating search text selected in cboSearchTxt.

The cboSearchTxt_Change event auto-locates your desired lines at present, but I will have it appending them to file shortly. This file will be structured same as illustrated earlier. I think this could be fully automated after I see how the links are defined in their innerHTML. For now, I'll provide a button to write found lines because it gives an opportunity to preview the data going into your file. This will happen via loading found lines into vaLinesOut() which is sized 0 to 3. This will make the search sequence important so the output file has its lines in the correct order (top-to-bottom in page source).

I use my own file read/write procedures because they're configured for large amounts of data in 1 shot to/from dbase.txt files, and so are included in the userform class.

While there's still a manual element to this, it's going to be orders of magnitude less daunting and more efficient than what you do now. It seems highly likely over time that this entire task can be fully automated just by entering the URL for pg1!

--
Garry
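Filling a four-element `vaLinesOut()` with the found lines, stripped down to their `<a ...>...</a>` part, might look like the sketch below. `ExtractAnchor` and `DemoLinesOut` are hypothetical names, and the `<td>` attributes are shortened for the demo; the userform's actual procedures were not posted in this thread.

```vba
Option Explicit

' Return the <a ...>...</a> part of a line, or "" if the line has no anchor.
Function ExtractAnchor(ByVal sLine As String) As String
    Dim lOpen As Long, lClose As Long
    lOpen = InStr(1, sLine, "<a ", vbTextCompare)
    If lOpen = 0 Then Exit Function
    lClose = InStr(lOpen, sLine, "</a>", vbTextCompare)
    If lClose = 0 Then Exit Function
    ExtractAnchor = Mid$(sLine, lOpen, lClose - lOpen + Len("</a>"))
End Function

Sub DemoLinesOut()
    Dim vaLinesOut(0 To 3) As String

    ' Element 0 is the record line; 1 to 3 are the links, in page order
    vaLinesOut(0) = "Tektronix,5960-00-831-8683,ELECTRON TUBE,GV4S1400,4932653"
    vaLinesOut(1) = ExtractAnchor( _
        "<td align=""center"" nowrap> <a href=""/CAGE/63060"">63060</a> </td>")
    vaLinesOut(2) = ExtractAnchor( _
        "<td> <a href=""/CAGE/63060""><img src=""x.jpg"" /></a> </td>")
    vaLinesOut(3) = ExtractAnchor( _
        "<td><a title=""CAGE 63060"" href=""/CAGE/63060"">HEICO OHMITE LLC</a></td>")

    ' Preview the four lines before they are appended to the output file
    Debug.Print Join(vaLinesOut, vbCrLf)
End Sub
```

Filling the elements in a fixed order is what keeps the output lines in top-to-bottom page order, as the post points out.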
#5
GS wrote:
> It seems highly likely over time that this entire task can be fully automated just by entering the URL for pg1!

Way beyond me. If in HTML one can copy an <a href="..."> to the hard drive, then that is all I need.
#6
> Way beyond me. If in HTML one can copy an <a href="..."> to the hard drive, then that is all i need.

Just another approach, since you seem to be having difficulty getting URLDownloadToFile() to work.

My approach reads innerHTML of web pages and outputs to txt file. Not sure why you want to grab html and save to disc given the file size is a concern. My approach puts parsed data from all 999 pages into a txt file less than 4mb in size.

Once the individual steps have been optimized, automating the entire process will be easy. (I'll leave that part for you to do however you want it to work.)

I will post the contents of my fParseWebPages.frm file. You will need to set a ref to the Microsoft Web Browser to use it.

--
Garry
#7
GS wrote:
> Just another approach, since you seem to be having difficulty getting URLDownloadToFile() to work.

* Snooped around the Excel VB help to find alternate ways (I remember seeing other ways). Here was the first step:

    SRC1$ = "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=5" ''test..
    With Worksheets(1)
        .Hyperlinks.Add .Range("E5"), SRC1$
    End With

** That places the URL in cell E5 and it looked good. Clicking once on it "to follow", I get error message "Unable to open...Cannot download the information you requested."

This may have the same 2 problems, the (implied) space, and the https. So, to test those, I added a folder on one of my websites: http://www.oil4lessllc.com/Try%20.space/ ...and that works. Therefore, this scheme is one better than the function.

Next, I tested the https with: https://duckduckgo.com/ . I get that error message. So I am still dead in the water.

> My approach reads innerHTML of web pages and outputs to txt file.

* THAT is what I cannot do; AFAIK neither HTML5 nor javascript can write to a file.

> Not sure why you want to grab html and save to disc given the file size is a concern.

* Well, when I first started, I got a bit scared because of those 7750 (appx) lines; first time I saw a web site that huge. Now that I have processed 30 pages, I know better.

> My approach puts parsed data from all 999 pages into a txt file less than 4mb in size. Once the individual steps have been optimized, automating the entire process will be easy. (I'll leave that part for you to do however you want it to work)

* So far, in that direction, I have:

    <header>
    <a href="https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=5" id="NSN5860"> hello<BR></a>
    <script>
    var T = document.anchors;
    document.getElementById("NSN5860").innerHTML = T;
    // open an output stream in a new window
    document.open("text/html",replace);
    document.writeln(T+"|<BR>");
    // display in new window
    document.close();
    </script>
    <!-- above gets [object HTMLCollection] on new screen/window, where that/object item is the href pointer -->

> I will post the contents of my fParseWebPages.frm file. You will need to set a ref to the Microsoft Web Browser to use it.
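For "copying" a page's source to the hard drive from VBA, one commonly used route that neither poster mentions is the late-bound `MSXML2.XMLHTTP` object combined with ordinary file I/O. The sketch below is offered under assumptions: the function names and the `C:\Temp` path are made up, whether https works depends on the TLS support of the Windows installation, and the site may serve scripted requests differently from a browser, so it should be tested on one page first.

```vba
Option Explicit

' Fetch a page over http/https and return its source, or "" on failure.
Function GetPageSource(ByVal sUrl As String) As String
    Dim oHttp As Object
    On Error GoTo ErrExit
    Set oHttp = CreateObject("MSXML2.XMLHTTP")   ' late bound, no reference needed
    oHttp.Open "GET", sUrl, False                ' False = wait for the response
    oHttp.send
    If oHttp.Status = 200 Then GetPageSource = oHttp.responseText
ErrExit:
End Function

' Write text to a file with standard VB file I/O.
Sub WriteTextFile(ByVal sText As String, ByVal sFile As String)
    Dim iNum As Integer
    iNum = FreeFile
    Open sFile For Output As #iNum
    Print #iNum, sText;      ' trailing semicolon: no extra line break
    Close #iNum
End Sub

Sub SavePage5()
    Dim sSrc As String
    sSrc = GetPageSource( _
        "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=5")
    ' Assumed output path; the folder must already exist
    If Len(sSrc) > 0 Then WriteTextFile sSrc, "C:\Temp\Tmp.txt"
End Sub
```

Because the source comes back as a string, it could also be passed straight to `Split()` and parsed without saving the full page at all, which is closer to the approach GS describes.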
#8
GS wrote:
> I use std VB file I/O not ADODB. Initial procedure was to copy/paste page source into Textpad and save as Tmp.txt. Then load the file into an array and parse from there.

Check.

I know QBASIC fairly well, so a lot of that knowledge crosses over to VB. Someone here was kind enough to give me a full working program that can be used to copy a URL source to a temp file on the HD. Once available, all else is very simple and straightforward.

The rub is that the function (or something it uses) does not allow a space in the URL, AND also does not allow https. So, I need two work-arounds, and the https part would seem to be the worst. I do not know how it works, what DLLs/libraries it calls; no useful information is available. It is:

    Declare Function URLDownloadToFile Lib "urlmon" _
        Alias "URLDownloadToFileA" (ByVal pCaller As Long, _
        ByVal szURL As String, ByVal szFileName As String, _
        ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

Only the well-known keywords can be found; 'urlmon', 'pCaller', 'szURL', and 'szFileName' are unknowns and not findable in the so-called VB help. And there are no examples; the few ranDUMB ones are incomplete and/or do not work.

I do not see how you use std VB file I/O; AFAIK one cannot open a web page as if it was a file.
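The names in that Declare are not VB keywords, which is why VB help does not list them. `urlmon` is a Windows system DLL (urlmon.dll), and the rest are parameter names from the Windows API documentation: `pCaller` is a pointer to a calling ActiveX control (0 when there is none), `szURL` is the address to download, `szFileName` is the local file to create, `dwReserved` must be 0, and `lpfnCB` is an optional callback pointer (0 when not needed). The function returns 0 on success. A minimal usage sketch follows; `DownloadUrl`, `DemoDownload` and the `C:\Temp` path are hypothetical, and whether a given https URL succeeds depends on the machine's Internet settings, so that part is an open question rather than a promise.

```vba
Option Explicit

#If VBA7 Then
    ' Office 2010 and later (required form for 64-bit Office)
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
        Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, _
        ByVal szURL As String, ByVal szFileName As String, _
        ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
    ' Older Office versions and VB6
    Private Declare Function URLDownloadToFile Lib "urlmon" _
        Alias "URLDownloadToFileA" (ByVal pCaller As Long, _
        ByVal szURL As String, ByVal szFileName As String, _
        ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If

' True if the URL was saved to sFile (the API returns 0 on success).
Function DownloadUrl(ByVal sUrl As String, ByVal sFile As String) As Boolean
    DownloadUrl = (URLDownloadToFile(0, sUrl, sFile, 0, 0) = 0)
End Function

Sub DemoDownload()
    ' Assumed output path; the folder must already exist
    Debug.Print DownloadUrl( _
        "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=5", _
        "C:\Temp\Tmp.txt")
End Sub
```

Checking the return value rather than assuming success makes it easier to tell a failed download from a parsing problem later on.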
Thread | Forum | |||
EOF Parse Text file | Excel Programming | |||
Parse a txt file and save as csv? | Excel Programming | |||
parse from txt file | Excel Programming | |||
Parse File Location | Excel Worksheet Functions | |||
REQ: Simplest way to parse (read) HTML formatted data in via Excel VBA (or VB6) | Excel Programming |