#1
Posted to microsoft.public.excel.programming,microsoft.public.excel
Auric__ wrote:
Robert Baer wrote: Auric__ wrote: Robert Baer wrote:

And assuming a fix, what can i do about the OPEN command/syntax?

// What i did in Excel:
S$ = "D:\Website\Send .Hot\****"
tmp = Environ("TEMP") & "\" & S$

The contents of the variable S$ at this point:

S$ = "C:\Users\auric\D:\Website\Send .Hot\****"

Do you see the problem? Also, as Garry pointed out, cleanup should happen automatically. The "Kill" keyword deletes files. Try this code: [snip]

Grumble.. do not understand well enough to get working. Now i do not know what i had that fully worked with the gTX.htm file. The following "almost" works; it fails on the open.

You know, one of us is confused, and I'm not entirely sure it isn't me. I've given you (theoretically) working code twice now, and yet you insist on making some pretty radical changes that DON'T ****ING WORK! So, let's step back from the coding for a moment, and let's have you explain ***EXACTLY*** what it is you want done. Give examples like, "given data X, I want to do Y, with result Z." Unless I get a clearer explanation of what you're trying to do, I'm done with this thread.

I wish to read and parse every page of "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" where the page number goes from 5 to 999.

On each page, find "<a href="/NSN/5960 [it is longer, but that is the start]. Given the full number (eg: <a href="/NSN/5960-00-831-8683"), open a new related page "https://www.nsncenter.com/NSN/5960-00-831-8683" and find the line ending "(MCRL)". Read about 4 lines to <a href="/PartNumber/ which is <a href="/PartNumber/GV4S1400" in this case. Save/write that line plus the next three; close this secondary online URL and step to the next "<a href="/NSN/5960 to process the same way. Continue to the end of the page, close that URL and open the next page.

Crude code:

CLOSE
' PRS5960.BAS (QuickBasic)
' watch linewrap below..
SRC1$ = "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber="
SRC2$ = "https://www.nsncenter.com/NSN/5960"  'example only
FSC$ = "/NSN/5960"
OPEN "FSC5960.TXT" FOR APPEND AS #9
' Let page number run from 05 to 39 to read existing files
FOR PG = 5 TO 39
  A$ = ""
  FPG$ = RIGHT$("0" + MID$(STR$(PG), 2), 2)
  ' These files, FPG$ + ".txt", are copies from the web
  OPEN FPG$ + ".txt" FOR INPUT AS #1
  ON ERROR GOTO END1
  PRINT FPG$ + ".txt",  'is screen note to me
  WHILE NOT EOF(1)
    WHILE INSTR(A$, FSC$) = 0  'skip 7765 lines of junk
      LINE INPUT #1, A$  'look for <a href="/NSN/5960-00-754-5782" Class= ETC
    WEND
    P = INSTR(A$, FSC$) + 9: FPG2$ = SRC2$ + MID$(A$, P, 12)
    NSN$ = "5960" + MID$(A$, P, 12)
    PRINT NSN$  'is screen note to me
    AHREF$ = ".." + FSC$ + MID$(A$, P, 12)
    'Need URL FPG2$ or .. a href to get balance of data
    ' See comments above this program
    PRINT #9, NSN$
    LINE INPUT #1, A$
  WEND
END1: RESUME LAB
LAB: CLOSE #1
NEXT PG
CLOSE
SYSTEM

** Note the Function URLDownloadToFile does not allow spaces; there is one in the "page" URL. Problem #2: the Function URLDownloadToFile does not allow https website URLs. Other than those problems, i have everything else working fine.
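Since both noted problems are in the download step, here is a hedged sketch of how that step might look in Excel VBA, using the URLDownloadToFile declaration quoted later in this thread. The space in the query is pre-encoded as %20 (a literal space makes the call fail); whether https URLs succeed depends on the urlmon/IE version installed, so treat this as a sketch under those assumptions, with a made-up destination path:

```vb
' 32-bit VBA declaration, as quoted elsewhere in this thread
Private Declare Function URLDownloadToFile Lib "urlmon" _
    Alias "URLDownloadToFileA" (ByVal pCaller As Long, _
    ByVal szURL As String, ByVal szFileName As String, _
    ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

Sub SaveParentPage(ByVal PageNum As Long)
    Dim url As String, dest As String
    ' %20 instead of the raw space between "5960" and "regulator"
    url = "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" & PageNum
    dest = Environ$("TEMP") & "\" & Format$(PageNum, "00") & ".txt"
    If URLDownloadToFile(0, url, dest, 0, 0) <> 0 Then  ' 0 = S_OK
        Debug.Print "download failed for page "; PageNum
    End If
End Sub
```

The saved files would then feed the same per-page parse loop as the QuickBasic program above.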
#2
So what you also want is the linked file (web page) the image or part#
links to! Here's what I got from https://www.nsncenter.com/NSN/5960-00-831-8683 (pg4):

1st occurrence of <a href="/NSN/5960 is at line 7878;
1st occurrence of (MCRL) is at line 7931;
1st occurrence after that of <a href="/PartNumber" is this, at line 7951:

  <td align="center" style="vertical-align: middle;"><a href="/PartNumber/GV4S1400">GV4S1400</a></td>

and the next 3 lines:

  <td style="width: 125px; height: 60px; vertical-align: middle;" align="center" nowrap><a href="/CAGE/63060">63060</a></td>
  <td align="center" style="vertical-align: middle;"><a href="/CAGE/63060"><img class="img-thumbnail" src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45 width=90 /></a></td>
  <td text-align="center" style="vertical-align: middle;"><a title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</a></td>

So you want to go to the next page linked to and repeat the process?

At this point my Excel sheet has been modified as follows:

  Source | NSN Item# | Description | Part# | MCRL#
  Tektronix | 5960-00-831-8683 | ELECTRON TUBE | GV4S1400 | 4932653
    <a href="/CAGE/63060">63060</a>
    <a href="/CAGE/63060"><img class="img-thumbnail" src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45 width=90 /></a>
    <a title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</a>
  General Dynamics | 5960-00-853-8207 | ELECTRON TUBE | 295-29434 | 5074477
    line1
    line2
    line3

...and so on. So far, I'm working with text files, and so I'm inclined to append each item to a file named "ElectronTube_NSN5960.txt". File contents for the 2 items above would be structured so the 1st line contains headings (data fields) so it works properly with ADODB. (Note that I use a comma as delimiter, and the file does not contain any blank lines)...
  Source,NSN Item#,Description,Part#,MCRL#
  Tektronix,5960-00-831-8683,ELECTRON TUBE,GV4S1400,4932653
  <a href="/CAGE/63060">63060</a>
  <a href="/CAGE/63060"><img class="img-thumbnail" src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45 width=90 /></a>
  <a title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</a>
  General Dynamics,5960-00-853-8207,ELECTRON TUBE,295-29434,5074477
  <a href="/CAGE/1VPW8">1VPW8</a>
  <a href="/CAGE/1VPW8"><img class="img-thumbnail" src="https://az774353.vo.msecnd.net/cage/90/1vpw8.jpg" alt="CAGE 1VPW8" height=45 width=90 /></a>
  <a title="CAGE 1VPW8" href="/CAGE/1VPW8">GENERAL DYNAMICS C4 SYSTEMS, INC.</a>

...where I have parsed off the CSS formatting text and html tags outside <a...</a> from the 3 lines. I'd likely convert the UCase to proper case as well.

The file size is 653 bytes, meaning a full page would be about 4kb, 1000 pages being about 4mb. That's 44 lines per page after the fields line. A file this size can be easily handled via ADO recordset or std VB file I/O functions/methods. Loading into an array (vData) puts fields in vData(0) and records starting at vData(1), and looping would Step 4.

I really don't have the time/energy (I have Lou Gehrig's) to get any more involved with your project due to current commitments. I just felt it might be worth explaining how I'd handle your task in the hopes it would be helpful to you reaching a viable solution. I bid you good wishes going forward...

--
Garry
Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup! comp.lang.basic.visual.misc microsoft.public.vb.general.discussion

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
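A minimal sketch of the read-back Garry describes (the vData name, the filename, and the Step-4 walk are from his description; the procedure itself is my illustration, not his code): load the whole file in one shot, Split it into an array so the fields line sits in vData(0), then step through records 4 lines at a time (1 CSV line + 3 anchor lines per item).

```vb
' Sketch: walk ElectronTube_NSN5960.txt 4 lines per item, as described.
Sub WalkItems()
    Dim f As Integer, s As String, vData As Variant, i As Long
    f = FreeFile
    Open "ElectronTube_NSN5960.txt" For Input As #f
    s = Input$(LOF(f), f)        ' whole file in one shot
    Close #f
    vData = Split(s, vbCrLf)     ' vData(0) = field headings
    For i = 1 To UBound(vData) Step 4
        Debug.Print vData(i)     ' the CSV record for this item
        ' vData(i + 1) to vData(i + 3) hold the three <a ...> lines
    Next i
End Sub
```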
#3
GS wrote:
<snip>

Thanks for the guide. You are getting all of the right stuff from what i would call the second file. The first file is "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" & PageNum, where PageNum (in ASCII) goes from "1" to "999". Note the (implied?) space in the URL. I think that by now you have it all figured out.

In snooping around, i have just stumbled on the ADODB scheme, and what little i have found so far looks very promising. Only one example, which does not work (examples NEVER work), and zero explanations so far.
It seems that, with the proper code, ADODB would allow me to copy those first files to a HD. Would you be so kind as to share your working ADODB code? Or did you hand-copy the source like i did? Thanks again.
#4
You are getting all of the right stuff from what i would call the
second file. The first file is "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" & PageNum, where PageNum (in ASCII) goes from "1" to "999". Note the (implied?) space in the URL.

I got Source, NSN Part#, Description from the 1st file. The NSN Item# links to the 2nd file.

<snip>

Would you be so kind as to share your working ADODB code? Or did you hand-copy the source like i did?

I use std VB file I/O, not ADODB. Initial procedure was to copy/paste page source into Textpad and save as Tmp.txt, then load the file into an array and parse from there.

I thought I'd take a look at going with a userform and MS Web Browser control for more flexible programming opts, but haven't had the time. I assume this would definitely give you an advantage over trying to automate IE, but I need to research using it. I do have URL functions built into my fpSpread.ocx for doing this stuff, but that's an expensive 3rd party AX component. Otherwise, doing this from Excel isn't something I'm familiar with.

--
Garry
#5
I thought I'd take a look at going with a userform and MS Web Browser
control for more flexible programming opts

While I'm on pause waiting to consult with the client on my current project... This is doable; I have a userform with a web browser, a textbox, and some buttons.

The Web Browser doesn't display AddressBar/StatusBar for some reason, even though these props are set 'True'. (The initial URL (pg1) is hard-coded as a result.) You navigate to parent pages using Next/Last buttons, starting with pg1 on load. Optionally, you can enter a page# in a GoTo box.

The browser lets you select links, and you load its current document page source into txtViewSrc via btnViewSrc. This action also Splits() the page source into vPgSrc for locating search text selected in cboSearchTxt. The cboSearchTxt_Change event auto-locates your desired lines at present, but I will have it appending them to file shortly. The file will be structured the same as illustrated earlier. I think this could be fully automated after I see how the links are defined in their innerHTML.

For now, I'll provide a button to write found lines, because it gives an opportunity to preview the data going into your file. This will happen via loading found lines into vaLinesOut(), which is sized 0 to 3. This makes the search sequence important so the output file has its lines in the correct order (top-to-bottom in page source).

I use my own file read/write procedures because they're configured for large amounts of data in 1 shot to/from dbase.txt files, and so are included in the userform class.

While there's still a manual element to this, it's going to be orders of magnitude less daunting and more efficient than what you do now. It seems highly likely that, over time, this entire task can be fully automated just by entering the URL for pg1!

--
Garry
#6
GS wrote:
<snip>

Way beyond me. If in HTML one can copy an <a href="..." to the hard drive, then that is all i need.
#7
GS wrote:
<snip>

Way beyond me. If in HTML one can copy an <a href="..." to the hard drive, then that is all i need.
Just another approach, since you seem to be having difficulty getting URLDownloadToFile() to work. My approach reads the innerHTML of web pages and outputs it to a txt file.

Not sure why you want to grab html and save it to disc, given that file size is a concern. My approach puts parsed data from all 999 pages into a txt file less than 4mb in size. Once the individual steps have been optimized, automating the entire process will be easy. (I'll leave that part for you to do however you want it to work.)

I will post the contents of my fParseWebPages.frm file. You will need to set a ref to the Microsoft Web Browser to use it.

--
Garry
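Garry's fParseWebPages.frm isn't reproduced in this thread, but the core move he describes (reading a loaded page's innerHTML from a Web Browser control and writing it to a text file) might look roughly like this. The btnViewSrc and Tmp.txt names come from his earlier posts; the wbMain control name and the output path are placeholders of mine, not his code:

```vb
' Sketch: dump the browser control's current document source to a file.
' Assumes a userform with a WebBrowser control named wbMain that has
' finished navigating (ReadyState 4 = READYSTATE_COMPLETE).
Private Sub btnViewSrc_Click()
    Dim f As Integer
    If wbMain.ReadyState <> 4 Then Exit Sub      ' page not loaded yet
    f = FreeFile
    Open Environ$("TEMP") & "\Tmp.txt" For Output As #f
    Print #f, wbMain.Document.body.innerHTML     ' page source as rendered
    Close #f
End Sub
```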
#8
GS wrote:
<snip>

Check. I know QBASIC fairly well, so a lot of that knowledge crosses over to VB. Someone here was kind enough to give me a full working program that can be used to copy a URL source to a temp file on the HD. Once available, all else is very simple and straightforward.

The rub is that function (or something it uses) does not allow a space in the URL, AND also does not allow https. So i need two work-arounds, and the https part would seem to be the worst. I do not know how it works, or what DLLs/libraries it calls; no useful information is available. It is:

  Declare Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As Long, _
      ByVal szURL As String, ByVal szFileName As String, _
      ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

Only the well-known keywords can be found; 'urlmon', 'pCaller', 'szURL', and 'szFileName' are unknowns and not findable in the so-called VB help.
And there are no examples; the few ranDUMB ones are incomplete and/or do not work. I do not see how you use std VB file I/O; AFAIK one cannot open a web page as if it were a file.
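For what it's worth, the parameters here are part of the Windows urlmon API rather than VB itself: pCaller and lpfnCB can be passed as 0 for simple use, dwReserved must be 0, and the return value is 0 (S_OK) on success. Since URLDownloadToFile is the sticking point, here is a hedged sketch of an alternative fetch via MSXML2.ServerXMLHTTP, which can retrieve https pages; the procedure name and output path (reusing the "05.txt" naming from the QuickBasic program) are my own examples, not code from the thread:

```vb
' Sketch only: fetch an https page with MSXML and write the source to
' disk, bypassing URLDownloadToFile. Assumes MSXML 6.0 is installed.
Sub FetchPageSource()
    Dim http As Object, f As Integer
    Dim url As String
    ' the space in the query is already percent-encoded as %20
    url = "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=5"
    Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
    http.Open "GET", url, False          ' synchronous request
    http.Send
    If http.Status = 200 Then
        f = FreeFile
        Open Environ$("TEMP") & "\05.txt" For Output As #f
        Print #f, http.responseText      ' whole page source
        Close #f
    End If
End Sub
```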
#9
I wish to read and parse every page of "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" where the page number goes from 5 to 999.

<snip>

Robert,

Here's what I have after parsing 'parent' pages for a list of its links:

  N/A
  5960-00-503-9529
  5960-00-504-8401
  5960-01-035-3901
  5960-01-029-2766
  5960-00-617-4105
  5960-00-729-5602
  5960-00-826-1280
  5960-00-754-5316
  5960-00-962-5391
  5960-00-944-4671

This is pg1, where the 1st link doesn't contain "5960" and so will be ignored. Each link's text is appended to this URL to bring up its 'child' pg:

  https://www.nsncenter.com/NSN/

Each child page is parsed for the following 4 lines:

  <TD style="VERTICAL-ALIGN: middle" align=center><A href="/PartNumber/GV3S2800">GV3S2800</A></TD>
  <TD style="HEIGHT: 60px; WIDTH: 125px; VERTICAL-ALIGN: middle" noWrap align=center><A href="/CAGE/63060">63060</A></TD>
  <TD style="VERTICAL-ALIGN: middle" align=center><A href="/CAGE/63060"><IMG class=img-thumbnail src="https://placehold.it/90x45?text=No%0DImage%0DYet" width=90 height=45></A></TD>
  <TD style="VERTICAL-ALIGN: middle" text-align="center"><A title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</A></TD>

I'm stripping html syntax to get this data:

  Line1: PartNumber/GV3S2800
  Line2: CAGE/63060
  Line3: https://placehold.it/90x45?text=No%0DImage%0DYet
  Line4: HEICO OHMITE LLC

The output file has these filenames in the 1st line:

  NSN Item#,Description,Part#,MCRL,CAGE,Source

I left the 3rd line URL out since, outside its host webpage, it'll be useless to you. I need to know from you if the 3rd line URL is needed! Otherwise, the output file will have 1 line per item so it can be used as the db file "NSN_5960_ElectronTube.dat". I invite your suggestion for filename...

I could extend the collected data to include...

  Reference Number/DRN_3570
  Entity Code/DRN_9250
  Category Code/DRN_2910
  Variation Code/DRN_4780

...where the fieldnames would then be:

  Item#,Part#,MCRL,CAGE,Source,Ref,Entity,Category,Variation

The 1st record will be:

  5960-00-503-9529,GV3S2800,3302008,63060,HEICO OHMITE LLC,DRN_3570,DRN_9250,DRN_2910,DRN_4780

Output file size for 1 parent pg is 1Kb; for 10 parent pgs is 10Kb. You could have 1000 parent pgs of data stored in a 1Mb file.

Your feedback is appreciated...

--
Garry
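The tag-stripping step Garry describes (keeping only the <A ...>text</A> content of each matched line) could be sketched like this; the LinkText helper is my illustration, not his code:

```vb
' Sketch: return the display text of the first <A ...>...</A> in a line.
Function LinkText(ByVal sLine As String) As String
    Dim p1 As Long, p2 As Long
    p1 = InStr(1, sLine, "<A ", vbTextCompare)   ' start of anchor tag
    If p1 = 0 Then Exit Function
    p1 = InStr(p1, sLine, ">")                   ' end of the opening tag
    p2 = InStr(p1 + 1, sLine, "</A", vbTextCompare)
    If p1 > 0 And p2 > p1 Then LinkText = Mid$(sLine, p1 + 1, p2 - p1 - 1)
End Function
```

Feeding it the 4th sample line above would return "HEICO OHMITE LLC"; the href and title attributes could be picked off the same way with InStr/Mid$.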
#10
Typos...
Robert,

<snip>

The output file has these fieldnames in the 1st line:

  NSN Item#,Description,Part#,MCRL,CAGE,Source

..where the fieldnames would then be:

  Item#,Part#,MCRL,CAGE,Source,REF,ENT,CAT,VAR

<snip>

Your feedback is appreciated...

--
Garry
#11
GS wrote:
<snip>

Like i said, PERFECT! And you are correct, do not need line 3 nor the extended data. Please check my other answer for a corrected search term. The corrected, human-readable version of the URL is:

  https://www.nsncenter.com/NSNSearch?q=5960 regulator and "ELECTRON TUBE"&PageNumber=1

The %20 is a virtual space, and the %22 is a virtual quote. ((guess that is the proper term))
#12
GS wrote:
I wish to read and parse every page of "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" where the page number goes from 5 to 999. On each page, find "<a href="/NSN/5960 [it is longer, but that is the start]. Given the full number (eg: <a href="/NSN/5960-00-831-8683"), open a new related page "https://www.nsncenter.com/NSN/5960-00-831-8683" and find the line ending "(MCRL)". Read abut 4 lines to <a href="/PartNumber/ which is <a href="/PartNumber/GV4S1400" in this case. save/write that line plus the next three; close this secondary online URL and step to next "<a href="/NSN/5960 to process the same way. Continue to end of the page, close that URL and open the next page. Robert, Here's what I have after parsing 'parent' pages for a list of its links: N/A 5960-00-503-9529 5960-00-504-8401 5960-01-035-3901 5960-01-029-2766 5960-00-617-4105 5960-00-729-5602 5960-00-826-1280 5960-00-754-5316 5960-00-962-5391 5960-00-944-4671 This is pg1 where the 1st link doesn't contain "5960" and so will be ignored. 
Each link's text is appended to this URL to bring up its 'child' pg: https://www.nsncenter.com/NSN/ Each child page is parsed for the following 4 lines: <TD style="VERTICAL-ALIGN: middle" align=center<A href="/PartNumber/GV3S2800"GV3S2800</A</TD <TD style="HEIGHT: 60px; WIDTH: 125px; VERTICAL-ALIGN: middle" noWrap align=center <A href="/CAGE/63060"63060</A </TD <TD style="VERTICAL-ALIGN: middle" align=center <A href="/CAGE/63060"<IMG class=img-thumbnail src="https://placehold.it/90x45?text=No%0DImage%0DYet" width=90 height=45</A </TD <TD style="VERTICAL-ALIGN: middle" text-align="center"<A title="CAGE 63060" href="/CAGE/63060"HEICO OHMITE LLC</A</TD I'm stripping html syntax to get this data: Line1: PartNumber/GV3S2800 Line2: CAGE/63060 Line3: https://placehold.it/90x45?text=No%0DImage%0DYet Line4: HEICO OHMITE LLC The output file has these filenames in the 1st line: NSN Item#,Description,Part#,MCRL,CAGE,Source I left the 3rd line URL out since, outside its host webpage, it'll be useless to you. I need to know from you if the 3rd line URL is needed! Otherwise, the output file will have 1 line per item so it can be used as the db file "NSN_5960_ElectronTube.dat". I invite your suggestion for filename... I could extend the collected data to include... Reference Number/DRN_3570 Entity Code/DRN_9250 Category Code/DRN_2910 Variation Code/DRN_4780 ..where the fieldnames would then be: Item#,Part#,MCRL,CAGE,Source,Ref,Entity,Category,V ariation The 1st record will be: 5960-00-503-9529,GV3S2800,3302008,63060,HEICO OHMITE LLC,DRN_3570,DRN_9250,DRN_2910,DRN_4780 Output file size for 1 parent pg is 1Kb; for 10 parent pgs is 10Kb. You could have 1000 parent pgs of data stored in a 1Mb file. Your feedback is appreciated... WOW! Absolutely PERFECT! You are correct, #1) do not need that line 3, and #2) do not need the extended info. File name(s) for PageNumber=1 I would use 5960_001.TXT,..to PageNumber=999 I would use 5960_999.TXT and that would preserve order. 
*OR* Reading & parsing from PageNumber=1 to PageNumber=999, one could
append to the same file (name NSN_5960.TXT); might as well - makes it
easier to pour into a single Excel file. Either way is fine.

I have found a way to get rid of items that are not strictly electron
tubes and/or not regulators; that way you do not have to parse out
these "unfit" items from the first page description. Use:
"https://www.nsncenter.com/NSNSearch?q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=1"
Naturally, PageNumber still goes from 1 to 999. Note the implied "(",
")" and " "; human-readable "5960 regulator and (ELECTRON TUBE)". As
far as i can tell, using that shows no undesirable parts.

Thanks!

PS: i found WGET to be non-useful (a) it truncates the filename (b) it
buggers it to partial gibberish.
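The html-stripping step described above (PartNumber, CAGE, and Source out of the four TD lines) might look like this in Python; the regexes and the simplified TD markup are assumptions for illustration, not the poster's actual code:

```python
import re

# Strip the html syntax from a child page's 4 quoted lines to recover
# the PartNumber, CAGE, and Source fields, then assemble one output
# record. Item# and MCRL values are taken from the example record in
# the thread.
td_lines = [
    '<TD align=center><A href="/PartNumber/GV3S2800">GV3S2800</A></TD>',
    '<TD noWrap align=center><A href="/CAGE/63060">63060</A></TD>',
    '<TD align=center><A href="/CAGE/63060"><IMG src="https://placehold.it/90x45?text=No%0DImage%0DYet"></A></TD>',
    '<TD text-align="center"><A title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</A></TD>',
]
page = "\n".join(td_lines)

part   = re.search(r'href="/(PartNumber/[^"]+)"', page).group(1)
cage   = re.search(r'href="/(CAGE/[^"]+)"', page).group(1)
source = re.search(r'title="CAGE [^"]+" href="/CAGE/[^"]+">([^<]+)</A>',
                   page).group(1)

record = ",".join(["5960-00-503-9529", part.split("/")[1], "3302008",
                   cage.split("/")[1], source])
print(record)
```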
#13
Posted to microsoft.public.excel.programming,microsoft.public.excel
> You are correct, #1) do not need that line 3, and #2) do not need the
> extended info.

Ok then, fieldnames will be: Item#,Part#,MCRL,CAGE,Source

> File name(s): for PageNumber=1 I would use 5960_001.TXT, ..to
> PageNumber=999 I would use 5960_999.TXT, and that would preserve
> order. *OR* Reading & parsing from PageNumber=1 to PageNumber=999,
> one could append to the same file (name NSN_5960.TXT); might as well
> - makes it easier to pour into a single Excel file. Either way is
> fine.

Ok, then output filename will be: NSN_5960.txt

> I have found a way to get rid of items that are not strictly electron
> tubes and/or not regulators; that way you do not have to parse out
> these "unfit" items from the first page description. Use:
> "https://www.nsncenter.com/NSNSearch?q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=1"
> Naturally, PageNumber still goes from 1 to 999. Note the implied "(",
> ")" and " "; human-readable "5960 regulator and (ELECTRON TUBE)". As
> far as i can tell, using that shows no undesirable parts.

Works nice! Now I get 11 5960 items per parent page.

> Thanks!
>
> PS: i found WGET to be non-useful (a) it truncates the filename (b)
> it buggers it to partial gibberish

What is WGET?

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
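The two output schemes settled on above can be sketched as follows (Python, purely illustrative; fieldnames and the sample record are the ones quoted in the thread):

```python
# One zero-padded file per parent page (preserves page order), or
# everything appended into a single NSN_5960.txt whose first line
# carries the agreed fieldnames.
per_page = [f"5960_{page:03d}.TXT" for page in (1, 2, 999)]
print(per_page)

header = "Item#,Part#,MCRL,CAGE,Source"
records = ["5960-00-503-9529,GV3S2800,3302008,63060,HEICO OHMITE LLC"]
single_file = "\n".join([header] + records)   # contents of NSN_5960.txt
print(single_file)
```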
#14
GS wrote:
> [snip]
>
>> PS: i found WGET to be non-useful (a) it truncates the filename (b)
>> it buggers it to partial gibberish
>
> What is WGET?

WGET is a command line program that will copy the contents of an URL to
the hard drive; it has various options: for SSL, i think for some
processing, for giving the output file a specific name, for recursion,
etc. Was still trying to find ways to copy the online file to the hard
drive.

I still do not understand what magic you used.

Now, the nitty-gritty; in exchange for that nicely parsed file, what do
i owe you?
#15
> I still do not understand what magic you used.

I'm using the MS WebBrowser control and a textbox on a worksheet!

> Now, the nitty-gritty; in exchange for that nicely parsed file, what
> do i owe you?

A Timmies, straight up!

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion
#16
GS wrote:
>> I still do not understand what magic you used.
>
> I'm using the MS WebBrowser control and a textbox on a worksheet!
>
>> Now, the nitty-gritty; in exchange for that nicely parsed file, what
>> do i owe you?
>
> A Timmies, straight up!

The search engine was not exactly forthcoming, to say the least;
everything including the kitchen sink, but NOT anything alcoholic.
"Timmies drink" helped some; fifth "hit" down: "Timmy's Sweet and Sour
mix Cocktails and Drink Recipes". Using "Timmies, straight up" was
slightly better: "Average night at the Manotick Timmies... : ottawa".

In all of this, a lot of "hits" mentioned something (always different)
about the Tim Hortons franchise. Absolutely no clue regarding rum,
scotch, vodka, or (dare i say) milk.
#17
Robert Baer wrote:
> PS: i found WGET to be non-useful (a) it truncates the filename (b)
> it buggers it to partial gibberish.

Then you must be using a bad version, or perhaps have something wrong
with your .wgetrc. I've been using wget for around 10 years, and never
had anything like those issues unless I pass bad options.

--
My life is richer, somehow, simply because I know that he exists.
#18
Auric__ wrote:
> Robert Baer wrote:
>> PS: i found WGET to be non-useful (a) it truncates the filename (b)
>> it buggers it to partial gibberish.
>
> Then you must be using a bad version, or perhaps have something wrong
> with your .wgetrc. I've been using wget for around 10 years, and
> never had anything like those issues unless I pass bad options.

Know nothing about .wgetrc; am in Win2K cmd line, and the batch file
used is:

H:
CD\Win2K_WORK\OIL4LESS\LLCDOCS\FED app\FBA stuff
wget --no-check-certificate --output-document=5960_002.TXT --output-file=log002.TXT https://www.nsncenter.com/NSNSearch?...2&PageNumber=2
PAUSE

The SourceForge site offered a Zip which was supposed to be complete,
but none of the created folders had an EXE (tried Win2K, WinXP, Win7).
Found SofTonic offering only a plain-jane wget.exe, which i am using,
so that may be a buggered version.

Suggestions?
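One point worth noting about a batch file like the one above: in cmd.exe an unquoted & ends the command, so a URL containing &PageNumber=2 gets cut off at the ampersand unless it is wrapped in double quotes. A hedged sketch that emits batch lines with the URL quoted (output filenames follow the batch file's pattern; the 3-page loop range is just for the sketch, 1..999 in the real run):

```python
# Emit wget lines with the URL in double quotes so cmd.exe does not
# treat "&PageNumber=..." as a command separator.
base = ("https://www.nsncenter.com/NSNSearch"
        "?q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=")

lines = []
for page in range(1, 4):
    url = base + str(page)
    lines.append("wget --no-check-certificate "
                 f"--output-document=5960_{page:03d}.TXT "
                 f"--output-file=log{page:03d}.TXT "
                 f'"{url}"')

print("\n".join(lines))
```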
#19
Robert Baer wrote:
> Robert Baer wrote:
>> [snip]
>>
>> Know nothing about .wgetrc;

Don't worry about it. It can be used to set default behaviors, but
every entry can be replicated via switches.

>> am in Win2K cmd line, and the batch file used is:
>>
>> H:
>> CD\Win2K_WORK\OIL4LESS\LLCDOCS\FED app\FBA stuff
>> wget --no-check-certificate --output-document=5960_002.TXT --output-file=log002.TXT https://www.nsncenter.com/NSNSearch?...d%20%22ELECTRON%20TUBE%22&PageNumber=2

That wget line performs as expected for me: 5960_002.TXT contains valid
HTML (although I haven't made any attempt to check the data; it looks
like most of the page is CSS) and log002.TXT is a typical wget log of a
successful transfer. As for truncating the filenames, if I remove the
--output-document switch, the filename I get is
NSNSearch@q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=2

>> PAUSE
>>
>> The SourceForge site offered a Zip which was supposed to be complete,

If you're talking about GNUwin32, that version is years out of date.

>> but none of the created folders had an EXE (tried Win2K, WinXP,
>> Win7). Found SofTonic offering only a plain-jane wget.exe, which i
>> am using, so that may be a buggered version.

Never even heard of them.

>> Suggestions?

I'm using 1.16.3. No idea where I got it. The batch file that I use for
downloading looks like this:

call wget --no-check-certificate -x -c -e robots=off -i new.txt %*

-x  Always create directories (e.g. http://a.b.c/1/2.txt ->
    .\a.b.c\1\2.txt).
-c  Continue interrupted downloads.
-e  Do this .wgetrc thing (in this case, ignore the robots.txt file).
-i  Read a list of filenames from the following file ("new.txt" because
    that's the default name for a new file in my file manager).

I use the -i switch so I don't have to worry about escaping characters
or % vs %%. Whatever's in the text file is exactly what it looks for.
(If you go this route, it's one URL per line.)

--
We have to stop letting George Lucas name our politicians.
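Following the -i approach described above, the new.txt list could be generated mechanically; a sketch in Python (the filename and 1..999 page range are the ones discussed in the thread):

```python
# Build the "new.txt" URL list that wget -i reads: one URL per line, so
# no cmd.exe escaping of "&" or "%" is ever needed.
base = ("https://www.nsncenter.com/NSNSearch"
        "?q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=")

urls = [base + str(page) for page in range(1, 1000)]   # pages 1..999
new_txt = "\n".join(urls) + "\n"

with open("new.txt", "w") as f:
    f.write(new_txt)
```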
#20
Auric__ wrote:
> Robert Baer wrote:
>> [snip]
>
> I'm using 1.16.3. No idea where I got it. The batch file that I use
> for downloading looks like this:
>
> call wget --no-check-certificate -x -c -e robots=off -i new.txt %*
>
> [snip]

You must have a different version of Wget; whatever i do on the command
line, including the "trick" of restrict-file-names=nocontrol, i get a
buggered path name plus the response "&PageNumber not recognized".
Exactly the same results in Win2K, WinXP, and Win7.

Yes, i used GNUwin32, as the SourceForge "complete" Wget had no EXE. Is
there some other (compiled, complete) source i should get?
#21
Auric__ wrote:
> Robert Baer wrote:
>> PS: i found WGET to be non-useful (a) it truncates the filename (b)
>> it buggers it to partial gibberish.
>
> Then you must be using a bad version, or perhaps have something wrong
> with your .wgetrc. I've been using wget for around 10 years, and
> never had anything like those issues unless I pass bad options.

I also tried versions 1.18 and 1.13 from
https://eternallybored.org/misc/wget/. Exactly the same truncation and
gibberish.

At least the 1.13 ZIP had wgetrc in the /etc folder; perhaps one step
forward. But nobody said where to put the folder set, and certainly
nothing about setting the path, which just maybe might be useful for
operation.