#41
Posted to microsoft.public.excel.programming
Robert Baer wrote:
> What is with this a*hole posting this sh*t here?

It's just spam. Ignore it.

--
If you would succeed, you must reduce your strategy to its point of application.
#42
Posted to microsoft.public.excel.programming,microsoft.public.excel
> I wish to read and parse every page of
> "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber="
> where the page number goes from 5 to 999. On each page, find
> "<a href="/NSN/5960" [it is longer, but that is the start]. Given the
> full number (eg: <a href="/NSN/5960-00-831-8683"), open a new related
> page "https://www.nsncenter.com/NSN/5960-00-831-8683" and find the
> line ending "(MCRL)". Read about 4 lines to <a href="/PartNumber/,
> which is <a href="/PartNumber/GV4S1400" in this case. Save/write that
> line plus the next three; close this secondary online URL and step to
> the next "<a href="/NSN/5960" to process the same way. Continue to
> the end of the page, close that URL, and open the next page.

Robert,
Here's what I have after parsing 'parent' pages for a list of links:

N/A
5960-00-503-9529
5960-00-504-8401
5960-01-035-3901
5960-01-029-2766
5960-00-617-4105
5960-00-729-5602
5960-00-826-1280
5960-00-754-5316
5960-00-962-5391
5960-00-944-4671

This is pg1, where the 1st link doesn't contain "5960" and so will be ignored. Each link's text is appended to this URL to bring up its 'child' pg: https://www.nsncenter.com/NSN/

Each child page is parsed for the following 4 lines:

<TD style="VERTICAL-ALIGN: middle" align=center><A href="/PartNumber/GV3S2800">GV3S2800</A></TD>
<TD style="HEIGHT: 60px; WIDTH: 125px; VERTICAL-ALIGN: middle" noWrap align=center><A href="/CAGE/63060">63060</A></TD>
<TD style="VERTICAL-ALIGN: middle" align=center><A href="/CAGE/63060"><IMG class=img-thumbnail src="https://placehold.it/90x45?text=No%0DImage%0DYet" width=90 height=45></A></TD>
<TD style="VERTICAL-ALIGN: middle" text-align="center"><A title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</A></TD>

I'm stripping html syntax to get this data:

Line1: PartNumber/GV3S2800
Line2: CAGE/63060
Line3: https://placehold.it/90x45?text=No%0DImage%0DYet
Line4: HEICO OHMITE LLC

The output file has these fieldnames in the 1st line:

NSN Item#,Description,Part#,MCRL,CAGE,Source

I left the 3rd line URL out since, outside its host webpage, it'll be useless to you. I need to know from you if the 3rd line URL is needed! Otherwise, the output file will have 1 line per item so it can be used as the db file "NSN_5960_ElectronTube.dat". I invite your suggestion for a filename...

I could extend the collected data to include...

Reference Number/DRN_3570
Entity Code/DRN_9250
Category Code/DRN_2910
Variation Code/DRN_4780

...where the fieldnames would then be:

Item#,Part#,MCRL,CAGE,Source,Ref,Entity,Category,Variation

The 1st record will be:

5960-00-503-9529,GV3S2800,3302008,63060,HEICO OHMITE LLC,DRN_3570,DRN_9250,DRN_2910,DRN_4780

Output file size for 1 parent pg is 1Kb; for 10 parent pgs it is 10Kb. You could have 1000 parent pgs of data stored in a 1Mb file.

Your feedback is appreciated...

--
Garry
Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion
#43
Posted to microsoft.public.excel.programming,microsoft.public.excel
Typos...

[snip - repost of the previous message; the corrected lines are:]

The output file has these fieldnames in the 1st line:

NSN Item#,Description,Part#,MCRL,CAGE,Source

...where the fieldnames would then be:

Item#,Part#,MCRL,CAGE,Source,REF,ENT,CAT,VAR

--
Garry
#44
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> [snip - the parsed-page summary, quoted in full]

WOW! Absolutely PERFECT!

You are correct: #1) do not need that line 3, and #2) do not need the extended info.

File name(s): for PageNumber=1 I would use 5960_001.TXT, ... to PageNumber=999 I would use 5960_999.TXT, and that would preserve order. *OR* reading & parsing from PageNumber=1 to PageNumber=999, one could append to the same file (name NSN_5960.TXT); might as well - makes it easier to pour into a single Excel file. Either way is fine.

I have found a way to get rid of items that are not strictly electron tubes and/or not regulators; that way you do not have to parse out these "unfit" items from the first page description. Use:

"https://www.nsncenter.com/NSNSearch?q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=1"

Naturally, PageNumber still goes from 1 to 999. Note the implied "(", ")" and " "; human-readable: "5960 regulator and (ELECTRON TUBE)". As far as i can tell, using that shows no undesirable parts. Thanks!

PS: i found WGET to be non-useful: (a) it truncates the filename, (b) it buggers it to partial gibberish.
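For reference, the encoded search URL above reduces to a one-line VBA helper - a sketch with a hypothetical name; %20 encodes a space and %22 a double quote:

  Function SearchUrl(nPage As Long) As String
      'pages run from 1 to 999
      SearchUrl = "https://www.nsncenter.com/NSNSearch?q=" & _
                  "5960%20regulator%20and%20%22ELECTRON%20TUBE%22" & _
                  "&PageNumber=" & nPage
  End Function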
#45
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> [snip - "Typos..." repost]

Like i said, PERFECT! And you are correct: do not need line 3 nor the extended data.

Please check my other answer for a corrected search term. The human-readable version of that URL is:

https://www.nsncenter.com/NSNSearch?q=5960 regulator and "ELECTRON TUBE"&PageNumber=1

The %20 is a virtual space, and the %22 is a virtual quote ((guess that is the proper term)).
#46
Posted to microsoft.public.excel.programming,microsoft.public.excel
> You are correct, #1) do not need that line 3, and #2) do not need the extended info.

Ok then, fieldnames will be: Item#,Part#,MCRL,CAGE,Source

> File name(s): for PageNumber=1 I would use 5960_001.TXT [snip] ...one could append to the same file (name NSN_5960.TXT) [snip] Either way is fine.

Ok, then output filename will be: NSN_5960.txt

> I have found a way to get rid of items that are not strictly electron tubes and/or not regulators [snip] As far as i can tell, using that shows no undesirable parts.

Works nice! Now I get 11 5960 items per parent page. Thanks!

> PS: i found WGET to be non-useful: (a) it truncates the filename, (b) it buggers it to partial gibberish.

What is WGET?

--
Garry
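Appending one record per item to NSN_5960.txt, as agreed above, might look like this in VBA - a sketch with hypothetical names, using the agreed fieldnames Item#,Part#,MCRL,CAGE,Source:

  Sub AppendRecord(sItem As String, sPart As String, sMCRL As String, _
                   sCAGE As String, sSource As String)
      Dim f As Integer
      f = FreeFile
      Open "NSN_5960.txt" For Append As #f
      Print #f, sItem & "," & sPart & "," & sMCRL & "," & sCAGE & "," & sSource
      Close #f
  End Sub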
#47
Posted to microsoft.public.excel.programming,microsoft.public.excel
> I still do not understand what magic you used.

I'm using the MS WebBrowser control and a textbox on a worksheet!

> Now, the nitty-gritty; in exchange for that nicely parsed file, what do i owe you?

A Timmies, straight up!

--
Garry
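Garry doesn't post the WebBrowser-control code; a minimal sketch of the approach he describes might look like this (the control names WebBrowser1 and txtPgSrc are taken from later posts in this thread; Sheet1 and the rest are assumptions). The wait loop matters - a later post traces a snag to code not waiting for the browser:

  Sub LoadPage(sUrl As String)
      With Sheet1.WebBrowser1
          .Navigate sUrl
          'wait until the page has fully loaded (4 = READYSTATE_COMPLETE)
          Do While .Busy Or .ReadyState <> 4
              DoEvents
          Loop
          'copy the rendered source into the textbox for inspection
          Sheet1.txtPgSrc.Text = .Document.body.innerHTML
      End With
  End Sub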
#48
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> [snip]
> What is WGET?

WGET is a command line program that will copy the contents of a URL to the hard drive; it has various options: for SSL, i think for some processing, for giving the output file a specific name, for recursion, etc. Was still trying to find ways to copy the online file to the hard drive.

I still do not understand what magic you used.

Now, the nitty-gritty; in exchange for that nicely parsed file, what do i owe you?
#49
Posted to microsoft.public.excel.programming,microsoft.public.excel
Robert Baer wrote:
> PS: i found WGET to be non-useful: (a) it truncates the filename, (b) it buggers it to partial gibberish.

Then you must be using a bad version, or perhaps have something wrong with your .wgetrc. I've been using wget for around 10 years, and never had anything like those issues unless I pass bad options.

--
My life is richer, somehow, simply because I know that he exists.
#50
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> [snip]
> A Timmies, straight up!

The search engine was not exactly forthcoming, to say the least; everything including the kitchen sink but NOT anything alcoholic. "Timmies drink" helped some; fifth "hit" down: "Timmy's Sweet and Sour mix Cocktails and Drink Recipes". Using "Timmies, straight up" was slightly better: "Average night at the Manotick Timmies... : ottawa". In all of this, a lot of "hits" mentioned something (always different) about the Tim Hortons franchise. Absolutely no clue regarding rum, scotch, vodka or (dare i say) milk.
#51
Posted to microsoft.public.excel.programming,microsoft.public.excel
Auric__ wrote:
> Then you must be using a bad version, or perhaps have something wrong with your .wgetrc. I've been using wget for around 10 years, and never had anything like those issues unless I pass bad options.

Know nothing about .wgetrc; am in the Win2K cmd line, and the batch file used is:

  H:
  CD \Win2K_WORK\OIL4LESS\LLCDOCS\FED app\FBA stuff
  wget --no-check-certificate --output-document=5960_002.TXT --output-file=log002.TXT https://www.nsncenter.com/NSNSearch?...2&PageNumber=2
  PAUSE

The SourceForge site offered a Zip which was supposed to be complete, but none of the created folders had an EXE (tried Win2K, WinXP, Win7). Found SofTonic offering only a plain-jane wget.exe, which i am using, so that may be a buggered version.

Suggestions?
#52
Posted to microsoft.public.excel.programming,microsoft.public.excel
Auric__ wrote:
> Then you must be using a bad version, or perhaps have something wrong with your .wgetrc. [snip]

I also tried versions 1.18 and 1.13 from https://eternallybored.org/misc/wget/. Exactly the same truncation and gibberish. At least the 1.13 ZIP had wgetrc in the /etc folder; perhaps one step forward. But nobody said where to put the folder set, and certainly nothing about setting the path, which just maybe perhaps might be useful for operation.
#53
sản phẩm tốt, giá rẻ, chất lượng, an toàn cho người dùng, dịch vụ tuyệt trần, lúc nào có điều kiện qua ủng hộ nhé. chúc cửa hàng làm ăn hiệu quả
[Vietnamese: "good products, cheap, quality, safe for users, superb service; come support us when you get the chance. Wishing the shop a prosperous business"]
#54
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> [snip - the Timmies search saga]

Ha-ha! Ok.., 'Timmies' is fan-speak for Tim Horton's coffee! <g>

--
Garry
#55
Posted to microsoft.public.excel.programming
everonvietnam2016 wrote:
> [snip - Vietnamese spam]

No kapish. Firstly, on my computer, all i see is a strange mix of characters from the 512 ASCII set. Secondly, i would not be able to read or understand your language even if it was elegantly rendered.
#56
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> [snip]
> Ha-ha! Ok.., 'Timmies' is fan-speak for Tim Horton's coffee! <g>

What i did WRT Wget was uninstall it and check that there were no 'dregs' left on the HD. Then i installed it from scratch, allowing all of the defaults. Finally, i modified the system path (shortened version):

  %SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;C:\Program Files\GnuWin32;

No joy. Even at the root, the system insists wget does not exist (as an executable, etc).
#57
Posted to microsoft.public.excel.programming
Robert Baer wrote:
> everonvietnam2016 wrote:
>> [snip - Vietnamese spam]
> No kapish. Firstly, on my computer, all i see is a strange mix of characters from the 512 ASCII set. Secondly, i would not be able to read or understand your language even if it was elegantly rendered.

It's Vietnamese. *Lots* of accented characters.

Also, it's spam.

--
Who says life is sacred? God? Hey, if you read your history, God is one of the leading causes of death. -- George Carlin
#58
Posted to microsoft.public.excel.programming,microsoft.public.excel
Robert Baer wrote:
> Know nothing about .wgetrc;

Don't worry about it. It can be used to set default behaviors, but every entry can be replicated via switches.

> am in the Win2K cmd line, and the batch file used is:
> H:
> CD \Win2K_WORK\OIL4LESS\LLCDOCS\FED app\FBA stuff
> wget --no-check-certificate --output-document=5960_002.TXT --output-file=log002.TXT https://www.nsncenter.com/NSNSearch?...d%20%22ELECTRON%20TUBE%22&PageNumber=2
> PAUSE

That wget line performs as expected for me: 5960_002.TXT contains valid HTML (although I haven't made any attempt to check the data; it looks like most of the page is CSS) and log002.TXT is a typical wget log of a successful transfer. As for truncating the filenames, if I remove the --output-document switch, the filename I get is

  NSNSearch@q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=2

> The SourceForge site offered a Zip which was supposed to be complete,

If you're talking about GNUwin32, that version is years out of date.

> but none of the created folders had an EXE (tried Win2K, WinXP, Win7). Found SofTonic offering only a plain-jane wget.exe, which i am using, so that may be a buggered version.

Never even heard of them.

> Suggestions?

I'm using 1.16.3. No idea where I got it. The batch file that I use for downloading looks like this:

  call wget --no-check-certificate -x -c -e robots=off -i new.txt %*

-x  Always create directories (e.g. http://a.b.c/1/2.txt -> .\a.b.c\1\2.txt).
-c  Continue interrupted downloads.
-e  Do this .wgetrc thing (in this case, ignore the robots.txt file).
-i  Read the list of filenames from the following file ("new.txt" because that's the default name for a new file in my file manager).

I use the -i switch so I don't have to worry about escaping characters or % vs %%. Whatever's in the text file is exactly what it looks for. (If you go this route, it's one file per line.)

--
We have to stop letting George Lucas name our politicians.
#59
Posted to microsoft.public.excel.programming,microsoft.public.excel
Auric__ wrote:
> [snip]
> I'm using 1.16.3. No idea where I got it. [snip]

You must have a different version of Wget; whatever i do on the command line, including the "trick" of restrict-file-names=nocontrol, i get a buggered path name plus the response "&PageNumber not recognized". Exactly the same results in Win2K, WinXP or Win7.

Yes, i used GNUwin32, as the SourceForge "complete" Wget had no EXE. Is there some other (compiled, complete) source i should get?
#60
Posted to microsoft.public.excel.programming
Auric__ wrote:
> It's Vietnamese. *Lots* of accented characters.
> Also, it's spam.

Yes, it was easy to recognize that was Vietnamese. How the heck did you figure out that it was spam?
#61
sản phẩm tốt, giá rẻ, chất lượng, an toàn cho người sử dụng, dịch vụ ráo trời, lúc nào có điều kiện qua ủng hộ nhé. chúc cửa hàng làm ăn hiệu quả
[Vietnamese: much the same spam as before - "good products, cheap, quality, safe for the user... come support us when you get the chance. Wishing the shop a prosperous business"]
#62
Posted to microsoft.public.excel.programming,microsoft.public.excel
Robert Baer wrote:
> Auric__ wrote:
>> [snip]
>> As for truncating the filenames, if I remove the --output-document switch, the filename I get is
>> NSNSearch@q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=2
> [snip]
> You must have a different version of Wget; whatever i do on the command line, including the "trick" of restrict-file-names=nocontrol, i get a buggered path name plus the response "&PageNumber not recognized". Exactly the same results in Win2K, WinXP or Win7.

Hmm. Well... it could be that your copy of wget was compiled with old path length limits (260 characters). I suppose the best thing to do there is to try a different copy.

> Yes, i used GNUwin32, as the SourceForge "complete" Wget had no EXE. Is there some other (compiled, complete) source i should get?

Just google "wget windows" (without quotes) and start poking around. Download a few different versions and see if any of them work for you.

--
Stupid railroad plot.
#63
Posted to microsoft.public.excel.programming
Robert Baer wrote:
> Yes, it was easy to recognize that was Vietnamese. How the heck did you figure out that it was spam?

Well, it's Vietnamese in an almost-entirely English group, replying to a thread that's entirely in English. Also, courtesy of Google translate:

  Is good, cheap, quality, seated * n for users, divine service, at nÃ*o conditional support through offline. Wish c »* a hà * ng là * m efficiently

The text is buggered on my end so I can't get a complete translation, but you can see that it's meant to advertise *something*. The post immediately preceding it was also Vietnamese spam, complete with a link. (If you didn't see it, don't worry about it.) Also, it's deleted from the Google archive. That's a pretty good sign right there.

--
The only thing your dreams will land you is dead in a ditch.
#64
Posted to microsoft.public.excel.programming,microsoft.public.excel
Currently, it's ready to fully automate, but it seemed to have a snag writing past the 1st parent page's child pages. Turns out the problem was code not waiting until the browser was no longer busy.

I switched to using URLDownloadToFile() at this point because it's orders of magnitude faster. Using the browser/textbox on a sheet served well for getting the process code nailed down, but that was only a temp situation during dev.

The links error out from about mid pg7 thru pg10; I time-tested only pgs 1 thru 10 - this took 50.89 secs!

I'll do some housekeeping of the code and post a download link to the file...

--
Garry
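The URLDownloadToFile() switch-over Garry describes uses the urlmon.dll API; a minimal 32-bit VBA sketch (the wrapper name is hypothetical):

  Private Declare Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" _
      (ByVal pCaller As Long, ByVal szURL As String, _
       ByVal szFileName As String, ByVal dwReserved As Long, _
       ByVal lpfnCB As Long) As Long

  Sub FetchPage(sUrl As String, sFile As String)
      'returns 0 (S_OK) on success
      If URLDownloadToFile(0, sUrl, sFile, 0, 0) <> 0 Then
          MsgBox "Download failed: " & sUrl
      End If
  End Sub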
#65
Posted to microsoft.public.excel.programming,microsoft.public.excel
> Excel macros are SO... undocumented. Need a WORKING example for reading the HTML source of a URL (say http://www.oil4lessllc.org/gTX.htm). Thanks.

Look here...

https://app.box.com/s/23yqum8auvzx17h04u4f

...for *ParseWebPages.zip*, which contains:

ParseWebPages.xls
NSN_5960.txt (blank data file with fieldnames only in line 1)
NSN_5960_Test.txt (results for the 1st 20 pages)

--
Garry
#66
Posted to microsoft.public.excel.programming,microsoft.public.excel
"Problem" with Excel, is that there are MANY ways to get what is
needed, and there is NO WAY of discovering _any_ of them; the "help" document is worse than useless in that manner. I have found that URLDownloadToFile() to be non-functional for https sources. I disagree because it's working in my project I posted the download for! -- Garry Free usenet access at http://www.eternal-september.org Classic VB Users Regroup! comp.lang.basic.visual.misc microsoft.public.vb.general.discussion --- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus |
#67
Posted to microsoft.public.excel.programming,microsoft.public.excel
Robert Baer wrote:
>> [snip - download link for ParseWebPages.zip]
> I did not even try cURL as the explanation was just too dern complicated. Fiddled in Excel, as it has so many different ways to do something specific. So, this is a skeleton of what i have:
> [snip - Workbooks.Open skeleton, plus the XP/Win7 "HD is full" trouble report]

I don't follow what you're talking about here! What does it have to do with the download I linked to?

--
Garry
#68
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> Currently, it's ready to fully automate, but it seemed to have a snag writing past the 1st parent page's child pages. Turns out the problem was code not waiting until the browser was no longer busy. I switched to using URLDownloadToFile() at this point because it's orders of magnitude faster. [snip]

"Problem" with Excel is that there are MANY ways to get what is needed, and there is NO WAY of discovering _any_ of them; the "help" document is worse than useless in that manner. I have found URLDownloadToFile() to be non-functional for https sources.

> The links error out from about mid pg7 thru pg10; I time-tested only pgs 1 thru 10 - this took 50.89 secs! I'll do some housekeeping of the code and post a download link to the file...
#69
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> [snip - download link for ParseWebPages.zip]

I did not even try cURL, as the explanation was just too dern complicated. Fiddled in Excel, as it has so many different ways to do something specific. So, this is a skeleton of what i have:

  Workbooks.Open Filename:=openFYL$  'opens as R/O, no HD space taken
  'then..
  With Worksheets(1)
  '    .Copy  ''do not need; saves BOOK space
      .SaveAs sav$  'do not know how to close when done
      'above creates the file described; that takes HD space, about 300K
  End With

IMMEDIATELY after the "End With", a folder is created with useless metadata info; do not know how to close when done.

WARNING: Scheme works only in XP and Win7. In XP, at about 150 files one gets a PHONY "HD is full" warning, and one must exit Excel so as to be able to delete the processed (and so unwanted) files. I say PHONY because the system showed NO CHANGE in HD free space, never mind that those files take about 500MB. Furthermore, in Win7, these files show up in a folder the system KNOWS NOTHING ABOUT: Windows Explorer does not show C:\Documents, which IS accessible; C:\<sysname>\My Documents is shown and CANNOT be accessed. Instead of the Excel program crashing, the system is shut down and locked. YET more reasons I hate Win7.
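The "do not know how to close when done" gap in the skeleton above is what leaves files and temp folders behind: closing the workbook after the save releases them. A sketch using the same variable names (the wrapper name is an assumption):

  Sub SaveCopyAndClose(openFYL As String, sav As String)
      Dim wb As Workbook
      Set wb = Workbooks.Open(Filename:=openFYL, ReadOnly:=True)
      wb.Worksheets(1).SaveAs sav      'SaveAs on the sheet, as in the skeleton
      wb.Close SaveChanges:=False      'the missing "close when done"
  End Sub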
#70
Posted to microsoft.public.excel.programming,microsoft.public.excel
If you're referring to the substitute 'page error' text put in place of missing item info, well, that might be misleading you. Fact is, starting with item7 on pg7 there is no item info on any of the pages I checked manually in the browser (up to pg100).

Perhaps you could rephrase that to "No Data Available"!?

--
Garry
#71
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> Holy S*! I did about 30 pages by hand; quit, as it was rather tiresome and the total pages unknown (MORE than 999). Never saw the fail you saw. Difference is that you used the word "and"; technically (i think) that should not affect results. Also, you got items I am interested in, and after processing 503 pages, i did NOT get those. In both cases, there were a lot of duplicate records (government data, what else can you expect?). In your sample, there were 73 useful records containing 43 unique records. There may be some that i am not interested in, but there definitely ARE those i did not find that i am interested in.

You could 'dump' the file into a worksheet and filter out the dupes easily enough.

> In my sample, there were 3782 unique records, and (better sit down), only 15 were interesting. Crappy odds. Hopefully, when i call them, someone that has some experience and knowledge of how their sort criteria works will answer the phone. Last time i called, i got a new guy; no help other than "use Google".

Those are dynamic web pages and so are database driven. Surely there's a repository database for this info somewhere other than NSN?

> You have done a masterful job! Label it !DONE! please. Thanks a lot.

Happy to be of help; I found the project rather interesting!

--
Garry
#72
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> [snip - download link for ParseWebPages.zip]

Holy S*! I did about 30 pages by hand; quit, as it was rather tiresome and the total pages unknown (MORE than 999). Never saw the fail you saw. Difference is that you used the word "and"; technically (i think) that should not affect results. Also, you got items I am interested in, and after processing 503 pages, i did NOT get those.

In both cases, there were a lot of duplicate records (government data, what else can you expect?). In your sample, there were 73 useful records containing 43 unique records. There may be some that i am not interested in, but there definitely ARE those i did not find that i am interested in.

In my sample, there were 3782 unique records, and (better sit down), only 15 were interesting. Crappy odds. Hopefully, when i call them, someone that has some experience and knowledge of how their sort criteria works will answer the phone. Last time i called, i got a new guy; no help other than "use Google".

You have done a masterful job! Label it !DONE! please. Thanks a lot.
#73
Posted to microsoft.public.excel.programming,microsoft.public.excel
I've uploaded a new version that skips dupes and flags missing item info. (See the new 'Test' file.)

This version also runs orders of magnitude faster!

--
Garry
#74
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> I don't follow what you're talking about here! What does it have to do with the download I linked to?

In the meantime, i took a stab at a "pure" Excel program to get the data. Whatever you do, and more eXplicitly HOW you do the search, yields results that i do not see. Manually downloading the first page for a manual search, I get:

5960 REGULATOR AND "ELECTRON TUBE"
About 922 results (1 ms)
5960-00-503-9529
5960-00-504-8401
5960-01-035-3901
5960-01-029-2766
5960-00-617-4105
5960-00-729-5602
5960-00-826-1280
5960-00-754-5316
5960-00-962-5391
5960-00-944-4671
5960-00-897-8418

and

5960 AND REGULATOR AND "ELECTRON TUBE"
About 104 results (16 ms)
5960-00-503-9529
5960-00-504-8401
5960-01-035-3901
5960-01-029-2766
5960-00-617-4105
5960-00-729-5602
5960-00-826-1280
5960-00-754-5316
5960-00-962-5391
5960-00-944-4671
5960-00-897-8418

Note the result counts are very different, and the second search "gets" a lot less. Also, neither search gets anything you got, and i am interested in how you did it.
#75
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> If you're referring to the substitute 'page error' text put in place of missing item info, well, that might be misleading you. Fact is, starting with item7 on pg7 there is no item info on any of the pages I checked manually in the browser (up to pg100). Perhaps you could rephrase that to "No Data Available"!?

Machs nicht. I also looked manually, and you are correct. Why the heck they have NSNs that do not relate to a part is puzzling, but, hey, it *IS* the government. Not useful to what i need, but still nice to know.
#76
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> [snip]
> You could 'dump' the file into a worksheet and filter out the dupes easily enough.

* Yes, i did that, getting those 43 unique records.

> Those are dynamic web pages and so are database driven. Surely there's a repository database for this info somewhere other than NSN?

* Prolly not dynamic, as NSNs do not change except for possible additions on rare occasion. Certainly, a SPECIFIC search (so far) always gives the same results. Look at a previous response concerning 100% manual search results, first page only:

5960 AND REGULATOR AND "ELECTRON TUBE" - About 104 results
5960 REGULATOR AND "ELECTRON TUBE" - About 922 results

Totally different, and neither matches your results. And your results look superior.

> Happy to be of help; I found the project rather interesting!
#77
Posted to microsoft.public.excel.programming,microsoft.public.excel
> Also, neither search gets anything you got, and i am interested in how you did it.

If you study the file I gave you, you'll see how both methods are working. The worksheet implements all manual parsing so you can study each part of the process as well as the web page source structure; the *AutoParse* macro collects the data and writes it to the file.

--
Garry
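For the child-page step, the parsing the thread describes (find "(MCRL)", then the next /PartNumber/ link) reduces to a few InStr calls in VBA - a hypothetical sketch, not the AutoParse code itself:

  Function ExtractPart(sHtml As String) As String
      Dim p As Long, q As Long
      p = InStr(1, sHtml, "(MCRL)")
      If p = 0 Then Exit Function         'no MCRL section on this page
      p = InStr(p, sHtml, "/PartNumber/")
      If p = 0 Then Exit Function         'missing item info (see above)
      p = p + Len("/PartNumber/")
      q = InStr(p, sHtml, """")           'closing quote of the href
      ExtractPart = Mid$(sHtml, p, q - p) 'e.g. "GV3S2800"
  End Function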
#78
Posted to microsoft.public.excel.programming,microsoft.public.excel
Robert Baer wrote:
> GS wrote:
>> [snip]
>> Happy to be of help; I found the project rather interesting!
> [snip - the search comparison, repeated from the previous message]

I am getting more confused. Search term used, and response:

5960 AND REGULATOR AND "ELECTRON TUBE" - About 104 results
5960 REGULATOR AND "ELECTRON TUBE" - About 922 results
5960 regulator and "ELECTRON TUBE" - About 3134377 results

Use of the second two gives exactly the same list for the first page, and the last term is the one you used in your program. The results, like i previously said, are completely different WRT the first term and your program (3 different results). Notice the 3.1 million results when lower case is used for "regulator"; i think the database "engine" is thrashing around in what is almost a useless attempt. BUT that thrashing produces very useful results (after sort and consolidate).

SO: (1) results are dependent on the form / format of the search term used; (2) results depend on the (in this case) Excel procedure used that does the access and fetch.

Now i know rather well that Excel mangles the HTML information when it is imported, most especially the primary page. I had my Excel program working to parse the primary HTML page AS SEEN BY THE HUMAN EYE ON THE WEB, and i had to make a number of changes to accommodate what Excel gave me. Therefore, on that basis, i have a rather strong suspicion that what Excel SENDS to the web for a search is quite different than what we think.

Comments? Suggestions, primarily to get it more efficient BUT ALSO to give it all to us?

PS: How i get the data:

  Workbooks.Open Filename:=openFYL$  'YES..opens as R/O
  With Worksheets(1)
  '    .Copy  ''do not need; saves BOOKn space
      .SaveAs sav$  'do not know how to close when done
  End With
#79
Posted to microsoft.public.excel.programming,microsoft.public.excel
The process result after copy/pasting a web page into a worksheet is *entirely different* from reading the webpage source. Both my examples read webpage source, *not the rendered page you see in the browser*! The fault of using copy/paste on a webpage is that different browsers *often* won't display content the same way. If you read the source in tmp.txt, you 'should' very quickly realize these pages are a template wherein data is dynamically inserted from a database via script embedded in the source html.

I use the last URL query *you provided* in both the worksheet approach and the AutoParse() sub. The tmp.txt file shows the complete webpage source, whereas txtPgSrc shows the webpage source *as rendered* in WebBrowser1. WebBrowser1 will display whatever is in the URL cell above it; AutoParse uses the string defined as Public Const gsUrl1$. You need to decide what URL string you want to run with, and set both the URL cell and gsUrl1 strings to that.

Scrap the copy/paste webpage approach altogether, because it's unreliable at best and renders inconsistent results at worst! (*Clue:* Note how WebBrowser1 wraps content but txtPgSrc does not!) You are collecting data here, NOT capturing webpage content as rendered. The data displays according to the source behind the rendered webpage. That source is structured to be dynamic in terms of what data is rendered based on the URL string, and HOW it displays depends on the browser being used to view the data. In this case, WebBrowser1 uses the same engine as Internet Explorer, and what you see on your screen *depends on* which version of that engine is running!

If you've ever used HTML to build webpages, you'd know (or at the very least *should know*) instinctively that the code source is the only reliable element to work with.

HTH

--
Garry
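One common VBA way to read raw page source, consistent with Garry's point about reading source rather than the rendered page (an assumption - the thread doesn't show which method produced tmp.txt):

  Function GetPageSource(sUrl As String) As String
      Dim oHttp As Object
      Set oHttp = CreateObject("MSXML2.XMLHTTP")
      oHttp.Open "GET", sUrl, False    'synchronous request
      oHttp.send
      If oHttp.Status = 200 Then GetPageSource = oHttp.responseText
  End Function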
#80
Posted to microsoft.public.excel.programming,microsoft.public.excel
GS wrote:
> [snip - explanation of reading webpage source vs. the rendered page]

Maybe i was not too clear.

Case one: Using a browser, log on to https://www.nsncenter.com/ and give it the search term 5960&REGULATOR&"ELECTRON TUBE" in the NSN box, and click on the WebFLIS Search green button. Then use the browser "File" pulldown, select "Save Page As", and modify the extension to .TXT. The resulting file is a bit different from what one sees with other methods.

Case two: Choose a method of getting the search results; a given search term will always produce the same results (ie: reproducible), and small changes to the search term may give different results - and THOSE DIFFERENCES are some of what i am talking about.

Case three: Choose a given search term, and compare results between various methods; DIFFERENCES may be huge - also some of what i am talking about.

In case three, with your program, whatever is happening gives a radically different result. And that result is VERY useful.

For some unknown reason, your program/macro refuses to run, and gives the following error message: "Can't find project or library".

Would you be so kind as to modify the search term in your program to 5960&REGULATOR&"ELECTRON TUBE" and run it? And please send the results?