Posted to microsoft.public.excel.programming,microsoft.public.excel
Robert Baer
Read (and parse) file on the web CORRECTION#3

GS wrote:
Typos...

Robert,
Here's what I have after parsing 'parent' pages for a list of its links:

N/A
5960-00-503-9529
5960-00-504-8401
5960-01-035-3901
5960-01-029-2766
5960-00-617-4105
5960-00-729-5602
5960-00-826-1280
5960-00-754-5316
5960-00-962-5391
5960-00-944-4671

This is pg1 where the 1st link doesn't contain "5960" and so will be
ignored.
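
A minimal Python sketch of that filtering step, assuming the link texts have already been pulled from the parent page into a list. The function name `keep_fsc_links` is made up for illustration:

```python
def keep_fsc_links(link_texts, fsc="5960"):
    """Keep only the link texts that contain the given class code."""
    return [text for text in link_texts if fsc in text]


# First few link texts from pg1, as listed above.
page1 = ["N/A", "5960-00-503-9529", "5960-00-504-8401", "5960-01-035-3901"]
nsn_links = keep_fsc_links(page1)  # "N/A" is dropped
```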

Each link's text is appended to this URL to bring up its 'child' pg:

https://www.nsncenter.com/NSN/
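
In Python, building the child URL could look like the sketch below. `child_url` is a hypothetical helper name, and the code only does the string concatenation described above. It does not fetch anything:

```python
BASE_URL = "https://www.nsncenter.com/NSN/"


def child_url(nsn):
    """Append a link's text (the NSN) to the base URL."""
    return BASE_URL + nsn


first_child = child_url("5960-00-503-9529")
```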

Each child page is parsed for the following 4 lines:

<TD style="VERTICAL-ALIGN: middle" align=center><A
href="/PartNumber/GV3S2800">GV3S2800</A></TD>
<TD style="HEIGHT: 60px; WIDTH: 125px; VERTICAL-ALIGN: middle" noWrap
align=center>&nbsp;&nbsp;<A href="/CAGE/63060">63060</A>&nbsp;&nbsp;</TD>
<TD style="VERTICAL-ALIGN: middle" align=center>&nbsp;&nbsp;<A
href="/CAGE/63060"><IMG class=img-thumbnail
src="https://placehold.it/90x45?text=No%0DImage%0DYet" width=90
height=45></A>&nbsp;&nbsp;</TD>
<TD style="VERTICAL-ALIGN: middle" text-align="center"><A title="CAGE
63060" href="/CAGE/63060">HEICO OHMITE LLC</A></TD>

I'm stripping html syntax to get this data:

Line1: PartNumber/GV3S2800
Line2: CAGE/63060
Line3: https://placehold.it/90x45?text=No%0DImage%0DYet
Line4: HEICO OHMITE LLC
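
One possible way to do the stripping, sketched in Python with the standard-library `html.parser`. This is not necessarily how the data above was produced. It assumes the four cells are well-formed HTML (each tag closed with `>`), and the class and function names are invented for the example:

```python
from html.parser import HTMLParser

SAMPLE = """
<TD style="VERTICAL-ALIGN: middle" align=center><A href="/PartNumber/GV3S2800">GV3S2800</A></TD>
<TD style="HEIGHT: 60px; WIDTH: 125px; VERTICAL-ALIGN: middle" noWrap align=center>&nbsp;&nbsp;<A href="/CAGE/63060">63060</A>&nbsp;&nbsp;</TD>
<TD style="VERTICAL-ALIGN: middle" align=center>&nbsp;&nbsp;<A href="/CAGE/63060"><IMG class=img-thumbnail src="https://placehold.it/90x45?text=No%0DImage%0DYet" width=90 height=45></A>&nbsp;&nbsp;</TD>
<TD style="VERTICAL-ALIGN: middle" text-align="center"><A title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</A></TD>
"""


class CellParser(HTMLParser):
    """Collect link targets, image sources and link texts."""

    def __init__(self):
        super().__init__()
        self.hrefs, self.images, self.texts = [], [], []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lowercased
        if tag == "a":
            self._in_link = True
            if attrs.get("href"):
                self.hrefs.append(attrs["href"].lstrip("/"))
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Only text inside a link counts; the &nbsp; padding is outside.
        if self._in_link and data.strip():
            self.texts.append(data.strip())


def extract_lines(html):
    """Return the four values in the Line1..Line4 order used above."""
    parser = CellParser()
    parser.feed(html)
    parser.close()
    return [parser.hrefs[0], parser.hrefs[1], parser.images[0], parser.texts[-1]]


fields = extract_lines(SAMPLE)
```

Picking values by position (`hrefs[0]`, `texts[-1]`) is fragile if the page layout changes, so a real run would probably want some checks around it.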


The output file has these fieldnames in the 1st line:

NSN Item#,Description,Part#,MCRL,CAGE,Source

I left the 3rd line URL out since, outside its host webpage, it'll be
useless to you. I need to know from you if the 3rd line URL is needed!

Otherwise, the output file will have 1 line per item so it can be used
as the db file "NSN_5960_ElectronTube.dat". I invite your suggestion
for filename...

I could extend the collected data to include...

Reference Number/DRN_3570
Entity Code/DRN_9250
Category Code/DRN_2910
Variation Code/DRN_4780

..where the fieldnames would then be:

Item#,Part#,MCRL,CAGE,Source,REF,ENT,CAT,VAR

The 1st record will be:

5960-00-503-9529,GV3S2800,3302008,63060,HEICO OHMITE LLC,DRN_3570,DRN_9250,DRN_2910,DRN_4780
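
For illustration, here is a Python sketch of writing that header line and first record with the standard `csv` module. The helper name `to_csv` is hypothetical, and the text is built in memory rather than written to the .dat file:

```python
import csv
import io

FIELDS = ["Item#", "Part#", "MCRL", "CAGE", "Source", "REF", "ENT", "CAT", "VAR"]

FIRST_RECORD = [
    "5960-00-503-9529", "GV3S2800", "3302008", "63060", "HEICO OHMITE LLC",
    "DRN_3570", "DRN_9250", "DRN_2910", "DRN_4780",
]


def to_csv(records):
    """Return the header plus one comma-separated line per record."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(records)
    return buf.getvalue()


text = to_csv([FIRST_RECORD])
```

A reason to use `csv.writer` over plain string joining: if a Source name ever contains a comma, the writer quotes that field so the record still splits into nine columns.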

Output file size for 1 parent pg is about 1 KB; for 10 parent pgs it is
about 10 KB. You could have 1000 parent pgs of data stored in a file of
roughly 1 MB.

Your feedback is appreciated...


Like I said, PERFECT!
And you are correct, I do not need line 3 or the extended data.
Please check my other answer for a corrected search term. The
!corrected! human-readable version of the URL is:
https://www.nsncenter.com/NSNSearch?q=5960 regulator and "ELECTRON TUBE"&PageNumber=1
The %20 is a virtual space, and the %22 is a virtual quote.
((guess that is the proper term))
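
The usual name for the %20 / %22 notation is percent-encoding (also called URL encoding). As a rough Python sketch, the standard `urllib.parse.quote` function turns the human-readable search term into the encoded form. The helper name `search_url` is made up here, and the page-number parameter is simply copied from the URL above:

```python
from urllib.parse import quote, unquote

SEARCH_BASE = "https://www.nsncenter.com/NSNSearch"


def search_url(term, page=1):
    """Percent-encode the search term and build the query URL."""
    return f"{SEARCH_BASE}?q={quote(term)}&PageNumber={page}"


term = '5960 regulator and "ELECTRON TUBE"'
url = search_url(term)
# url is:
# https://www.nsncenter.com/NSNSearch?q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=1

# unquote() reverses the encoding and gives back the readable term.
readable = unquote("5960%20regulator%20and%20%22ELECTRON%20TUBE%22")
```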