#1
Posted to microsoft.public.excel.programming,microsoft.public.excel
Auric__ wrote:
Robert Baer wrote: Auric__ wrote: Robert Baer wrote:

And assuming a fix, what can i do about the OPEN command/syntax?

// What i did in Excel:
S$ = "D:\Website\Send .Hot\****"
tmp = Environ("TEMP") & "\" & S$

The contents of the variable S$ at this point:

S$ = "C:\Users\auric\D:\Website\Send .Hot\****"

Do you see the problem? Also, as Garry pointed out, cleanup should happen automatically. The "Kill" keyword deletes files. Try this code: [snip]

Grumble.. do not understand well enough to get working. Now i do not know what i had that fully worked with the gTX.htm file. The following "almost" works; it fails on the open.

You know, one of us is confused, and I'm not entirely sure it isn't me. I've given you (theoretically) working code twice now, and yet you insist on making some pretty radical changes that DON'T ****ING WORK! So, let's step back from the coding for a moment, and let's have you explain ***EXACTLY*** what it is you want done. Give examples like, "given data X, I want to do Y, with result Z." Unless I get a clearer explanation of what you're trying to do, I'm done with this thread.

I wish to read and parse every page of "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" where the page number goes from 5 to 999.

On each page, find "<a href="/NSN/5960 [it is longer, but that is the start]. Given the full number (eg: <a href="/NSN/5960-00-831-8683"), open a new related page "https://www.nsncenter.com/NSN/5960-00-831-8683" and find the line ending "(MCRL)". Read about 4 lines to <a href="/PartNumber/ which is <a href="/PartNumber/GV4S1400" in this case. Save/write that line plus the next three; close this secondary online URL and step to the next "<a href="/NSN/5960 to process the same way. Continue to the end of the page, close that URL and open the next page.

Crude code:

CLOSE
' PRS5960.BAS (QuickBasic)
' watch linewrap below..
SRC1$ = "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber="
SRC2$ = "https://www.nsncenter.com/NSN/5960"  'example only
FSC$ = "/NSN/5960"
OPEN "FSC5960.TXT" FOR APPEND AS #9
' Let page number run from 05 to 39 to read existing files
FOR PG = 5 TO 39
  A$ = ""
  FPG$ = RIGHT$("0" + MID$(STR$(PG), 2), 2)
  ' These files, FPG$ + ".txt", are copies from the web
  OPEN FPG$ + ".txt" FOR INPUT AS #1
  ON ERROR GOTO END1
  PRINT FPG$ + ".txt",  'is screen note to me
  WHILE NOT EOF(1)
    WHILE INSTR(A$, FSC$) = 0  'skip 7765 lines of junk
      LINE INPUT #1, A$  'look for <a href="/NSN/5960-00-754-5782" Class= ETC
    WEND
    P = INSTR(A$, FSC$) + 9: FPG2$ = SRC2$ + MID$(A$, P, 12)
    NSN$ = "5960" + MID$(A$, P, 12)
    PRINT NSN$  'is screen note to me
    AHREF$ = ".." + FSC$ + MID$(A$, P, 12)
    'Need URL FPG2$ or .. a href to get balance of data
    ' See comments above this program
    PRINT #9, NSN$
    LINE INPUT #1, A$
  WEND
END1: RESUME LAB
LAB: CLOSE #1
NEXT PG
CLOSE
SYSTEM

** Note the Function URLDownloadToFile does not allow spaces; there is one in the "page" URL. Problem #2: the Function URLDownloadToFile does not allow https website URLs. Other than those problems, i have everything else working fine.
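Since both noted problems are in the download step, here is a hedged sketch of how that step might look in Excel VBA, using the URLDownloadToFile declaration quoted later in this thread. The space in the query is pre-encoded as %20 (a literal space makes the call fail); whether https URLs succeed depends on the urlmon/IE version installed, so treat this as a sketch under those assumptions, with a made-up destination path:

```vb
' 32-bit VBA declaration, as quoted elsewhere in this thread
Private Declare Function URLDownloadToFile Lib "urlmon" _
    Alias "URLDownloadToFileA" (ByVal pCaller As Long, _
    ByVal szURL As String, ByVal szFileName As String, _
    ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

Sub SaveParentPage(ByVal PageNum As Long)
    Dim url As String, dest As String
    ' %20 instead of the raw space between "5960" and "regulator"
    url = "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" & PageNum
    dest = Environ$("TEMP") & "\" & Format$(PageNum, "00") & ".txt"
    If URLDownloadToFile(0, url, dest, 0, 0) <> 0 Then  ' 0 = S_OK
        Debug.Print "download failed for page "; PageNum
    End If
End Sub
```

The saved files would then feed the same per-page parse loop as the QuickBasic program above.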
#2
So what you also want is the linked file (web page) the image or part#
links to! Here's what I got from https://www.nsncenter.com/NSN/5960-00-831-8683 (pg4):

1st occurrence of <a href="/NSN/5960 is at line 7878;
1st occurrence of (MCRL) is at line 7931;
1st occurrence after that of <a href="/PartNumber" is this, at line 7951:

  <td align="center" style="vertical-align: middle;"><a href="/PartNumber/GV4S1400">GV4S1400</a></td>

and the next 3 lines:

  <td style="width: 125px; height: 60px; vertical-align: middle;" align="center" nowrap><a href="/CAGE/63060">63060</a></td>
  <td align="center" style="vertical-align: middle;"><a href="/CAGE/63060"><img class="img-thumbnail" src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45 width=90 /></a></td>
  <td text-align="center" style="vertical-align: middle;"><a title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</a></td>

So you want to go to the next page linked to and repeat the process?

At this point my Excel sheet has been modified as follows:

  Source | NSN Item# | Description | Part# | MCRL#
  Tektronix | 5960-00-831-8683 | ELECTRON TUBE | GV4S1400 | 4932653
    <a href="/CAGE/63060">63060</a>
    <a href="/CAGE/63060"><img class="img-thumbnail" src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45 width=90 /></a>
    <a title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</a>
  General Dynamics | 5960-00-853-8207 | ELECTRON TUBE | 295-29434 | 5074477
    line1
    line2
    line3

...and so on. So far, I'm working with text files, and so I'm inclined to append each item to a file named "ElectronTube_NSN5960.txt". File contents for the 2 items above would be structured so the 1st line contains headings (data fields) so it works properly with ADODB. (Note that I use a comma as delimiter, and the file does not contain any blank lines)...
  Source,NSN Item#,Description,Part#,MCRL#
  Tektronix,5960-00-831-8683,ELECTRON TUBE,GV4S1400,4932653
  <a href="/CAGE/63060">63060</a>
  <a href="/CAGE/63060"><img class="img-thumbnail" src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45 width=90 /></a>
  <a title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</a>
  General Dynamics,5960-00-853-8207,ELECTRON TUBE,295-29434,5074477
  <a href="/CAGE/1VPW8">1VPW8</a>
  <a href="/CAGE/1VPW8"><img class="img-thumbnail" src="https://az774353.vo.msecnd.net/cage/90/1vpw8.jpg" alt="CAGE 1VPW8" height=45 width=90 /></a>
  <a title="CAGE 1VPW8" href="/CAGE/1VPW8">GENERAL DYNAMICS C4 SYSTEMS, INC.</a>

...where I have parsed off the CSS formatting text and html tags outside <a...</a> from the 3 lines. I'd likely convert the UCase to proper case as well.

The file size is 653 bytes, meaning a full page would be about 4kb, 1000 pages being about 4mb. That's 44 lines per page after the fields line. A file this size can be easily handled via ADO recordset or std VB file I/O functions/methods. Loading into an array (vData) puts fields in vData(0) and records starting at vData(1), and looping would Step 4.

I really don't have the time/energy (I have Lou Gehrig's) to get any more involved with your project due to current commitments. I just felt it might be worth explaining how I'd handle your task in the hopes it would be helpful to you reaching a viable solution. I bid you good wishes going forward...

--
Garry
Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup! comp.lang.basic.visual.misc microsoft.public.vb.general.discussion

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
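A minimal sketch of the read-back Garry describes (the vData name, the filename, and the Step-4 walk are from his description; the procedure itself is my illustration, not his code): load the whole file in one shot, Split it into an array so the fields line sits in vData(0), then step through records 4 lines at a time (1 CSV line + 3 anchor lines per item).

```vb
' Sketch: walk ElectronTube_NSN5960.txt 4 lines per item, as described.
Sub WalkItems()
    Dim f As Integer, s As String, vData As Variant, i As Long
    f = FreeFile
    Open "ElectronTube_NSN5960.txt" For Input As #f
    s = Input$(LOF(f), f)        ' whole file in one shot
    Close #f
    vData = Split(s, vbCrLf)     ' vData(0) = field headings
    For i = 1 To UBound(vData) Step 4
        Debug.Print vData(i)     ' the CSV record for this item
        ' vData(i + 1) to vData(i + 3) hold the three <a ...> lines
    Next i
End Sub
```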
#3
GS wrote:
<snip>

Thanks for the guide. You are getting all of the right stuff from what i would call the second file. The first file is "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" & PageNum, where PageNum (in ASCII) goes from "1" to "999". Note the (implied?) space in the URL. I think that by now you have it all figured out.

In snooping around, i have just stumbled on the ADODB scheme, and what little i have found so far looks very promising. Only one example, which does not work (examples NEVER work), and zero explanations so far.
It seems that, with the proper code, ADODB would allow me to copy those first files to a HD. Would you be so kind as to share your working ADODB code? Or did you hand-copy the source like i did? Thanks again.
#4
You are getting all of the right stuff from what i would call the
second file. The first file is "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" & PageNum, where PageNum (in ASCII) goes from "1" to "999". Note the (implied?) space in the URL.

I got Source, NSN Part#, Description from the 1st file. The NSN Item# links to the 2nd file.

<snip>

Would you be so kind as to share your working ADODB code? Or did you hand-copy the source like i did?

I use std VB file I/O, not ADODB. Initial procedure was to copy/paste page source into Textpad and save as Tmp.txt, then load the file into an array and parse from there.

I thought I'd take a look at going with a userform and MS Web Browser control for more flexible programming opts, but haven't had the time. I assume this would definitely give you an advantage over trying to automate IE, but I need to research using it. I do have URL functions built into my fpSpread.ocx for doing this stuff, but that's an expensive 3rd party AX component. Otherwise, doing this from Excel isn't something I'm familiar with.

--
Garry
#5
I thought I'd take a look at going with a userform and MS Web Browser
control for more flexible programming opts

While I'm on pause waiting to consult with the client on my current project... This is doable; I have a userform with a web browser, a textbox, and some buttons.

The Web Browser doesn't display AddressBar/StatusBar for some reason, even though these props are set 'True'. (The initial URL (pg1) is hard-coded as a result.) You navigate to parent pages using Next/Last buttons, starting with pg1 on load. Optionally, you can enter a page# in a GoTo box.

The browser lets you select links, and you load its current document page source into txtViewSrc via btnViewSrc. This action also Splits() the page source into vPgSrc for locating search text selected in cboSearchTxt. The cboSearchTxt_Change event auto-locates your desired lines at present, but I will have it appending them to file shortly. The file will be structured the same as illustrated earlier. I think this could be fully automated after I see how the links are defined in their innerHTML.

For now, I'll provide a button to write found lines, because it gives an opportunity to preview the data going into your file. This will happen via loading found lines into vaLinesOut(), which is sized 0 to 3. This makes the search sequence important so the output file has its lines in the correct order (top-to-bottom in page source).

I use my own file read/write procedures because they're configured for large amounts of data in 1 shot to/from dbase.txt files, and so are included in the userform class.

While there's still a manual element to this, it's going to be orders of magnitude less daunting and more efficient than what you do now. It seems highly likely that, over time, this entire task can be fully automated just by entering the URL for pg1!

--
Garry
#6
GS wrote:
<snip>

Way beyond me. If in HTML one can copy an <a href="..." to the hard drive, then that is all i need.
#7
GS wrote:
<snip>

Way beyond me. If in HTML one can copy an <a href="..." to the hard drive, then that is all i need.
Just another approach, since you seem to be having difficulty getting URLDownloadToFile() to work. My approach reads the innerHTML of web pages and outputs it to a txt file.

Not sure why you want to grab html and save it to disc, given that file size is a concern. My approach puts parsed data from all 999 pages into a txt file less than 4mb in size. Once the individual steps have been optimized, automating the entire process will be easy. (I'll leave that part for you to do however you want it to work.)

I will post the contents of my fParseWebPages.frm file. You will need to set a ref to the Microsoft Web Browser to use it.

--
Garry
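Garry's fParseWebPages.frm isn't reproduced in this thread, but the core move he describes (reading a loaded page's innerHTML from a Web Browser control and writing it to a text file) might look roughly like this. The btnViewSrc and Tmp.txt names come from his earlier posts; the wbMain control name and the output path are placeholders of mine, not his code:

```vb
' Sketch: dump the browser control's current document source to a file.
' Assumes a userform with a WebBrowser control named wbMain that has
' finished navigating (ReadyState 4 = READYSTATE_COMPLETE).
Private Sub btnViewSrc_Click()
    Dim f As Integer
    If wbMain.ReadyState <> 4 Then Exit Sub      ' page not loaded yet
    f = FreeFile
    Open Environ$("TEMP") & "\Tmp.txt" For Output As #f
    Print #f, wbMain.Document.body.innerHTML     ' page source as rendered
    Close #f
End Sub
```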
#8
GS wrote:
<snip>

Check. I know QBASIC fairly well, so a lot of that knowledge crosses over to VB. Someone here was kind enough to give me a full working program that can be used to copy a URL source to a temp file on the HD. Once available, all else is very simple and straightforward.

The rub is that function (or something it uses) does not allow a space in the URL, AND also does not allow https. So i need two work-arounds, and the https part would seem to be the worst. I do not know how it works, or what DLLs/libraries it calls; no useful information is available. It is:

  Declare Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As Long, _
      ByVal szURL As String, ByVal szFileName As String, _
      ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

Only the well-known keywords can be found; 'urlmon', 'pCaller', 'szURL', and 'szFileName' are unknowns and not findable in the so-called VB help.
And there are no examples; the few ranDUMB ones are incomplete and/or do not work. I do not see how you use std VB file I/O; AFAIK one cannot open a web page as if it were a file.
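For what it's worth, the parameters here are part of the Windows urlmon API rather than VB itself: pCaller and lpfnCB can be passed as 0 for simple use, dwReserved must be 0, and the return value is 0 (S_OK) on success. Since URLDownloadToFile is the sticking point, here is a hedged sketch of an alternative fetch via MSXML2.ServerXMLHTTP, which can retrieve https pages; the procedure name and output path (reusing the "05.txt" naming from the QuickBasic program) are my own examples, not code from the thread:

```vb
' Sketch only: fetch an https page with MSXML and write the source to
' disk, bypassing URLDownloadToFile. Assumes MSXML 6.0 is installed.
Sub FetchPageSource()
    Dim http As Object, f As Integer
    Dim url As String
    ' the space in the query is already percent-encoded as %20
    url = "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=5"
    Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
    http.Open "GET", url, False          ' synchronous request
    http.Send
    If http.Status = 200 Then
        f = FreeFile
        Open Environ$("TEMP") & "\05.txt" For Output As #f
        Print #f, http.responseText      ' whole page source
        Close #f
    End If
End Sub
```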
#9
I wish to read and parse every page of "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" where the page number goes from 5 to 999.

<snip>

Robert,

Here's what I have after parsing 'parent' pages for a list of its links:

  N/A
  5960-00-503-9529
  5960-00-504-8401
  5960-01-035-3901
  5960-01-029-2766
  5960-00-617-4105
  5960-00-729-5602
  5960-00-826-1280
  5960-00-754-5316
  5960-00-962-5391
  5960-00-944-4671

This is pg1, where the 1st link doesn't contain "5960" and so will be ignored. Each link's text is appended to this URL to bring up its 'child' pg:

  https://www.nsncenter.com/NSN/

Each child page is parsed for the following 4 lines:

  <TD style="VERTICAL-ALIGN: middle" align=center><A href="/PartNumber/GV3S2800">GV3S2800</A></TD>
  <TD style="HEIGHT: 60px; WIDTH: 125px; VERTICAL-ALIGN: middle" noWrap align=center><A href="/CAGE/63060">63060</A></TD>
  <TD style="VERTICAL-ALIGN: middle" align=center><A href="/CAGE/63060"><IMG class=img-thumbnail src="https://placehold.it/90x45?text=No%0DImage%0DYet" width=90 height=45></A></TD>
  <TD style="VERTICAL-ALIGN: middle" text-align="center"><A title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</A></TD>

I'm stripping html syntax to get this data:

  Line1: PartNumber/GV3S2800
  Line2: CAGE/63060
  Line3: https://placehold.it/90x45?text=No%0DImage%0DYet
  Line4: HEICO OHMITE LLC

The output file has these filenames in the 1st line:

  NSN Item#,Description,Part#,MCRL,CAGE,Source

I left the 3rd line URL out since, outside its host webpage, it'll be useless to you. I need to know from you if the 3rd line URL is needed! Otherwise, the output file will have 1 line per item so it can be used as the db file "NSN_5960_ElectronTube.dat". I invite your suggestion for filename...

I could extend the collected data to include...

  Reference Number/DRN_3570
  Entity Code/DRN_9250
  Category Code/DRN_2910
  Variation Code/DRN_4780

...where the fieldnames would then be:

  Item#,Part#,MCRL,CAGE,Source,Ref,Entity,Category,Variation

The 1st record will be:

  5960-00-503-9529,GV3S2800,3302008,63060,HEICO OHMITE LLC,DRN_3570,DRN_9250,DRN_2910,DRN_4780

Output file size for 1 parent pg is 1Kb; for 10 parent pgs is 10Kb. You could have 1000 parent pgs of data stored in a 1Mb file.

Your feedback is appreciated...

--
Garry
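The tag-stripping step Garry describes (keeping only the <A ...>text</A> content of each matched line) could be sketched like this; the LinkText helper is my illustration, not his code:

```vb
' Sketch: return the display text of the first <A ...>...</A> in a line.
Function LinkText(ByVal sLine As String) As String
    Dim p1 As Long, p2 As Long
    p1 = InStr(1, sLine, "<A ", vbTextCompare)   ' start of anchor tag
    If p1 = 0 Then Exit Function
    p1 = InStr(p1, sLine, ">")                   ' end of the opening tag
    p2 = InStr(p1 + 1, sLine, "</A", vbTextCompare)
    If p1 > 0 And p2 > p1 Then LinkText = Mid$(sLine, p1 + 1, p2 - p1 - 1)
End Function
```

Feeding it the 4th sample line above would return "HEICO OHMITE LLC"; the href and title attributes could be picked off the same way with InStr/Mid$.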
#10
Typos...
Robert,

<snip>

The output file has these fieldnames in the 1st line:

  NSN Item#,Description,Part#,MCRL,CAGE,Source

..where the fieldnames would then be:

  Item#,Part#,MCRL,CAGE,Source,REF,ENT,CAT,VAR

<snip>

Your feedback is appreciated...

--
Garry
#11
GS wrote:
<snip>

Like i said, PERFECT! And you are correct, do not need line 3 nor the extended data. Please check my other answer for a corrected search term. The corrected, human-readable version of the URL is:

  https://www.nsncenter.com/NSNSearch?q=5960 regulator and "ELECTRON TUBE"&PageNumber=1

The %20 is a virtual space, and the %22 is a virtual quote. ((guess that is the proper term))
#12
GS wrote:
I wish to read and parse every page of "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" where the page number goes from 5 to 999. On each page, find "<a href="/NSN/5960 [it is longer, but that is the start]. Given the full number (eg: <a href="/NSN/5960-00-831-8683"), open a new related page "https://www.nsncenter.com/NSN/5960-00-831-8683" and find the line ending "(MCRL)". Read abut 4 lines to <a href="/PartNumber/ which is <a href="/PartNumber/GV4S1400" in this case. save/write that line plus the next three; close this secondary online URL and step to next "<a href="/NSN/5960 to process the same way. Continue to end of the page, close that URL and open the next page. Robert, Here's what I have after parsing 'parent' pages for a list of its links: N/A 5960-00-503-9529 5960-00-504-8401 5960-01-035-3901 5960-01-029-2766 5960-00-617-4105 5960-00-729-5602 5960-00-826-1280 5960-00-754-5316 5960-00-962-5391 5960-00-944-4671 This is pg1 where the 1st link doesn't contain "5960" and so will be ignored. 
Each link's text is appended to this URL to bring up its 'child' pg: https://www.nsncenter.com/NSN/ Each child page is parsed for the following 4 lines: <TD style="VERTICAL-ALIGN: middle" align=center<A href="/PartNumber/GV3S2800"GV3S2800</A</TD <TD style="HEIGHT: 60px; WIDTH: 125px; VERTICAL-ALIGN: middle" noWrap align=center <A href="/CAGE/63060"63060</A </TD <TD style="VERTICAL-ALIGN: middle" align=center <A href="/CAGE/63060"<IMG class=img-thumbnail src="https://placehold.it/90x45?text=No%0DImage%0DYet" width=90 height=45</A </TD <TD style="VERTICAL-ALIGN: middle" text-align="center"<A title="CAGE 63060" href="/CAGE/63060"HEICO OHMITE LLC</A</TD I'm stripping html syntax to get this data: Line1: PartNumber/GV3S2800 Line2: CAGE/63060 Line3: https://placehold.it/90x45?text=No%0DImage%0DYet Line4: HEICO OHMITE LLC The output file has these filenames in the 1st line: NSN Item#,Description,Part#,MCRL,CAGE,Source I left the 3rd line URL out since, outside its host webpage, it'll be useless to you. I need to know from you if the 3rd line URL is needed! Otherwise, the output file will have 1 line per item so it can be used as the db file "NSN_5960_ElectronTube.dat". I invite your suggestion for filename... I could extend the collected data to include... Reference Number/DRN_3570 Entity Code/DRN_9250 Category Code/DRN_2910 Variation Code/DRN_4780 ..where the fieldnames would then be: Item#,Part#,MCRL,CAGE,Source,Ref,Entity,Category,V ariation The 1st record will be: 5960-00-503-9529,GV3S2800,3302008,63060,HEICO OHMITE LLC,DRN_3570,DRN_9250,DRN_2910,DRN_4780 Output file size for 1 parent pg is 1Kb; for 10 parent pgs is 10Kb. You could have 1000 parent pgs of data stored in a 1Mb file. Your feedback is appreciated... WOW! Absolutely PERFECT! You are correct, #1) do not need that line 3, and #2) do not need the extended info. File name(s) for PageNumber=1 I would use 5960_001.TXT,..to PageNumber=999 I would use 5960_999.TXT and that would preserve order. 
*OR* Reading & parsing from PageNumber=1 to PageNumber=999, one could
append to the same file (name NSN_5960.TXT); might as well - makes it
easier to pour into a single Excel file. Either way is fine.

I have found a way to get rid of items that are not strictly electron
tubes and/or not regulators; that way you do not have to parse out
these "unfit" items from the first page description. Use:
"https://www.nsncenter.com/NSNSearch?q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=1"
Naturally, PageNumber still goes from 1 to 999. Note the implied "(",
")" and " "; human-readable "5960 regulator and (ELECTRON TUBE)". As
far as i can tell, using that shows no undesirable parts.

Thanks!

PS: i found WGET to be non-useful (a) it truncates the filename (b) it
buggers it to partial gibberish.
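The html-stripping step described above (PartNumber, CAGE, and Source out of the four TD lines) might look like this in Python; the regexes and the simplified TD markup are assumptions for illustration, not the poster's actual code:

```python
import re

# Strip the html syntax from a child page's 4 quoted lines to recover
# the PartNumber, CAGE, and Source fields, then assemble one output
# record. Item# and MCRL values are taken from the example record in
# the thread.
td_lines = [
    '<TD align=center><A href="/PartNumber/GV3S2800">GV3S2800</A></TD>',
    '<TD noWrap align=center><A href="/CAGE/63060">63060</A></TD>',
    '<TD align=center><A href="/CAGE/63060"><IMG src="https://placehold.it/90x45?text=No%0DImage%0DYet"></A></TD>',
    '<TD text-align="center"><A title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</A></TD>',
]
page = "\n".join(td_lines)

part   = re.search(r'href="/(PartNumber/[^"]+)"', page).group(1)
cage   = re.search(r'href="/(CAGE/[^"]+)"', page).group(1)
source = re.search(r'title="CAGE [^"]+" href="/CAGE/[^"]+">([^<]+)</A>',
                   page).group(1)

record = ",".join(["5960-00-503-9529", part.split("/")[1], "3302008",
                   cage.split("/")[1], source])
print(record)
```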
#13
Posted to microsoft.public.excel.programming,microsoft.public.excel
> You are correct, #1) do not need that line 3, and #2) do not need the
> extended info.

Ok then, fieldnames will be: Item#,Part#,MCRL,CAGE,Source

> File name(s): for PageNumber=1 I would use 5960_001.TXT, ..to
> PageNumber=999 I would use 5960_999.TXT, and that would preserve
> order. *OR* Reading & parsing from PageNumber=1 to PageNumber=999,
> one could append to the same file (name NSN_5960.TXT); might as well
> - makes it easier to pour into a single Excel file. Either way is
> fine.

Ok, then output filename will be: NSN_5960.txt

> I have found a way to get rid of items that are not strictly electron
> tubes and/or not regulators; that way you do not have to parse out
> these "unfit" items from the first page description. Use:
> "https://www.nsncenter.com/NSNSearch?q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=1"
> Naturally, PageNumber still goes from 1 to 999. Note the implied "(",
> ")" and " "; human-readable "5960 regulator and (ELECTRON TUBE)". As
> far as i can tell, using that shows no undesirable parts.

Works nice! Now I get 11 5960 items per parent page.

> Thanks!
>
> PS: i found WGET to be non-useful (a) it truncates the filename (b)
> it buggers it to partial gibberish

What is WGET?

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
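The two output schemes settled on above can be sketched as follows (Python, purely illustrative; fieldnames and the sample record are the ones quoted in the thread):

```python
# One zero-padded file per parent page (preserves page order), or
# everything appended into a single NSN_5960.txt whose first line
# carries the agreed fieldnames.
per_page = [f"5960_{page:03d}.TXT" for page in (1, 2, 999)]
print(per_page)

header = "Item#,Part#,MCRL,CAGE,Source"
records = ["5960-00-503-9529,GV3S2800,3302008,63060,HEICO OHMITE LLC"]
single_file = "\n".join([header] + records)   # contents of NSN_5960.txt
print(single_file)
```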
#14
GS wrote:
> [snip]
>
>> PS: i found WGET to be non-useful (a) it truncates the filename (b)
>> it buggers it to partial gibberish
>
> What is WGET?

WGET is a command line program that will copy the contents of an URL to
the hard drive; it has various options: for SSL, i think for some
processing, for giving the output file a specific name, for recursion,
etc. Was still trying to find ways to copy the online file to the hard
drive.

I still do not understand what magic you used.

Now, the nitty-gritty; in exchange for that nicely parsed file, what do
i owe you?
#15
> I still do not understand what magic you used.

I'm using the MS WebBrowser control and a textbox on a worksheet!

> Now, the nitty-gritty; in exchange for that nicely parsed file, what
> do i owe you?

A Timmies, straight up!

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion
#16
GS wrote:
>> I still do not understand what magic you used.
>
> I'm using the MS WebBrowser control and a textbox on a worksheet!
>
>> Now, the nitty-gritty; in exchange for that nicely parsed file, what
>> do i owe you?
>
> A Timmies, straight up!

The search engine was not exactly forthcoming, to say the least;
everything including the kitchen sink, but NOT anything alcoholic.
"Timmies drink" helped some; fifth "hit" down: "Timmy's Sweet and Sour
mix Cocktails and Drink Recipes". Using "Timmies, straight up" was
slightly better: "Average night at the Manotick Timmies... : ottawa".

In all of this, a lot of "hits" mentioned something (always different)
about the Tim Hortons franchise. Absolutely no clue regarding rum,
scotch, vodka, or (dare i say) milk.
#17
Robert Baer wrote:
> PS: i found WGET to be non-useful (a) it truncates the filename (b)
> it buggers it to partial gibberish.

Then you must be using a bad version, or perhaps have something wrong
with your .wgetrc. I've been using wget for around 10 years, and never
had anything like those issues unless I pass bad options.

--
My life is richer, somehow, simply because I know that he exists.
#18
Auric__ wrote:
> Robert Baer wrote:
>> PS: i found WGET to be non-useful (a) it truncates the filename (b)
>> it buggers it to partial gibberish.
>
> Then you must be using a bad version, or perhaps have something wrong
> with your .wgetrc. I've been using wget for around 10 years, and
> never had anything like those issues unless I pass bad options.

Know nothing about .wgetrc; am in Win2K cmd line, and the batch file
used is:

H:
CD\Win2K_WORK\OIL4LESS\LLCDOCS\FED app\FBA stuff
wget --no-check-certificate --output-document=5960_002.TXT --output-file=log002.TXT https://www.nsncenter.com/NSNSearch?...2&PageNumber=2
PAUSE

The SourceForge site offered a Zip which was supposed to be complete,
but none of the created folders had an EXE (tried Win2K, WinXP, Win7).
Found SofTonic offering only a plain-jane wget.exe, which i am using,
so that may be a buggered version.

Suggestions?
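One point worth noting about a batch file like the one above: in cmd.exe an unquoted & ends the command, so a URL containing &PageNumber=2 gets cut off at the ampersand unless it is wrapped in double quotes. A hedged sketch that emits batch lines with the URL quoted (output filenames follow the batch file's pattern; the 3-page loop range is just for the sketch, 1..999 in the real run):

```python
# Emit wget lines with the URL in double quotes so cmd.exe does not
# treat "&PageNumber=..." as a command separator.
base = ("https://www.nsncenter.com/NSNSearch"
        "?q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=")

lines = []
for page in range(1, 4):
    url = base + str(page)
    lines.append("wget --no-check-certificate "
                 f"--output-document=5960_{page:03d}.TXT "
                 f"--output-file=log{page:03d}.TXT "
                 f'"{url}"')

print("\n".join(lines))
```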
#19
Robert Baer wrote:
> Robert Baer wrote:
>> [snip]
>>
>> Know nothing about .wgetrc;

Don't worry about it. It can be used to set default behaviors, but
every entry can be replicated via switches.

>> am in Win2K cmd line, and the batch file used is:
>>
>> H:
>> CD\Win2K_WORK\OIL4LESS\LLCDOCS\FED app\FBA stuff
>> wget --no-check-certificate --output-document=5960_002.TXT --output-file=log002.TXT https://www.nsncenter.com/NSNSearch?...d%20%22ELECTRON%20TUBE%22&PageNumber=2

That wget line performs as expected for me: 5960_002.TXT contains valid
HTML (although I haven't made any attempt to check the data; it looks
like most of the page is CSS) and log002.TXT is a typical wget log of a
successful transfer. As for truncating the filenames, if I remove the
--output-document switch, the filename I get is
NSNSearch@q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=2

>> PAUSE
>>
>> The SourceForge site offered a Zip which was supposed to be complete,

If you're talking about GNUwin32, that version is years out of date.

>> but none of the created folders had an EXE (tried Win2K, WinXP,
>> Win7). Found SofTonic offering only a plain-jane wget.exe, which i
>> am using, so that may be a buggered version.

Never even heard of them.

>> Suggestions?

I'm using 1.16.3. No idea where I got it. The batch file that I use for
downloading looks like this:

call wget --no-check-certificate -x -c -e robots=off -i new.txt %*

-x  Always create directories (e.g. http://a.b.c/1/2.txt ->
    .\a.b.c\1\2.txt).
-c  Continue interrupted downloads.
-e  Do this .wgetrc thing (in this case, ignore the robots.txt file).
-i  Read a list of filenames from the following file ("new.txt" because
    that's the default name for a new file in my file manager).

I use the -i switch so I don't have to worry about escaping characters
or % vs %%. Whatever's in the text file is exactly what it looks for.
(If you go this route, it's one URL per line.)

--
We have to stop letting George Lucas name our politicians.
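Following the -i approach described above, the new.txt list could be generated mechanically; a sketch in Python (the filename and 1..999 page range are the ones discussed in the thread):

```python
# Build the "new.txt" URL list that wget -i reads: one URL per line, so
# no cmd.exe escaping of "&" or "%" is ever needed.
base = ("https://www.nsncenter.com/NSNSearch"
        "?q=5960%20regulator%20and%20%22ELECTRON%20TUBE%22&PageNumber=")

urls = [base + str(page) for page in range(1, 1000)]   # pages 1..999
new_txt = "\n".join(urls) + "\n"

with open("new.txt", "w") as f:
    f.write(new_txt)
```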
#20
Auric__ wrote:
> Robert Baer wrote:
>> [snip]
>
> I'm using 1.16.3. No idea where I got it. The batch file that I use
> for downloading looks like this:
>
> call wget --no-check-certificate -x -c -e robots=off -i new.txt %*
>
> [snip]

You must have a different version of Wget; whatever i do on the command
line, including the "trick" of restrict-file-names=nocontrol, i get a
buggered path name plus the response "&PageNumber not recognized".
Exactly the same results in Win2K, WinXP, and Win7.

Yes, i used GNUwin32, as the SourceForge "complete" Wget had no EXE. Is
there some other (compiled, complete) source i should get?
#21
Auric__ wrote:
> Robert Baer wrote:
>> PS: i found WGET to be non-useful (a) it truncates the filename (b)
>> it buggers it to partial gibberish.
>
> Then you must be using a bad version, or perhaps have something wrong
> with your .wgetrc. I've been using wget for around 10 years, and
> never had anything like those issues unless I pass bad options.

I also tried versions 1.18 and 1.13 from
https://eternallybored.org/misc/wget/. Exactly the same truncation and
gibberish.

At least the 1.13 ZIP had wgetrc in the /etc folder; perhaps one step
forward. But nobody said where to put the folder set, and certainly
nothing about setting the path, which just maybe might be useful for
operation.