  #1   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 1,182
Default Read (and parse) file on the web CORRECTION#2

So what you also want is the linked file (web page) the image or part#
links to! Here's what I got from
https://www.nsncenter.com/NSN/5960-00-831-8683 (pg4):

1st occurrence of <a href="/NSN/5960 is at line 7878;

1st occurrence of (MCRL) is at line 7931;

1st occurrence after that of <a href="/PartNumber is this, at line 7951:
<td align="center" style="vertical-align: middle;"><a
href="/PartNumber/GV4S1400">GV4S1400</a></td>

and the next 3 lines are:
<td style="width: 125px; height: 60px; vertical-align: middle;"
align="center" nowrap>&nbsp;&nbsp;<a
href="/CAGE/63060">63060</a>&nbsp;&nbsp;</td>
<td align="center" style="vertical-align: middle;">&nbsp;&nbsp;<a
href="/CAGE/63060"><img class="img-thumbnail"
src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45
width=90 /></a>&nbsp;&nbsp;</td>
<td text-align="center" style="vertical-align: middle;"><a title="CAGE
63060" href="/CAGE/63060">HEICO OHMITE LLC</a></td>
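
The search sequence above (find the NSN link, then the next "(MCRL)" line, then the next PartNumber link) could be sketched in VBA roughly as follows. This is only a minimal sketch: FindLineAfter and DemoFindLines are hypothetical names, and the four sample lines are made up to stand in for the real page source, so the indexes printed will not match the line numbers quoted above.

```vb
Option Explicit

' Hypothetical helper: index of the first element of vLines, at or after
' lStart, that contains sFind (case-insensitive); -1 if there is none.
Function FindLineAfter(vLines As Variant, ByVal sFind As String, _
                       Optional ByVal lStart As Long = 0) As Long
  Dim n As Long
  FindLineAfter = -1
  If lStart < LBound(vLines) Then lStart = LBound(vLines)
  For n = lStart To UBound(vLines)
    If InStr(1, vLines(n), sFind, vbTextCompare) > 0 Then
      FindLineAfter = n
      Exit Function
    End If
  Next n
End Function

Sub DemoFindLines()
  Dim vPgSrc As Variant, lNsn As Long, lMcrl As Long, lPart As Long
  ' Made-up sample standing in for the page source split into lines
  vPgSrc = Split("<html>" & vbLf & _
    "<a href=""/NSN/5960-00-831-8683"">5960-00-831-8683</a>" & vbLf & _
    "<td>(MCRL)</td>" & vbLf & _
    "<a href=""/PartNumber/GV4S1400"">GV4S1400</a>", vbLf)

  ' Each search starts on the line after the previous hit
  lNsn = FindLineAfter(vPgSrc, "<a href=""/NSN/5960")
  lMcrl = FindLineAfter(vPgSrc, "(MCRL)", lNsn + 1)
  lPart = FindLineAfter(vPgSrc, "<a href=""/PartNumber", lMcrl + 1)
  Debug.Print lNsn, lMcrl, lPart   '0-based indexes into vPgSrc
End Sub
```

Starting each search one line past the previous hit is what keeps the three finds tied to the same item rather than to an earlier one on the page.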


So you want to go to the next page linked to and repeat the process?

At this point my Excel sheet has been modified as follows:

Source | NSN Item# | Description | Part# | MCRL#
Tektronix | 5960-00-831-8683 | ELECTRON TUBE | GV4S1400 | 4932653
<a href="/CAGE/63060">63060</a>
<a href="/CAGE/63060"><img class="img-thumbnail"
src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45
width=90 /></a>
<a title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</a>

General Dynamics | 5960-00-853-8207 | ELECTRON TUBE | 295-29434 | 5074477
line1
line2
line3

...and so on.

So far, I'm working with text files and so I'm inclined to append each
item to a file named "ElectronTube_NSN5960.txt". File contents for the
2 items above would be structured so the 1st line contains headings
(data fields) so it works properly with ADODB. (Note that I use a comma
as delimiter, and the file does not contain any blank lines)...

Source,NSN Item#,Description,Part#,MCRL#
Tektronix,5960-00-831-8683,ELECTRON TUBE,GV4S1400,4932653
<a href="/CAGE/63060">63060</a>
<a href="/CAGE/63060"><img class="img-thumbnail"
src="https://placehold.it/90x45?text=No%0DImage%0DYet" height=45
width=90 /></a>
<a title="CAGE 63060" href="/CAGE/63060">HEICO OHMITE LLC</a>
General Dynamics,5960-00-853-8207,ELECTRON TUBE,295-29434,5074477
<a href="/CAGE/1VPW8">1VPW8</a>
<a href="/CAGE/1VPW8"><img class="img-thumbnail"
src="https://az774353.vo.msecnd.net/cage/90/1vpw8.jpg" alt="CAGE 1VPW8"
height=45 width=90 /></a>
<a title="CAGE 1VPW8" href="/CAGE/1VPW8">GENERAL DYNAMICS C4 SYSTEMS,
INC.</a>

...where I have parsed off the CSS formatting text and html tags outside
<a...>...</a> from the 3 lines. I'd likely convert the UCase to proper
case as well.
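
As an illustration of that trimming step, here is a minimal sketch. AnchorOnly is a hypothetical helper name (not from the code discussed in this thread); it assumes each source line holds exactly one complete link. StrConv with vbProperCase is the built-in VBA way to get proper case.

```vb
Option Explicit

' Hypothetical helper: keeps only the <a ...>...</a> part of a line,
' dropping the <td> wrapper, its CSS attributes and the &nbsp; padding.
Function AnchorOnly(ByVal sLine As String) As String
  Dim lStart As Long, lEnd As Long
  lStart = InStr(1, sLine, "<a ", vbTextCompare)
  lEnd = InStrRev(sLine, "</a>", -1, vbTextCompare)
  If lStart > 0 And lEnd > lStart Then
    AnchorOnly = Mid$(sLine, lStart, lEnd + Len("</a>") - lStart)
  Else
    AnchorOnly = sLine  'no complete link found; leave the line alone
  End If
End Function

Sub DemoAnchorOnly()
  Const sTD As String = "<td style=""vertical-align: middle;"">&nbsp;" & _
    "<a title=""CAGE 63060"" href=""/CAGE/63060"">HEICO OHMITE LLC</a></td>"
  Debug.Print AnchorOnly(sTD)
  ' Proper case applied to the link text only, not to the whole tag
  Debug.Print StrConv("HEICO OHMITE LLC", vbProperCase)  'Heico Ohmite Llc
End Sub
```

Note that proper case also lowers abbreviations ("LLC" becomes "Llc"), and applying it to the whole line would change the case of the href as well, so it is probably best limited to the company name.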

The file size is 653 bytes for those 2 items, meaning a full page would
be about 4kb; 1000 pages being about 4mb. That's 44 lines per page (11
items x 4 lines) after the fields line.

A file this size can be easily handled via ADO recordset or std VB file
I/O functions/methods. Loading into an array (vData) puts fields in
vData(0) and records starting at vData(1), and looping would Step 4.
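
A minimal sketch of that Step 4 loop, assuming every record really is exactly 4 lines in the file (1 data line plus 3 link lines, with no wrapping). ListRecords and DemoListRecords are hypothetical names, and the sample text is the first record from above.

```vb
Option Explicit

' Walks records laid out as described: element 0 holds the field names,
' then each record takes 4 elements (1 data line + 3 link lines).
Sub ListRecords(vData As Variant)
  Dim n As Long, vFields As Variant
  ' Stop 3 short of the end so vData(n + 3) always exists
  For n = 1 To UBound(vData) - 3 Step 4
    vFields = Split(vData(n), ",")
    Debug.Print vFields(0); " | "; vFields(1); " | "; vData(n + 1)
  Next n
End Sub

Sub DemoListRecords()
  Dim sText As String
  sText = "Source,NSN Item#,Description,Part#,MCRL#" & vbCrLf & _
    "Tektronix,5960-00-831-8683,ELECTRON TUBE,GV4S1400,4932653" & vbCrLf & _
    "<a href=""/CAGE/63060"">63060</a>" & vbCrLf & _
    "<a href=""/CAGE/63060""><img class=""img-thumbnail"" " & _
    "src=""https://placehold.it/90x45?text=No%0DImage%0DYet"" " & _
    "height=45 width=90 /></a>" & vbCrLf & _
    "<a title=""CAGE 63060"" href=""/CAGE/63060"">HEICO OHMITE LLC</a>"
  ListRecords Split(sText, vbCrLf)
End Sub
```

Only the data line is split on commas; the link lines are used whole, so a comma inside a company name does not upset the field count.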

I really don't have the time/energy (I have Lou Gehrig's) to get any
more involved with your project due to current commitments. I just felt
it might be worth explaining how I'd handle your task in the hopes it
would be helpful to you reaching a viable solution. I bid you good
wishes going forward...

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

  #2   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 93
Default Read (and parse) file on the web CORRECTION#2

* Thanks for the guide.


You are getting all of the right stuff from what i would call the
second file.
The first file is
"https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=" &
PageNum where PageNum (in ASCII) goes from "1" to "999".
Note the (implied?) space in the URL.
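
For what it's worth, the "%20" in that URL is simply how a space is written inside a URL. A minimal sketch of building the page URLs follows; PageUrl and BASE_URL are hypothetical names, and only the space is encoded here (other special characters would need their own handling).

```vb
Option Explicit

Const BASE_URL As String = "https://www.nsncenter.com/NSNSearch?q="

' Hypothetical helper: URL of one search-results page. The space in the
' search text is sent as %20, which is why the URL shows "5960%20regulator".
Function PageUrl(ByVal sQuery As String, ByVal lPage As Long) As String
  PageUrl = BASE_URL & Replace(sQuery, " ", "%20") & _
            "&PageNumber=" & CStr(lPage)
End Function

Sub DemoPageUrls()
  Dim lPage As Long
  For lPage = 1 To 3   'the range mentioned above is 1 To 999
    Debug.Print PageUrl("5960 regulator", lPage)
  Next lPage
End Sub
```

CStr is used rather than Str$ because Str$ puts a leading space in front of positive numbers, which would put a stray space into the URL.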

I think that by now you have it all figured out.

In snooping around, i have just stumbled on the ADODB scheme, and from
what little i have found so far it looks very promising.
Only one example, which does not work (examples NEVER work), and zero
explanations so far.
It seems that with the proper code, ADODB would allow me to copy
those first files to a HD.

Would you be so kind as to share your working ADODB code?
Or did you hand-copy the source like i did?

Thanks again.




  #3   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 1,182
Default Read (and parse) file on the web CORRECTION#2

You are getting all of the right stuff from what i would call the
second file.
The first file is
"https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber="
& PageNum where PageNum (in ASCII) goes from "1" to "999".
Note the (implied?) space in the URL.


I got Source, NSN Item#, Description from the 1st file. The NSN Item#
links to the 2nd file.

<snip>
Would you be so kind as to share your working ADODB code?
Or did you hand-copy the source like i did?


I use std VB file I/O not ADODB. Initial procedure was to copy/paste
page source into Textpad and save as Tmp.txt. Then load the file into
an array and parse from there.
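
A minimal sketch of that load step using only standard VB file I/O. ReadTextFile is a hypothetical name, the path is an assumed location for Tmp.txt, and the file is assumed to have Windows (CR+LF) line endings as saved from a Windows text editor.

```vb
Option Explicit

' Reads a whole text file into a string in one shot.
' Returns an empty string if the file does not exist.
Function ReadTextFile(ByVal sFile As String) As String
  Dim iNum As Integer, sText As String
  If Len(Dir$(sFile)) = 0 Then Exit Function
  iNum = FreeFile
  Open sFile For Binary Access Read As #iNum
  sText = Space$(LOF(iNum))   'buffer sized to the file length
  Get #iNum, , sText
  Close #iNum
  ReadTextFile = sText
End Function

Sub DemoLoadTmp()
  Dim vPgSrc As Variant
  ' Assumed path; Tmp.txt is the pasted page source saved from the editor
  vPgSrc = Split(ReadTextFile("C:\Temp\Tmp.txt"), vbCrLf)
  Debug.Print "Lines loaded: " & UBound(vPgSrc) + 1
End Sub
```

Once the lines are in the array, each one can be tested with InStr() for the search text of interest.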

I thought I'd take a look at going with a userform and MS Web Browser
control for more flexible programming opts, but haven't had the time. I
assume this would definitely give you an advantage over trying to
automate IE, but I need to research using it. I do have URL functions
built into my fpSpread.ocx for doing this stuff, but that's an
expensive 3rd party AX component. Otherwise, doing this from Excel
isn't something I'm familiar with.

--
Garry


  #4   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 1,182
Default Read (and parse) file on the web CORRECTION#2

I thought I'd take a look at going with a userform and MS Web Browser
control for more flexible programming opts


While I'm on pause waiting to consult with client on my current
project...

This is doable; I have a userform with a web browser, a textbox, and
some buttons.

The Web Browser doesn't display AddressBar/StatusBar for some reason,
even though these props are set 'True'. (Initial URL (pg1) is
hard-coded as a result) You navigate to parent pages using Next/Last
buttons, starting with pg1 on load. Optionally, you can enter a page#
in a GoTo box.

The browser lets you select links, and you load its current document
page source into txtViewSrc via btnViewSrc. This action also Splits()
page source into vPgSrc for locating search text selected in
cboSearchTxt. The cboSearchTxt_Change event auto-locates your desired
lines at present, but I will have it appending them to file shortly.
This file will be structured the same as illustrated earlier. I think this
could be fully automated after I see how the links are defined in their
innerHTML.

For now, I'll provide a button to write found lines because it gives an
opportunity to preview the data going into your file. This will happen
via loading found lines into vaLinesOut(), which is sized 0 to 3. This
will make the search sequence important so the output file has its
lines in the correct order (top-to-bottom in page source).
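
To illustrate the idea, a minimal sketch of writing one 4-line record follows. AppendRecord is a hypothetical name, the output folder (the user's TEMP folder) is an assumption, and the sample values are the first record shown earlier in the thread.

```vb
Option Explicit

' Appends one record (the 4 elements of vaLinesOut, in index order) to
' the output file, writing the fields line first if the file is new.
Sub AppendRecord(ByVal sFile As String, vaLinesOut As Variant)
  Dim iNum As Integer, bNew As Boolean
  bNew = (Len(Dir$(sFile)) = 0)
  iNum = FreeFile
  Open sFile For Append As #iNum
  If bNew Then Print #iNum, "Source,NSN Item#,Description,Part#,MCRL#"
  Print #iNum, Join(vaLinesOut, vbCrLf)
  Close #iNum
End Sub

Sub DemoAppendRecord()
  Dim vaLinesOut(0 To 3) As String
  ' Filled in the same order the lines appear in the page source
  vaLinesOut(0) = "Tektronix,5960-00-831-8683,ELECTRON TUBE,GV4S1400,4932653"
  vaLinesOut(1) = "<a href=""/CAGE/63060"">63060</a>"
  vaLinesOut(2) = "<a href=""/CAGE/63060""><img class=""img-thumbnail"" " & _
    "src=""https://placehold.it/90x45?text=No%0DImage%0DYet"" " & _
    "height=45 width=90 /></a>"
  vaLinesOut(3) = "<a title=""CAGE 63060"" href=""/CAGE/63060"">" & _
    "HEICO OHMITE LLC</a>"
  AppendRecord Environ$("TEMP") & "\ElectronTube_NSN5960.txt", vaLinesOut
End Sub
```

Because Join() writes the elements in index order, filling vaLinesOut(0) through vaLinesOut(3) in search order is what keeps each record's lines top-to-bottom in the file.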

I use my own file read/write procedures because they're configured for
large amounts of data in 1 shot to/from dbase.txt files, and so are
included in the userform class.

While there's still a manual element to this, it's going to be orders
of magnitude less daunting and more efficient than what you do now. It
seems highly likely over time that this entire task can be fully
automated just by entering the URL for pg1!

--
Garry


  #5   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 93
Default Read (and parse) file on the web CORRECTION#2

Way beyond me.
If in HTML one can copy an <a href="..." to the hard drive, then
that is all i need.



  #6   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 1,182
Default Read (and parse) file on the web CORRECTION#2

Just another approach, since you seem to be having difficulty getting
URLDownloadToFile() to work.

My approach reads innerHTML of web pages and outputs to txt file. Not
sure why you want to grab html and save to disc given the file size is
a concern. My approach puts parsed data from all 999 pages into a txt
file less than 4mb in size. Once the individual steps have been
optimized, automating the entire process will be easy. (I'll leave that
part for you to do however you want it to work)

I will post the contents of my fParseWebPages.frm file. You will need
to set a ref to the Microsoft Web Browser to use it.

--
Garry


  #7   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 93
Default Read (and parse) file on the web CORRECTION#2

GS wrote:
Just another approach, since you seem to be having difficulty getting
URLDownloadToFile() to work.

* snooped around the Excel VB help to find alternate ways (i remember
seeing other ways). Here was the first step:
SRC1$ = _
"https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=5"

''test..
With Worksheets(1)
.Hyperlinks.Add .Range("E5"), SRC1$
End With
**
That places the URL in cell E5 and it looked good.
Clicking once on it "to follow", i get the error message "Unable to
open...Cannot download the information you requested."
This may have the same 2 problems: the (implied) space, and the https.
So, to test those, i added a folder on one of my websites:
http://www.oil4lessllc.com/Try%20.space/
...and that works.
Therefore, this scheme is one better than the function.
Next, i tested the https with: https://duckduckgo.com/ .
I get that error message.
So i am still dead in the water.


My approach reads innerHTML of web pages and outputs to txt file.

* THAT is what i cannot do; AFAIK neither HTML5 nor javascript can
write to a file.

Not sure why you want to grab html and save to disc given the file
size is a concern.

* Well, when i first started, i got a bit scared because of those 7750
(appx) lines; first time i saw a web site that huge.
Now that i have processed 30 pages, i know better.

My approach puts parsed data from all 999 pages into a txt file
less than 4mb in size. Once the individual steps have been optimized,
automating the entire process will be easy. (I'll leave that part for
you to do however you want it to work)

* So far, in that direction, i have:
<header>
<a
href="https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=5"
id="NSN5860"> hello<BR></a>
<script>
var T = document.anchors;
document.getElementById("NSN5860").innerHTML = T;
// open an output stream in a new window
document.open("text/html",replace);
document.writeln(T+"|<BR>");
// display in new window
document.close();
</script>
<!-- above gets
[object HTMLCollection]
on new screen/window, where that/object item is the href pointer
-->


I will post the contents of my fParseWebPages.frm file. You will need to
set a ref to the Microsoft Web Browser to use it.


  #8   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 93
Default Read (and parse) file on the web CORRECTION#2

Check. I know QBASIC fairly well, so a lot of that knowledge crosses
over to VB.
Someone here was kind enough to give me a full working program that
can be used to copy a URL source to a temp file on the HD.
Once available all else is very simple and straight forward.
The rub is that the function (or something it uses) does not allow a
space in the URL, AND also does not allow https.
So, i need two work-arounds, and the https part would seem to be the
worst.
I do not know how it works, what DLLs/libraries it calls; no useful
information is available.
It is:
Declare Function URLDownloadToFile Lib "urlmon" _
Alias "URLDownloadToFileA" (ByVal pCaller As Long, _
ByVal szURL As String, ByVal szFileName As String, _
ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

Only the well-known keywords can be found; 'urlmon', 'pCaller',
'szURL', and 'szFileName' are unknowns and not findable in the so-called
VB help.
And there are no examples; the few ranDUMB ones are incomplete and/or
do not work..
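
For reference, those names are not in the VB help because they are not VB keywords: "urlmon" is the Windows library urlmon.dll, and the parameter names come from the Windows API function URLDownloadToFile. pCaller and lpfnCB (an optional progress callback) can both be passed as 0, dwReserved must be 0, szURL is the address and szFileName is the local file to write. The function returns 0 on success. Below is a minimal calling sketch; DownloadPage and the output file name are hypothetical, and whether an https address succeeds on a given machine is not something this sketch can promise.

```vb
Option Explicit

#If VBA7 Then
  ' Office 2010 and later (needed for 64-bit Office)
  Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
    Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, _
    ByVal szURL As String, ByVal szFileName As String, _
    ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
  Private Declare Function URLDownloadToFile Lib "urlmon" _
    Alias "URLDownloadToFileA" (ByVal pCaller As Long, _
    ByVal szURL As String, ByVal szFileName As String, _
    ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If

' Hypothetical wrapper: True when Windows reports success (0 = S_OK).
Function DownloadPage(ByVal sUrl As String, ByVal sFile As String) As Boolean
  DownloadPage = (URLDownloadToFile(0, sUrl, sFile, 0, 0) = 0)
End Function

Sub DemoDownload()
  Dim sUrl As String
  sUrl = "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=1"
  ' Assumed output location: the user's TEMP folder
  Debug.Print DownloadPage(sUrl, Environ$("TEMP") & "\Page1.htm")
End Sub
```

Checking the return value, rather than just calling the function, at least tells you whether the download itself failed or the problem lies in a later step.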

I do not see how you use std VB file I/O; AFAIK one cannot open a web
page as if it was a file.
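
One possible way around that, which is a different technique from the copy/paste route described earlier in the thread, is to fetch the page source into a string with the MSXML2.XMLHTTP object that comes with Windows, and then use ordinary file I/O on the result. The sketch below is an assumption-laden illustration: GetPageSource and SaveText are hypothetical names, the output file location is assumed, and the site may refuse or alter scripted requests.

```vb
Option Explicit

' Fetches a page's source as text. MSXML2.XMLHTTP is late-bound here,
' so no reference needs to be set. Returns "" if the reply is not OK.
Function GetPageSource(ByVal sUrl As String) As String
  Dim oHttp As Object
  Set oHttp = CreateObject("MSXML2.XMLHTTP")
  oHttp.Open "GET", sUrl, False   'False = wait for the reply
  oHttp.send
  If oHttp.Status = 200 Then GetPageSource = oHttp.responseText
End Function

' Writes a string to disk with standard VB file I/O.
Sub SaveText(ByVal sFile As String, ByVal sText As String)
  Dim iNum As Integer
  iNum = FreeFile
  Open sFile For Output As #iNum
  Print #iNum, sText;   'trailing semicolon: no extra line break added
  Close #iNum
End Sub

Sub DemoFetch()
  Dim sSrc As String, vPgSrc As Variant
  sSrc = GetPageSource( _
    "https://www.nsncenter.com/NSNSearch?q=5960%20regulator&PageNumber=1")
  If Len(sSrc) = 0 Then Debug.Print "No page returned": Exit Sub
  SaveText Environ$("TEMP") & "\Tmp.txt", sSrc   'assumed location
  vPgSrc = Split(sSrc, vbLf)
  Debug.Print "Lines: " & UBound(vPgSrc) + 1
End Sub
```

The web page is never opened "as a file" here; the text arrives as a string, and the file on disk is only a copy of it.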



