#1
Posted to microsoft.public.excel.programming,microsoft.public.excel
Read (and parse) file on the web

GS wrote:
but does not have to have the date and time; that would fill the HD
since I have thousands of files to process.


Filenames have nothing to do with storage space; it's the file size!
Given Auric_'s suggestion creates text files, the size of 999 txt files
would hardly be more than 1MB total! If you append each page to the 1st
file then all pages could be in 1 file...

True, BUT the files can be large:
"result = URLDownloadToFile(0, S$, tmp, 0, 0)" creates a file in TEMP
the size of the source, which can be multiple megabytes; 999 of them can
eat the HD space fast.
Hopefully the size of a downloaded file does not exceed the maximum
string length allowed in Excel 2003 (anyone know what that might be?).

I have found that the string value, AKA the TEMP filename, can be fixed
to anything reasonable, and does not have to include
parts/substrings/subsets of the file one wants to download.

It can be "a good thing" (quoting Martha Stewart) to delete the file
when done.
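
A minimal sketch of the download, read, delete cycle described above, written in Python rather than VBA so it can be run anywhere. The function name fetch_text and the temp file name page_download.tmp are made up for illustration; urllib.request.urlretrieve stands in for URLDownloadToFile and is not the same API.

```python
import os
import tempfile
import urllib.request


def fetch_text(url: str, encoding: str = "utf-8") -> str:
    """Download url to a fixed-name temp file, read it, then delete the file."""
    # A fixed name is fine here: the temp copy never outlives this call.
    tmp_path = os.path.join(tempfile.gettempdir(), "page_download.tmp")
    try:
        urllib.request.urlretrieve(url, tmp_path)
        with open(tmp_path, "r", encoding=encoding, errors="replace") as fh:
            return fh.read()
    finally:
        # Remove the temp copy so repeated downloads do not fill the disk.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Because the same temp name is reused and removed on every call, 999 downloads never occupy more than one file's worth of space at a time.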

I have also found the following:
1) one does not have to use FreeFile for a file number (when all else is
OK).
2) cannot use "contents" for string storage space.
3) one cannot mix use of "/" and "\" in a string for a given file name.
4) one cannot have a space in the file name, so that gives a serious
problem for some web URLs (work-around anyone?)
5) method fails for "https:" (work-around anyone?)
#2

I have also found the following:
1) one does not have to use FreeFile for a file number (when all else
is OK).


True, however not considered 'best practice'. FreeFile() returns the
next file number that is not already in use.

2) cannot use "contents" for string storage space.

Why not? It's not a VB[A] keyword and so qualifies for use as a var.


3) one cannot mix use of "/" and "\" in a string for a given file
name.


Not sure why you'd use "/" in a path string! Forward slash is not a
legal character in a Windows filename. Backslash is the default Windows
path delimiter. If choosing folders, the last backslash is not followed
by a filename.
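
For comparison only, here is a small Python sketch of the two delimiter conventions. It shows how Python's pathlib treats a Windows path that mixes "/" and "\"; it says nothing about how VBA's Open statement or URLDownloadToFile behave. The helper names and the C:/Temp path are illustrative.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlsplit


def normalize_windows_path(path: str) -> str:
    """Return path with every separator rewritten as a backslash."""
    return str(PureWindowsPath(path))


def url_segments(url: str) -> list:
    """Return the non-empty '/'-separated segments of a URL's path."""
    return [seg for seg in urlsplit(url).path.split("/") if seg]


print(normalize_windows_path("C:/Temp\\pages/pg5.txt"))  # C:\Temp\pages\pg5.txt
print(url_segments("https://www.nsncenter.com/NSN/5960-00-754-5782"))
```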

4) one cannot have a space in the file name, so that gives a serious
problem for some web URLs (work-around anyone?)


URLs encode spaces so the string is contiguous. I believe the
replacement is "%20" in the path, OR "+" in a query string.
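
A short Python sketch of that encoding, using the standard urllib.parse module. The helper encode_url and the example.com address are hypothetical; the idea is only to show how a URL containing spaces can be made contiguous before it is handed to a downloader.

```python
from urllib.parse import quote, quote_plus, urlsplit, urlunsplit


def encode_url(url: str) -> str:
    """Percent-encode unsafe characters in the path and query of url."""
    parts = urlsplit(url)
    # "%" stays safe so an already-encoded URL is not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(quote("electron tube"))       # electron%20tube
print(quote_plus("electron tube"))  # electron+tube
print(encode_url("http://example.com/parts list/page 5.htm?q=electron tube"))
```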

5) method fails for "https:" (work-around anyone?)


What method?

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion

#3

GS wrote:
I have also found the following:
1) one does not have to use FreeFile for a file number (when all else
is OK).


True, however not considered 'best practice'. FreeFile() returns the
next file number that is not already in use.

2) cannot use "contents" for string storage space.

Why not? It's not a VB[A] keyword and so qualifies for use as a var.

* Get run-time error 458, "variable uses an Automation type not
supported in Visual Basic".



3) one cannot mix use of "/" and "\" in a string for a given file name.


Not sure why you'd use "/" in a path string! Forward slash is not a
legal character in a Windows filename. Backslash is the default Windows
path delimiter. If choosing folders, the last backslash is not followed
by a filename.

I note that in Windoze "\" is used, and on the web "/" is used.


4) one cannot have a space in the file name, so that gives a serious
problem for some web URLs (work-around anyone?)


URLs encode spaces so the string is contiguous. I believe the
replacement is "%20" in the path, OR "+" in a query string.

Yes; "%20" is used and seems to act like a space, and it seems to kill
the usefulness of the URLDownloadToFile function.


5) method fails for "https:" (work-around anyone?)


What method?

* the function URLDownloadToFile. Is "method" the wrong term?




#4

I note that in Windoze "\" is used, and on the web "/" is used.


For clarity, a Windows file path is not the same as a URL.

Windows file paths allow spaces; URLs do not.
Windows path delimiter is "\"; Web path delimiter is "/".

The function URLDownloadToFile() downloads szURL to szFilename.

Once downloaded, szFilename needs to be opened, parsed, and result
stored locally. Then the next file needs to be downloaded, parsed, and
stored. And so on until all files have been downloaded and parsed.
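
The download, parse, store loop can be sketched roughly as follows, in Python rather than VBA. Everything here is illustrative: harvest, fetch, and parse are hypothetical names, and the two callables are passed in so the loop itself makes no assumption about how a page is downloaded or what the parser keeps. Appending every page's result to one output file follows the single-file suggestion made earlier in the thread.

```python
from typing import Callable, Iterable


def harvest(urls: Iterable[str],
            fetch: Callable[[str], str],
            parse: Callable[[str], Iterable[str]],
            out_path: str) -> int:
    """Fetch each URL, parse it, and append the parsed lines to out_path.

    Returns the number of lines written.
    """
    written = 0
    # Mode "a" appends, so every page ends up in the same file.
    with open(out_path, "a", encoding="utf-8") as out:
        for url in urls:
            page = fetch(url)
            for line in parse(page):
                out.write(line + "\n")
                written += 1
    return written
```

Only the parsed text is kept, so the output grows by a few hundred bytes per page rather than by the full size of each download.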

Since the actual page contents comprise only a small portion of the
files being downloaded, their size should be considerably smaller after
parsing. If you extract the data (text) only (no images) and save this
to a txt file you should be able to 'append' to a single file, which
would occupy far less disc space. (For example, pg5 is less than 1kb.)
I suspect, though, that you need the image to identify the item source
(Raytheon, RCA, Lucent Tech, NAWC, MIL STD, etc) because this info is
not stored in the image file metadata. Otherwise, the txt file after
parsing pg5's text is the following 53 lines:

NSN 5960-00-509-3171
5960-00-509-3171

ELECTRON TUBE

NSN 5960-00-569-9531
5960-00-569-9531

ELECTRON TUBE

NSN 5960-00-553-3770
5960-00-553-3770

ELECTRON TUBE

NSN 5960-00-682-8624
5960-00-682-8624

ELECTRON TUBE

NSN 5960-00-808-6928
5960-00-808-6928

ELECTRON TUBE

NSN 5960-00-766-1953
5960-00-766-1953

ELECTRON TUBE

NSN 5960-00-850-6169
5960-00-850-6169

ELECTRON TUBE

NSN 5960-00-679-8153
5960-00-679-8153

ELECTRON TUBE

NSN 5960-00-134-6884
5960-00-134-6884

ELECTRON TUBE

NSN 5960-00-061-8610
5960-00-061-8610

ELECTRON TUBE

5960-00-067-9636

ELECTRON TUBE

The file size is 711 bytes, and lists 11 items. Note the last item has
no image and so no filler text (NSN line). This inconsistency makes
parsing the contents difficult since you don't know which items do not
have images.
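
One way to cope with the missing filler line, sketched in Python: treat the "NSN ..." line as optional and key each item on the bare part number instead. The function name parse_listing and the number pattern are assumptions based only on the listing above, and the sketch assumes no description begins with "NSN ".

```python
import re

# Pattern inferred from the listing above: 4-2-3-4 digit groups.
NSN_RE = re.compile(r"^\d{4}-\d{2}-\d{3}-\d{4}$")


def parse_listing(text: str) -> list:
    """Return (part_number, description) pairs from the parsed page text."""
    items = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("NSN "):
            continue  # blank separator or image filler text
        if NSN_RE.match(line):
            current = line  # a bare part number starts a new item
        elif current is not None:
            items.append((current, line))  # next text line is the description
            current = None
    return items
```

With this approach an item without an image parses the same way as one with an image, since the filler line is never relied upon.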

If you copy/paste pg5 into Excel you get both text and image. You could
then do something to construct the info in a database fashion...

Col Headers:
Source :: PartNum :: Description

...and put the data in the respective columns. This seems very
inefficient but is probably less daunting than what you've been doing
manually thus far. Auto Complete should be helpful with this, and you
could sort the list by Source. Note that clicking the image or part# on
the worksheet takes you to the same page as does clicking it on the web
page. In the case of pg5, the data will occupy 11 rows.

Seems like your approach is the long way; I'd find a better data
source myself! Perhaps subscribe to an electronics database utility
(such as my CAD software would use) that I can update by downloading a
single db file <g>

--
Garry


#5

GS wrote:
I note that in Windoze "\" is used, and on the web "/" is used.


For clarity, a Windows file path is not the same as a URL.

Windows file paths allow spaces; URLs do not.

* Incorrect! See -------------------------vvv
https://www.nsncenter.com/NSNSearch?...r&PageNumber=5

Windows path delimiter is "\"; Web path delimiter is "/".

The function URLDownloadToFile() downloads szURL to szFilename.

* IF and when it works.


Once downloaded, szFilename needs to be opened, parsed, and result
stored locally. Then the next file needs to be downloaded, parsed, and
stored. And so on until all files have been downloaded and parsed.

* This I knew from the get-go; nice to have it clarified.


Since the actual page contents comprise only a small portion of the
files being downloaded, their size should be considerably smaller after
parsing. If you extract the data (text) only (no images) and save this
to a txt file you should be able to 'append' to a single file, which
would occupy far less disc space. (For example, pg5 is less than 1kb.)
I suspect, though, that you need the image to identify the item source
(Raytheon, RCA, Lucent Tech, NAWC, MIL STD, etc) because this info is
not stored in the image file metadata. Otherwise, the txt file after
parsing pg5's text is the following 53 lines:

[53-line listing snipped; identical to the one in post #4]

The file size is 711 bytes, and lists 11 items. Note the last item has
no image and so no filler text (NSN line). This inconsistency makes
parsing the contents difficult since you don't know which items do not
have images.

* I think you may have pulled the info from what you saw on that page,
and not from the source.
In one of my responses, I gave QBASIC code for parsing, and as I
remember, there were about 7760 lines of junk before one sees <a
href="/NSN/5960; which gives the full NSN code.
Use of that allows one to get the second URL, e.g.:
https://www.nsncenter.com/NSN/5960-00-754-5782 with NO image reliance
at all.
There are 11 entries per page, and no inconsistencies with my method
of searching the page.
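
A rough Python equivalent of that search, for readers without the QBASIC code to hand. It scans the page source for <a href="/NSN/... links and builds the second URL from each one. The names detail_urls, HREF_RE, and BASE are made up, the 4-2-3-4 digit pattern is inferred from the examples in this thread, and the exact markup around the link is an assumption.

```python
import re

# Matches the start of a link such as <a href="/NSN/5960-00-754-5782
HREF_RE = re.compile(r'<a\s+href="/NSN/(\d{4}-\d{2}-\d{3}-\d{4})', re.IGNORECASE)
BASE = "https://www.nsncenter.com/NSN/"


def detail_urls(html: str) -> list:
    """Return one detail-page URL per distinct NSN, in page order."""
    seen = []
    for nsn in HREF_RE.findall(html):
        # An item may be linked more than once (image and part number).
        if nsn not in seen:
            seen.append(nsn)
    return [BASE + nsn for nsn in seen]
```

Because it keys on the href rather than on the visible text, it does not matter whether an item has an image.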


If you copy/paste pg5 into Excel you get both text and image. You could
then do something to construct the info in a database fashion...

* That would only make things more difficult. A copy to a local file is
sufficient for a simple parsing as described here and elsewhere in this
thread.


Col Headers:
Source :: PartNum :: Description

...and put the data in the respective columns. This seems very
inefficient but is probably less daunting than what you've been doing
manually thus far. Auto Complete should be helpful with this, and you
could sort the list by Source. Note that clicking the image or part# on
the worksheet takes you to the same page as does clicking it on the web
page. In the case of pg5, the data will occupy 11 rows.

* Manual: Right click, select View Page Source, then Save As to HD,
changing the file type from HTM to TXT and changing the filename to add
the page number (013 for example).
Like I said, parsing of that file is simple and easy; getting 35 pages
copied that way did not take long, but there are 999 of them...


Seems like your approach is the long way; I'd find a better data source
myself! Perhaps subscribe to an electronics database utility (such as my
CAD software would use) that I can update by downloading a single db
file <g>

* I have asked, and received zero response.




#6

5) method fails for "https:" (work-around anyone?)

These URLs usually require some kind of 'login' be done, which needs to
be included in the URL string.

--
Garry


#7

GS wrote:
5) method fails for "https:" (work-around anyone?)


These URLs usually require some kind of 'login' be done, which needs to
be included in the URL string.

NO login required; try it.
