Posted to microsoft.public.excel.programming,microsoft.public.excel
GS[_6_]
Read (and parse) file on the web

> I note that in Windoze, "\" is used, and on the web, "/" is
> used.


For clarity, a Windows file path is not the same as a URL.

Windows file paths allow literal spaces; URLs do not (a space in a URL
must be percent-encoded as %20).
Windows path delimiter is "\"; Web path delimiter is "/".
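
To make the two differences concrete, here is a minimal sketch in Python (used only for brevity; the same two steps port directly to VBA). The function name windows_path_to_url_path and the sample path are made up for illustration.

```python
from urllib.parse import quote


def windows_path_to_url_path(path):
    """Turn a Windows-style path into URL-style path text.

    Two things change: "\\" becomes "/", and characters that a URL
    cannot carry literally (such as spaces) are percent-encoded.
    """
    forward = path.replace("\\", "/")
    # Keep "/" and the drive-letter ":" as they are; encode the rest.
    return quote(forward, safe="/:")


# Hypothetical sample path, not taken from the thread.
print(windows_path_to_url_path(r"C:\My Files\pg5.htm"))
```

Going the other way (URL to local filename) needs the reverse: decode %20 and friends, then swap the delimiters.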

The function URLDownloadToFile() downloads szURL to szFilename.

Once downloaded, szFilename needs to be opened and parsed, and the
result stored locally. Then the next file needs to be downloaded,
parsed, and stored, and so on until all files have been processed.
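
The download/parse/store loop could be sketched roughly as follows. This is Python, not VBA, and urlretrieve merely stands in for URLDownloadToFile(); the names process_pages, parse and download are assumptions for the example, and the real parsing step depends on the page markup.

```python
import os
import tempfile
from urllib.request import urlretrieve


def process_pages(urls, out_path, parse, download=urlretrieve):
    """Download each URL, parse it, and append the result to one file.

    parse    -- function taking the page text and returning the text to keep
    download -- function(url, filename); defaults to urlretrieve, playing
                the role URLDownloadToFile() plays in VBA
    """
    with open(out_path, "a", encoding="utf-8") as out:
        for url in urls:
            # Temporary local file for the raw page (the "szFilename" role).
            fd, tmp = tempfile.mkstemp(suffix=".htm")
            os.close(fd)
            try:
                download(url, tmp)
                with open(tmp, encoding="utf-8", errors="replace") as page:
                    out.write(parse(page.read()))
            finally:
                # The raw page is discarded; only the parsed text is kept.
                os.remove(tmp)
```

Passing the downloader in as a parameter is just a convenience so the loop can be tried out without touching the network.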

Since the actual page contents comprise only a small portion of the
files being downloaded, their size should be considerably smaller after
parsing. If you extract only the data (text, no images) and save it to
a txt file, you should be able to append each page to a single file,
which would occupy far less disk space. (For example, pg5 is less than
1 KB.) I suspect, though, that you need the image to identify the item
source (Raytheon, RCA, Lucent Tech, NAWC, MIL STD, etc.) because this
info is not stored in the image file metadata. Otherwise, the txt file
after parsing pg5's text is the following 53 lines:

NSN 5960-00-509-3171
5960-00-509-3171

ELECTRON TUBE

NSN 5960-00-569-9531
5960-00-569-9531

ELECTRON TUBE

NSN 5960-00-553-3770
5960-00-553-3770

ELECTRON TUBE

NSN 5960-00-682-8624
5960-00-682-8624

ELECTRON TUBE

NSN 5960-00-808-6928
5960-00-808-6928

ELECTRON TUBE

NSN 5960-00-766-1953
5960-00-766-1953

ELECTRON TUBE

NSN 5960-00-850-6169
5960-00-850-6169

ELECTRON TUBE

NSN 5960-00-679-8153
5960-00-679-8153

ELECTRON TUBE

NSN 5960-00-134-6884
5960-00-134-6884

ELECTRON TUBE

NSN 5960-00-061-8610
5960-00-061-8610

ELECTRON TUBE

5960-00-067-9636

ELECTRON TUBE

The file size is 711 bytes, and lists 11 items. Note the last item has
no image and so no filler text (NSN line). This inconsistency makes
parsing the contents difficult since you don't know which items do not
have images.
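
One possible way around the inconsistency is to key on the bare part number line rather than on the NSN filler line, and to record whether the filler line was there. A rough sketch in Python (not VBA), assuming the layout shown above: an optional "NSN <number>" line, then the number on its own line, then the description as the next non-blank line. The name parse_items is made up for the example.

```python
import re

# A part number alone on a line, e.g. 5960-00-509-3171
PART_RE = re.compile(r"^\d{4}-\d{2}-\d{3}-\d{4}$")


def parse_items(text):
    """Return (part_number, description, has_image) tuples.

    has_image is True when the part number is preceded by its
    "NSN <number>" filler line, which only appears for items with an image.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    items = []
    for i, line in enumerate(lines):
        if PART_RE.match(line):
            # Assumption: the description is the next non-blank line.
            desc = lines[i + 1] if i + 1 < len(lines) else ""
            has_image = i > 0 and lines[i - 1] == "NSN " + line
            items.append((line, desc, has_image))
    return items
```

The "NSN ..." lines never match the pattern because it is anchored at the start of the line, so they are only used as the has-image flag.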

If you copy/paste pg5 into Excel you get both text and image. You could
then do something to construct the info in a database fashion...

Col Headers:
Source :: PartNum :: Description

...and put the data in the respective columns. This seems very
inefficient but is probably less daunting than what you've been doing
manually thus far. Auto Complete should be helpful with this, and you
could sort the list by Source. Note that clicking the image or part# on
the worksheet takes you to the same page as does clicking it on the web
page. In the case of pg5, the data will occupy 11 rows.
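
If the list is built by code rather than by copy/paste, the same three columns can be written to a CSV file that Excel opens directly. A minimal sketch in Python (not VBA); the function name write_parts_csv is hypothetical, and Source is left blank by default because it can only be read off the image.

```python
import csv
import io


def write_parts_csv(items, out, source=""):
    """Write (part_number, description) pairs as Source/PartNum/Description rows."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "PartNum", "Description"])
    for part, desc in items:
        # Source is filled in by hand later unless a value is supplied.
        writer.writerow([source, part, desc])


buf = io.StringIO()
write_parts_csv([("5960-00-067-9636", "ELECTRON TUBE")], buf)
print(buf.getvalue())
```

To write a real file instead of the in-memory buffer, pass a file opened with newline="" as out.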

Seems like your approach is the long way; I'd find a better data
source myself! Perhaps subscribe to an electronics database utility
(such as my CAD software would use) that I can update by downloading a
single db file <g>

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion
