#1   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel.misc
rob
 
Posts: n/a
Default Parse csv files

Here is what I want to do. From the internet I download some data that
are in csv format. All data will be in one long string. Now I need to
extract every cell. The problem ist that some cell content contains
commas and/or double quotes. Some of the cell contents that contain
double quotes use double double quotes and others don't, i.e. some look
like

"this "item" is bad", "this item is ok"

and others like

"this ""item"" is bad", "this item is ok"

There is also a chance that some cell are in double quotes (if they
contains commas or double quotes) and others are not in double quotes
(if they do not contain commas or double quotes). Considering all this
(and possibly more stuff) parsing becomes non trivial.

As a first approach I stored the content downloaded into a file and
then use odbc like this:

connectionString = @"Driver={Microsoft Text Driver (*.txt;
*.csv)};DBQ=" + Path.GetDirectoryName(filename);
connection = new OdbcConnection(connectionString);
connection.Open();
command = new OdbcCommand("Select * FROM " +
Path.GetFileName(filename), connection);
reader = command.ExecuteReader();

Unfortunately, this approach does not work for the above scenarios.
Excel reads the files in question just fine, though. So my question is
what is the best approach to read csv files, preferably without having
to create temporary files?

Thanks

  #2   Report Post  
Posted to microsoft.public.excel.misc
Bryan Hessey
 
Posts: n/a
Default Parse csv files


My understanding is that a .csv file has data fields separated by a
comma character, and if the field contains a comma then that field is
enclosed in quotes (being double-quotes), and if the field contains
quotes then those are indicated by two consecutive quotes.

The site http://en.wikipedia.org/wiki/Comma-separated_values perhaps
better explains .csv files, and has a pointer to required drivers.

Hope this helps.

--

rob Wrote:
Here is what I want to do. From the internet I download some data that
are in csv format. All data will be in one long string. Now I need to
extract every cell. The problem ist that some cell content contains
commas and/or double quotes. Some of the cell contents that contain
double quotes use double double quotes and others don't, i.e. some
look
like

"this "item" is bad", "this item is ok"

and others like

"this ""item"" is bad", "this item is ok"

There is also a chance that some cell are in double quotes (if they
contains commas or double quotes) and others are not in double quotes
(if they do not contain commas or double quotes). Considering all this
(and possibly more stuff) parsing becomes non trivial.

As a first approach I stored the content downloaded into a file and
then use odbc like this:

connectionString = @"Driver={Microsoft Text Driver (*.txt;
*.csv)};DBQ=" + Path.GetDirectoryName(filename);
connection = new OdbcConnection(connectionString);
connection.Open();
command = new OdbcCommand("Select * FROM " +
Path.GetFileName(filename), connection);
reader = command.ExecuteReader();

Unfortunately, this approach does not work for the above scenarios.
Excel reads the files in question just fine, though. So my question is
what is the best approach to read csv files, preferably without having
to create temporary files?

Thanks



--
Bryan Hessey
------------------------------------------------------------------------
Bryan Hessey's Profile: http://www.excelforum.com/member.php...o&userid=21059
View this thread: http://www.excelforum.com/showthread...hreadid=537641

  #3   Report Post  
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
rob
 
Posts: n/a
Default Parse csv files

Bryan,

This is also my understanding of csv files (though some people say that
for double quotes you don't need double double quotes). In any case,
the problem is that most cells use double double quotes for double
quotes. Unfortunately, some cells don't seem to be formated right and
don't enclose double quotes into double double quotes as outlined in my
example. I know that Excel can load it just fine but with odbc it does
not work. Any idea if the parsing algorithm used by Excel is somehow
accessible through the .NET framework (I was hoping Excel uses odbc).

Thanks


Bryan Hessey wrote:
My understanding is that a .csv file has data fields separated by a
comma character, and if the field contains a comma then that field is
enclosed in quotes (being double-quotes), and if the field contains
quotes then those are indicated by two consecutive quotes.

The site http://en.wikipedia.org/wiki/Comma-separated_values perhaps
better explains .csv files, and has a pointer to required drivers.

Hope this helps.

--

rob Wrote:
Here is what I want to do. From the internet I download some data that
are in csv format. All data will be in one long string. Now I need to
extract every cell. The problem ist that some cell content contains
commas and/or double quotes. Some of the cell contents that contain
double quotes use double double quotes and others don't, i.e. some
look
like

"this "item" is bad", "this item is ok"

and others like

"this ""item"" is bad", "this item is ok"

There is also a chance that some cell are in double quotes (if they
contains commas or double quotes) and others are not in double quotes
(if they do not contain commas or double quotes). Considering all this
(and possibly more stuff) parsing becomes non trivial.

As a first approach I stored the content downloaded into a file and
then use odbc like this:

connectionString = @"Driver={Microsoft Text Driver (*.txt;
*.csv)};DBQ=" + Path.GetDirectoryName(filename);
connection = new OdbcConnection(connectionString);
connection.Open();
command = new OdbcCommand("Select * FROM " +
Path.GetFileName(filename), connection);
reader = command.ExecuteReader();

Unfortunately, this approach does not work for the above scenarios.
Excel reads the files in question just fine, though. So my question is
what is the best approach to read csv files, preferably without having
to create temporary files?


  #4   Report Post  
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
rob
 
Posts: n/a
Default Parse csv files

Bryan,

This is also my understanding of csv files (though some people say that
for double quotes you don't need double double quotes). In any case,
the problem is that most cells use double double quotes for double
quotes. Unfortunately, some cells don't seem to be formated right and
don't enclose double quotes into double double quotes as outlined in my
example. I know that Excel can load it just fine but with odbc it does
not work. Any idea if the parsing algorithm used by Excel is somehow
accessible through the .NET framework (I was hoping Excel uses odbc).

Thanks


Bryan Hessey wrote:
My understanding is that a .csv file has data fields separated by a
comma character, and if the field contains a comma then that field is
enclosed in quotes (being double-quotes), and if the field contains
quotes then those are indicated by two consecutive quotes.

The site http://en.wikipedia.org/wiki/Comma-separated_values perhaps
better explains .csv files, and has a pointer to required drivers.

Hope this helps.

--

rob Wrote:
Here is what I want to do. From the internet I download some data that
are in csv format. All data will be in one long string. Now I need to
extract every cell. The problem ist that some cell content contains
commas and/or double quotes. Some of the cell contents that contain
double quotes use double double quotes and others don't, i.e. some
look
like

"this "item" is bad", "this item is ok"

and others like

"this ""item"" is bad", "this item is ok"

There is also a chance that some cell are in double quotes (if they
contains commas or double quotes) and others are not in double quotes
(if they do not contain commas or double quotes). Considering all this
(and possibly more stuff) parsing becomes non trivial.

As a first approach I stored the content downloaded into a file and
then use odbc like this:

connectionString = @"Driver={Microsoft Text Driver (*.txt;
*.csv)};DBQ=" + Path.GetDirectoryName(filename);
connection = new OdbcConnection(connectionString);
connection.Open();
command = new OdbcCommand("Select * FROM " +
Path.GetFileName(filename), connection);
reader = command.ExecuteReader();

Unfortunately, this approach does not work for the above scenarios.
Excel reads the files in question just fine, though. So my question is
what is the best approach to read csv files, preferably without having
to create temporary files?


  #5   Report Post  
Posted to microsoft.public.excel.misc
Bryan Hessey
 
Posts: n/a
Default Parse csv files


Beyond my scope, but
http://www.creativyst.com/Doc/Articl...m#CSVariations and
it's comments on Excel might suggest that it doesn't.

--

rob Wrote:
Bryan,

This is also my understanding of csv files (though some people say
that
for double quotes you don't need double double quotes). In any case,
the problem is that most cells use double double quotes for double
quotes. Unfortunately, some cells don't seem to be formated right and
don't enclose double quotes into double double quotes as outlined in
my
example. I know that Excel can load it just fine but with odbc it does
not work. Any idea if the parsing algorithm used by Excel is somehow
accessible through the .NET framework (I was hoping Excel uses odbc).

Thanks


Bryan Hessey wrote:
My understanding is that a .csv file has data fields separated by a
comma character, and if the field contains a comma then that field

is
enclosed in quotes (being double-quotes), and if the field contains
quotes then those are indicated by two consecutive quotes.

The site http://en.wikipedia.org/wiki/Comma-separated_values perhaps
better explains .csv files, and has a pointer to required drivers.

Hope this helps.

--

rob Wrote:
Here is what I want to do. From the internet I download some data

that
are in csv format. All data will be in one long string. Now I need

to
extract every cell. The problem ist that some cell content

contains
commas and/or double quotes. Some of the cell contents that

contain
double quotes use double double quotes and others don't, i.e. some
look
like

"this "item" is bad", "this item is ok"

and others like

"this ""item"" is bad", "this item is ok"

There is also a chance that some cell are in double quotes (if

they
contains commas or double quotes) and others are not in double

quotes
(if they do not contain commas or double quotes). Considering all

this
(and possibly more stuff) parsing becomes non trivial.

As a first approach I stored the content downloaded into a file

and
then use odbc like this:

connectionString = @"Driver={Microsoft Text Driver (*.txt;
*.csv)};DBQ=" + Path.GetDirectoryName(filename);
connection = new OdbcConnection(connectionString);
connection.Open();
command = new OdbcCommand("Select * FROM " +
Path.GetFileName(filename), connection);
reader = command.ExecuteReader();

Unfortunately, this approach does not work for the above

scenarios.
Excel reads the files in question just fine, though. So my question

is
what is the best approach to read csv files, preferably without

having
to create temporary files?



--
Bryan Hessey
------------------------------------------------------------------------
Bryan Hessey's Profile: http://www.excelforum.com/member.php...o&userid=21059
View this thread: http://www.excelforum.com/showthread...hreadid=537641



  #6   Report Post  
Posted to microsoft.public.excel.misc,microsoft.public.excel.programming
AA2e72E
 
Posts: n/a
Default Parse csv files

An (Excel) CSV file is:
1. simply a text file with columns delimited by a comma
2. simply a text file with columns delimited by a comma and text values
enclosed in double quotes.
In 1, a missing value is indicated by two contiguous commas, and in 2. a
missing value is indicated by ,"",
Your CSV file is much more complicated (dos not follow the patter 1 or 2)
and you appear to be using ADO.NET. Although you have ruled it out, you might
have to pre-process your file as 1 or 2 above before you can execute SQL
queries. You might find that when the SQL query works, it ignores the first
row (unless it contains or you add column names in the first row).



"rob" wrote:

Here is what I want to do. From the internet I download some data that
are in csv format. All data will be in one long string. Now I need to
extract every cell. The problem ist that some cell content contains
commas and/or double quotes. Some of the cell contents that contain
double quotes use double double quotes and others don't, i.e. some look
like

"this "item" is bad", "this item is ok"

and others like

"this ""item"" is bad", "this item is ok"

There is also a chance that some cell are in double quotes (if they
contains commas or double quotes) and others are not in double quotes
(if they do not contain commas or double quotes). Considering all this
(and possibly more stuff) parsing becomes non trivial.

As a first approach I stored the content downloaded into a file and
then use odbc like this:

connectionString = @"Driver={Microsoft Text Driver (*.txt;
*.csv)};DBQ=" + Path.GetDirectoryName(filename);
connection = new OdbcConnection(connectionString);
connection.Open();
command = new OdbcCommand("Select * FROM " +
Path.GetFileName(filename), connection);
reader = command.ExecuteReader();

Unfortunately, this approach does not work for the above scenarios.
Excel reads the files in question just fine, though. So my question is
what is the best approach to read csv files, preferably without having
to create temporary files?

Thanks


  #7   Report Post  
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
james
 
Posts: n/a
Default Parse csv files


"rob" wrote in message
ups.com...
Bryan,

This is also my understanding of csv files (though some people say that
for double quotes you don't need double double quotes). In any case,
the problem is that most cells use double double quotes for double
quotes. Unfortunately, some cells don't seem to be formated right and
don't enclose double quotes into double double quotes as outlined in my
example. I know that Excel can load it just fine but with odbc it does
not work. Any idea if the parsing algorithm used by Excel is somehow
accessible through the .NET framework (I was hoping Excel uses odbc).

Thanks



Is this CSV file(s) going to be an on-going thing? If not, and since you
say it
works fine in Excel, how about opening it in Excel and then re-exporting it
to
either another CSV file or TabDelimited file and then running your code
against that?
That way, you may be able to get the data out in the formatting you need for
your
application.
james


  #8   Report Post  
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
rob
 
Posts: n/a
Default Parse csv files

This will be an ongoing thing. I was thinking about opening it in Excel
(programatically) and store it with tabs but there is a chance some of
the cell content contains tabs as well. Another option would be to open
the file in Excel (programatically) and then read out the cells. This
is a huge overkill, though. Worse it's something that will run on a
server. Although it usually runs during off-peak times I still don't
like the idea doing it this way. Therefore, I might really end up
writing my own parser. Thanks for the input, though. It's definitely
appreciated.

  #9   Report Post  
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
james
 
Posts: n/a
Default Parse csv files


"rob" wrote in message
ups.com...
This will be an ongoing thing. I was thinking about opening it in Excel
(programatically) and store it with tabs but there is a chance some of
the cell content contains tabs as well. Another option would be to open
the file in Excel (programatically) and then read out the cells. This
is a huge overkill, though. Worse it's something that will run on a
server. Although it usually runs during off-peak times I still don't
like the idea doing it this way. Therefore, I might really end up
writing my own parser. Thanks for the input, though. It's definitely
appreciated.


Your welcome. Sorry I couldn't offer any better suggestion. It does sound as
though you will have to write some sort of parser yourself. The scary part
is
you have no control over how the original file is created and what sort of
input
it allows. And with that being the case, it seems that writing a parser
routine that
will consistantly return the correct output, will be tough. Especially if
the app will
be running unattended. Otherwise you could write an editor of sorts that
would allow
a user to make some decisions on what to keep and what to discard. That
would go
a long way towards increasing the dependability of the app. (as long as the
end uers
understands what needs to be retained and what needs to be discarded)
I don't envy you on this one!!!
james


  #10   Report Post  
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
JasonS
 
Posts: n/a
Default Parse csv files

Try this:
http://www.codeproject.com/cs/database/CsvReader.asp

Cheers,
Jason

On 1 May 2006 09:34:19 -0700, rob wrote:

This will be an ongoing thing. I was thinking about opening it in Excel
(programatically) and store it with tabs but there is a chance some of
the cell content contains tabs as well. Another option would be to open
the file in Excel (programatically) and then read out the cells. This
is a huge overkill, though. Worse it's something that will run on a
server. Although it usually runs during off-peak times I still don't
like the idea doing it this way. Therefore, I might really end up
writing my own parser. Thanks for the input, though. It's definitely
appreciated.



  #11   Report Post  
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
rob
 
Posts: n/a
Default Parse csv files

Jason,

Thanks a lot for this link. It sounds promising and I'll give it a try.
I hope that'll safe my day...:)

  #12   Report Post  
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
rob
 
Posts: n/a
Default Parse csv files

I did try the CsvReader and it fails with the exact same item as in the
odbc approach. One additional thing I realized is that many items/cells
do have words in double quotes. As far as I know double quotes should
be put in double double quotes. The data I have sometimes does that but
often it does not. Both approaches (odbc, cvsReader) work fine (can
read though don't render correctly) if the cell in question is the last
one. If it is not the last one both odbc and cvsReader get thrown off,
though.

So the conclusion here is that from all I know only Excel can handle
things in a way that works out but that parser is not directly
accessible in the .NET framework. Therefore, I will have to write my
own parser or maybe modify cvsReader.

Reply
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules

Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Sharing read-write Excel 2003 files ttt8262 Excel Discussion (Misc queries) 0 April 1st 06 09:39 PM
Recently Used File List - 2002 Contains 'Temp' Files Keith972002 Excel Discussion (Misc queries) 0 July 26th 05 01:46 PM
merging my excel files Donna YaWanna Excel Discussion (Misc queries) 1 June 14th 05 12:53 AM
Cannot access read-only documents. tomgillane Excel Discussion (Misc queries) 14 February 7th 05 10:53 PM
Automatic Data Import TxRaistlin Excel Discussion (Misc queries) 2 February 4th 05 10:43 PM


All times are GMT +1. The time now is 03:55 AM.

Powered by vBulletin® Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 ExcelBanter.
The comments are property of their posters.
 

About Us

"It's about Microsoft Excel"