Home |
Search |
Today's Posts |
#1
Posted to microsoft.public.excel.programming,microsoft.public.excel.misc
|
|||
|
|||
Parse csv files
Here is what I want to do. From the internet I download some data that
are in csv format. All data will be in one long string. Now I need to extract every cell. The problem ist that some cell content contains commas and/or double quotes. Some of the cell contents that contain double quotes use double double quotes and others don't, i.e. some look like "this "item" is bad", "this item is ok" and others like "this ""item"" is bad", "this item is ok" There is also a chance that some cell are in double quotes (if they contains commas or double quotes) and others are not in double quotes (if they do not contain commas or double quotes). Considering all this (and possibly more stuff) parsing becomes non trivial. As a first approach I stored the content downloaded into a file and then use odbc like this: connectionString = @"Driver={Microsoft Text Driver (*.txt; *.csv)};DBQ=" + Path.GetDirectoryName(filename); connection = new OdbcConnection(connectionString); connection.Open(); command = new OdbcCommand("Select * FROM " + Path.GetFileName(filename), connection); reader = command.ExecuteReader(); Unfortunately, this approach does not work for the above scenarios. Excel reads the files in question just fine, though. So my question is what is the best approach to read csv files, preferably without having to create temporary files? Thanks |
#2
Posted to microsoft.public.excel.misc
|
|||
|
|||
Parse csv files
My understanding is that a .csv file has data fields separated by a comma character, and if the field contains a comma then that field is enclosed in quotes (being double-quotes), and if the field contains quotes then those are indicated by two consecutive quotes. The site http://en.wikipedia.org/wiki/Comma-separated_values perhaps better explains .csv files, and has a pointer to required drivers. Hope this helps. -- rob Wrote: Here is what I want to do. From the internet I download some data that are in csv format. All data will be in one long string. Now I need to extract every cell. The problem ist that some cell content contains commas and/or double quotes. Some of the cell contents that contain double quotes use double double quotes and others don't, i.e. some look like "this "item" is bad", "this item is ok" and others like "this ""item"" is bad", "this item is ok" There is also a chance that some cell are in double quotes (if they contains commas or double quotes) and others are not in double quotes (if they do not contain commas or double quotes). Considering all this (and possibly more stuff) parsing becomes non trivial. As a first approach I stored the content downloaded into a file and then use odbc like this: connectionString = @"Driver={Microsoft Text Driver (*.txt; *.csv)};DBQ=" + Path.GetDirectoryName(filename); connection = new OdbcConnection(connectionString); connection.Open(); command = new OdbcCommand("Select * FROM " + Path.GetFileName(filename), connection); reader = command.ExecuteReader(); Unfortunately, this approach does not work for the above scenarios. Excel reads the files in question just fine, though. So my question is what is the best approach to read csv files, preferably without having to create temporary files? Thanks -- Bryan Hessey ------------------------------------------------------------------------ Bryan Hessey's Profile: http://www.excelforum.com/member.php...o&userid=21059 View this thread: http://www.excelforum.com/showthread...hreadid=537641 |
#3
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
|
|||
|
|||
Parse csv files
Bryan,
This is also my understanding of csv files (though some people say that for double quotes you don't need double double quotes). In any case, the problem is that most cells use double double quotes for double quotes. Unfortunately, some cells don't seem to be formated right and don't enclose double quotes into double double quotes as outlined in my example. I know that Excel can load it just fine but with odbc it does not work. Any idea if the parsing algorithm used by Excel is somehow accessible through the .NET framework (I was hoping Excel uses odbc). Thanks Bryan Hessey wrote: My understanding is that a .csv file has data fields separated by a comma character, and if the field contains a comma then that field is enclosed in quotes (being double-quotes), and if the field contains quotes then those are indicated by two consecutive quotes. The site http://en.wikipedia.org/wiki/Comma-separated_values perhaps better explains .csv files, and has a pointer to required drivers. Hope this helps. -- rob Wrote: Here is what I want to do. From the internet I download some data that are in csv format. All data will be in one long string. Now I need to extract every cell. The problem ist that some cell content contains commas and/or double quotes. Some of the cell contents that contain double quotes use double double quotes and others don't, i.e. some look like "this "item" is bad", "this item is ok" and others like "this ""item"" is bad", "this item is ok" There is also a chance that some cell are in double quotes (if they contains commas or double quotes) and others are not in double quotes (if they do not contain commas or double quotes). Considering all this (and possibly more stuff) parsing becomes non trivial. As a first approach I stored the content downloaded into a file and then use odbc like this: connectionString = @"Driver={Microsoft Text Driver (*.txt; *.csv)};DBQ=" + Path.GetDirectoryName(filename); connection = new OdbcConnection(connectionString); connection.Open(); command = new OdbcCommand("Select * FROM " + Path.GetFileName(filename), connection); reader = command.ExecuteReader(); Unfortunately, this approach does not work for the above scenarios. Excel reads the files in question just fine, though. So my question is what is the best approach to read csv files, preferably without having to create temporary files? |
#4
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
|
|||
|
|||
Parse csv files
Bryan,
This is also my understanding of csv files (though some people say that for double quotes you don't need double double quotes). In any case, the problem is that most cells use double double quotes for double quotes. Unfortunately, some cells don't seem to be formated right and don't enclose double quotes into double double quotes as outlined in my example. I know that Excel can load it just fine but with odbc it does not work. Any idea if the parsing algorithm used by Excel is somehow accessible through the .NET framework (I was hoping Excel uses odbc). Thanks Bryan Hessey wrote: My understanding is that a .csv file has data fields separated by a comma character, and if the field contains a comma then that field is enclosed in quotes (being double-quotes), and if the field contains quotes then those are indicated by two consecutive quotes. The site http://en.wikipedia.org/wiki/Comma-separated_values perhaps better explains .csv files, and has a pointer to required drivers. Hope this helps. -- rob Wrote: Here is what I want to do. From the internet I download some data that are in csv format. All data will be in one long string. Now I need to extract every cell. The problem ist that some cell content contains commas and/or double quotes. Some of the cell contents that contain double quotes use double double quotes and others don't, i.e. some look like "this "item" is bad", "this item is ok" and others like "this ""item"" is bad", "this item is ok" There is also a chance that some cell are in double quotes (if they contains commas or double quotes) and others are not in double quotes (if they do not contain commas or double quotes). Considering all this (and possibly more stuff) parsing becomes non trivial. As a first approach I stored the content downloaded into a file and then use odbc like this: connectionString = @"Driver={Microsoft Text Driver (*.txt; *.csv)};DBQ=" + Path.GetDirectoryName(filename); connection = new OdbcConnection(connectionString); connection.Open(); command = new OdbcCommand("Select * FROM " + Path.GetFileName(filename), connection); reader = command.ExecuteReader(); Unfortunately, this approach does not work for the above scenarios. Excel reads the files in question just fine, though. So my question is what is the best approach to read csv files, preferably without having to create temporary files? |
#5
Posted to microsoft.public.excel.misc
|
|||
|
|||
Parse csv files
Beyond my scope, but http://www.creativyst.com/Doc/Articl...m#CSVariations and it's comments on Excel might suggest that it doesn't. -- rob Wrote: Bryan, This is also my understanding of csv files (though some people say that for double quotes you don't need double double quotes). In any case, the problem is that most cells use double double quotes for double quotes. Unfortunately, some cells don't seem to be formated right and don't enclose double quotes into double double quotes as outlined in my example. I know that Excel can load it just fine but with odbc it does not work. Any idea if the parsing algorithm used by Excel is somehow accessible through the .NET framework (I was hoping Excel uses odbc). Thanks Bryan Hessey wrote: My understanding is that a .csv file has data fields separated by a comma character, and if the field contains a comma then that field is enclosed in quotes (being double-quotes), and if the field contains quotes then those are indicated by two consecutive quotes. The site http://en.wikipedia.org/wiki/Comma-separated_values perhaps better explains .csv files, and has a pointer to required drivers. Hope this helps. -- rob Wrote: Here is what I want to do. From the internet I download some data that are in csv format. All data will be in one long string. Now I need to extract every cell. The problem ist that some cell content contains commas and/or double quotes. Some of the cell contents that contain double quotes use double double quotes and others don't, i.e. some look like "this "item" is bad", "this item is ok" and others like "this ""item"" is bad", "this item is ok" There is also a chance that some cell are in double quotes (if they contains commas or double quotes) and others are not in double quotes (if they do not contain commas or double quotes). Considering all this (and possibly more stuff) parsing becomes non trivial. As a first approach I stored the content downloaded into a file and then use odbc like this: connectionString = @"Driver={Microsoft Text Driver (*.txt; *.csv)};DBQ=" + Path.GetDirectoryName(filename); connection = new OdbcConnection(connectionString); connection.Open(); command = new OdbcCommand("Select * FROM " + Path.GetFileName(filename), connection); reader = command.ExecuteReader(); Unfortunately, this approach does not work for the above scenarios. Excel reads the files in question just fine, though. So my question is what is the best approach to read csv files, preferably without having to create temporary files? -- Bryan Hessey ------------------------------------------------------------------------ Bryan Hessey's Profile: http://www.excelforum.com/member.php...o&userid=21059 View this thread: http://www.excelforum.com/showthread...hreadid=537641 |
#6
Posted to microsoft.public.excel.misc,microsoft.public.excel.programming
|
|||
|
|||
Parse csv files
An (Excel) CSV file is:
1. simply a text file with columns delimited by a comma 2. simply a text file with columns delimited by a comma and text values enclosed in double quotes. In 1, a missing value is indicated by two contiguous commas, and in 2. a missing value is indicated by ,"", Your CSV file is much more complicated (dos not follow the patter 1 or 2) and you appear to be using ADO.NET. Although you have ruled it out, you might have to pre-process your file as 1 or 2 above before you can execute SQL queries. You might find that when the SQL query works, it ignores the first row (unless it contains or you add column names in the first row). "rob" wrote: Here is what I want to do. From the internet I download some data that are in csv format. All data will be in one long string. Now I need to extract every cell. The problem ist that some cell content contains commas and/or double quotes. Some of the cell contents that contain double quotes use double double quotes and others don't, i.e. some look like "this "item" is bad", "this item is ok" and others like "this ""item"" is bad", "this item is ok" There is also a chance that some cell are in double quotes (if they contains commas or double quotes) and others are not in double quotes (if they do not contain commas or double quotes). Considering all this (and possibly more stuff) parsing becomes non trivial. As a first approach I stored the content downloaded into a file and then use odbc like this: connectionString = @"Driver={Microsoft Text Driver (*.txt; *.csv)};DBQ=" + Path.GetDirectoryName(filename); connection = new OdbcConnection(connectionString); connection.Open(); command = new OdbcCommand("Select * FROM " + Path.GetFileName(filename), connection); reader = command.ExecuteReader(); Unfortunately, this approach does not work for the above scenarios. Excel reads the files in question just fine, though. So my question is what is the best approach to read csv files, preferably without having to create temporary files? Thanks |
#7
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
|
|||
|
|||
Parse csv files
"rob" wrote in message ups.com... Bryan, This is also my understanding of csv files (though some people say that for double quotes you don't need double double quotes). In any case, the problem is that most cells use double double quotes for double quotes. Unfortunately, some cells don't seem to be formated right and don't enclose double quotes into double double quotes as outlined in my example. I know that Excel can load it just fine but with odbc it does not work. Any idea if the parsing algorithm used by Excel is somehow accessible through the .NET framework (I was hoping Excel uses odbc). Thanks Is this CSV file(s) going to be an on-going thing? If not, and since you say it works fine in Excel, how about opening it in Excel and then re-exporting it to either another CSV file or TabDelimited file and then running your code against that? That way, you may be able to get the data out in the formatting you need for your application. james |
#8
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
|
|||
|
|||
Parse csv files
This will be an ongoing thing. I was thinking about opening it in Excel
(programatically) and store it with tabs but there is a chance some of the cell content contains tabs as well. Another option would be to open the file in Excel (programatically) and then read out the cells. This is a huge overkill, though. Worse it's something that will run on a server. Although it usually runs during off-peak times I still don't like the idea doing it this way. Therefore, I might really end up writing my own parser. Thanks for the input, though. It's definitely appreciated. |
#9
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
|
|||
|
|||
Parse csv files
"rob" wrote in message ups.com... This will be an ongoing thing. I was thinking about opening it in Excel (programatically) and store it with tabs but there is a chance some of the cell content contains tabs as well. Another option would be to open the file in Excel (programatically) and then read out the cells. This is a huge overkill, though. Worse it's something that will run on a server. Although it usually runs during off-peak times I still don't like the idea doing it this way. Therefore, I might really end up writing my own parser. Thanks for the input, though. It's definitely appreciated. Your welcome. Sorry I couldn't offer any better suggestion. It does sound as though you will have to write some sort of parser yourself. The scary part is you have no control over how the original file is created and what sort of input it allows. And with that being the case, it seems that writing a parser routine that will consistantly return the correct output, will be tough. Especially if the app will be running unattended. Otherwise you could write an editor of sorts that would allow a user to make some decisions on what to keep and what to discard. That would go a long way towards increasing the dependability of the app. (as long as the end uers understands what needs to be retained and what needs to be discarded) I don't envy you on this one!!! james |
#10
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
|
|||
|
|||
Parse csv files
Try this:
http://www.codeproject.com/cs/database/CsvReader.asp Cheers, Jason On 1 May 2006 09:34:19 -0700, rob wrote: This will be an ongoing thing. I was thinking about opening it in Excel (programatically) and store it with tabs but there is a chance some of the cell content contains tabs as well. Another option would be to open the file in Excel (programatically) and then read out the cells. This is a huge overkill, though. Worse it's something that will run on a server. Although it usually runs during off-peak times I still don't like the idea doing it this way. Therefore, I might really end up writing my own parser. Thanks for the input, though. It's definitely appreciated. |
#11
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
|
|||
|
|||
Parse csv files
Jason,
Thanks a lot for this link. It sounds promising and I'll give it a try. I hope that'll safe my day...:) |
#12
Posted to microsoft.public.excel.misc,microsoft.public.dotnet.general
|
|||
|
|||
Parse csv files
I did try the CsvReader and it fails with the exact same item as in the
odbc approach. One additional thing I realized is that many items/cells do have words in double quotes. As far as I know double quotes should be put in double double quotes. The data I have sometimes does that but often it does not. Both approaches (odbc, cvsReader) work fine (can read though don't render correctly) if the cell in question is the last one. If it is not the last one both odbc and cvsReader get thrown off, though. So the conclusion here is that from all I know only Excel can handle things in a way that works out but that parser is not directly accessible in the .NET framework. Therefore, I will have to write my own parser or maybe modify cvsReader. |
Reply |
Thread Tools | Search this Thread |
Display Modes | |
|
|
Similar Threads | ||||
Thread | Forum | |||
Sharing read-write Excel 2003 files | Excel Discussion (Misc queries) | |||
Recently Used File List - 2002 Contains 'Temp' Files | Excel Discussion (Misc queries) | |||
merging my excel files | Excel Discussion (Misc queries) | |||
Cannot access read-only documents. | Excel Discussion (Misc queries) | |||
Automatic Data Import | Excel Discussion (Misc queries) |