Java Forum / General / March 2007
parsing a tab delimited or CSV, but keep the delimiter
Sideswipe - 22 Mar 2007 21:04 GMT I know this question has been asked before, and believe me I checked the newsgroup and web extensively before asking, but I think my needs are slightly different.
I need to parse either a CSV or a Tab delimited file, BUT I need to keep the delimiting token -- I am parsing these files as generated from excel and the user expects them to process EXACTLY as it appears in the spreadsheet.
I am cross posting this in the Perl and Java groups because, my implementation is in Java, but Perl users use regexp far more frequently.
Here are the 3 different REGEX expressions I have found /created but none are correct. The only certainty I can get is to get rid of all the delimiters. I have to maintain the delimiters because the information I am accessing is column based (and thus fixed)
private final Pattern COLUMN_PATTERN = Pattern.compile("(\"[^\"]*\",,| [^,]+)"); // I think this close private final Pattern COLUMN_PATTERN = Pattern.compile("([^\",]*|\"([^ \"]|\"\")+\")(,)"); private final Pattern COLUMN_PATTERN = Pattern.compile(",(?=(?:[^\\\"]* \\\"[^\\\"]*\\\")*(?![^\\\"]*\\\"))");
So, you have the cases of:
1) continuous string or with space -> single ',' (comma) separated 2) String has a comma in it, and is "" -> it is followed by a ",," double comma token. So the string in "" is a token and the double comma is also a token 3) blank cells are just a single comma ,
That's my understanding of the cases. The logic should be IDENTICAL for tab delimited and simply substitute characters
Sideswipe - 22 Mar 2007 21:26 GMT It also looks like double comma before a "" like this ,,"" -- the ,, counts as 1 token
Chris Uppal - 22 Mar 2007 22:20 GMT > I need to parse either a CSV or a Tab delimited file, BUT I need to > keep the delimiting token -- I am parsing these files as generated [quoted text clipped - 4 lines] > implementation is in Java, but Perl users use regexp far more > frequently. I do not believe that regular expressions (even the weird "enhanced" regular expressions that Java has apparently borrowed from Perl) are a suitable tool for this job. Even if you could find some way of expressing the complex logic of CSV parsing using the extended regexps, your resulting code would be totally incomprehensible.
A useful test case for your parser is that /any/ CSV file can (treated as a single string) be used as the contents of a field within another CSV. Embedded newlines, arbitrarily deep nesting, etc, are all allowed. That power hardly makes it any harder to parse CVS /unless/ you've started out by relying on an inadequate tool like regexps.
Also, don't forget to test your code with files generated from Excel in places where the numeric thousands separator is not ','.
-- chris
P.S. Although I don't much like Java's regexps (I think they are /grossly/ overused), it seems that I like them rather better than OE's spellchecker, which suggests that replace "regexp" with "rage" and "regexps" with "creeps". True AI is clearly only just around the corner ;-)
Daniel Pitts - 22 Mar 2007 22:30 GMT > I know this question has been asked before, and believe me I checked > the newsgroup and web extensively before asking, but I think my needs [quoted text clipped - 8 lines] > implementation is in Java, but Perl users use regexp far more > frequently. Cross posting is fine, but you should add a follow up header. Which I have (to comp.lang.java.programmer)
> Here are the 3 different REGEX expressions I have found /created but > none are correct. The only certainty I can get is to get rid of all [quoted text clipped - 18 lines] > That's my understanding of the cases. The logic should be IDENTICAL > for tab delimited and simply substitute characters I'm not sure that a regex is good enough to do everything...
Anyway, here are the cases that I can think of, ignoring the delimiters.
Field value: Field value: , Field value: " Field value: a,b Field value: "a and b" Field value: 6"3
What are the encodings of this?
I'm guessing that
Field value: Encoded value: Field value: , Encoded value: "," Field value: " Encoded value: "" Field value: a,b Encoded value: "a,b" Field value: "a and b" Encoded value: ""a and b"" Field value: 6"3 Encoded value: 6""3
You can verify these cases in excel. If those ARE the correct cases, then this would work:
import java.util.List; import java.util.ArrayList; import java.util.regex.Pattern; import java.util.regex.Matcher;
public class ParseCSV { final static String quoted = "\"(:?[^\"]|\"\")+\""; public static List<String> parseCSV(String csv, String delim) { final Pattern NEXT_COLUMN = nextColumnRegex(delim); final List<String> strings = new ArrayList<String>(); final Matcher matcher = NEXT_COLUMN.matcher(csv); while (!matcher.hitEnd() && matcher.find()) { String match = matcher.group(1); if (match.matches(quoted)) { match = match.substring(1, match.length() - 1); } match = match.replaceAll("\"\"", "\""); strings.add(match); } return strings; }
private static Pattern nextColumnRegex(String comma) { String unquoted = "(:?[^\"" + comma + "]|\"\")*"; String ending = "(:?" + comma +"|$)"; return Pattern.compile('(' + quoted + '|' + unquoted + ')' + ending); }
public static void main(String[] args) { String csv = ",\",\",\"\",\"a,b\",\"\"a and b\"\",6\"\"3"; List<String> result = parseCSV(csv, ","); for (String col : result) { System.out.println("Field value:" + col); } }
}
Jürgen Exner - 22 Mar 2007 22:34 GMT > I need to parse either a CSV or a Tab delimited file, BUT I need to > keep the delimiting token -- I am parsing these files as generated > from excel and the user expects them to process EXACTLY as it appears > in the spreadsheet. Text::CSV should do the job quite nicely for regular CSV files.
> Here are the 3 different REGEX expressions I have found /created but > none are correct. [...] That doesn't suprise me. While REs are quite powerful indeed, there are many, many tasks, where you need a more complex logic than REs can provide. Even if you have custom requirements and you need a custom version of a CSV parser I would still start with one of the Text::CSV modules and customize that instead of rolling my own code from scratch and repeating all the mistakes that have been fixed in the ready-made modules a long time ago.
jue
Lew - 23 Mar 2007 22:28 GMT >> I need to parse either a CSV or a Tab delimited file, BUT I need to >> keep the delimiting token -- I am parsing these files as generated >> from excel and the user expects them to process EXACTLY as it appears >> in the spreadsheet. > > Text::CSV should do the job quite nicely for regular CSV files. Are you referring to <http://www.hxtt.com/text.html>?
-- Lew
Jürgen Exner - 23 Mar 2007 23:19 GMT >>> I need to parse either a CSV or a Tab delimited file, BUT I need to >>> keep the delimiting token -- I am parsing these files as generated [quoted text clipped - 4 lines] > > Are you referring to <http://www.hxtt.com/text.html>? No, why would I? Of course I am referring to the module Text::CSV (or one of its cousins): http://search.cpan.org/search?query=text%3A%3Acsv&mode=all
jue
Lew - 23 Mar 2007 23:32 GMT >>>> I need to parse either a CSV or a Tab delimited file, BUT I need to >>>> keep the delimiting token -- I am parsing these files as generated [quoted text clipped - 4 lines] > > No, why would I? There were so many that came up when I googled that I just picked one at random, on the basis that it was eight out of the first ten sites that popped up in my search.
> Of course I am referring to the module Text::CSV (or one of its cousins): > http://search.cpan.org/search?query=text%3A%3Acsv&mode=all Of course! I should have /known/ that it was one of the other 68790 hits, instead of the one that was eight out of the first ten.
-- Lew
Jürgen Exner - 23 Mar 2007 23:38 GMT >>>>> I need to parse either a CSV or a Tab delimited file, BUT I need >>>>> to keep the delimiting token -- I am parsing these files as [quoted text clipped - 14 lines] > Of course! I should have /known/ that it was one of the other 68790 > hits, instead of the one that was eight out of the first ten. Well, sorry, but when you are talking Perl, then CPAN is the one and only repository for modules. There may be others, there may even be commercial ones. But CPAN is just plain the default and nobody would assume otherwise unless you mention a different source explicitely.
jue
Lew - 24 Mar 2007 05:10 GMT > Well, sorry, but when you are talking Perl, then CPAN is the one and only > repository for modules. There may be others, there may even be commercial > ones. But CPAN is just plain the default and nobody would assume otherwise > unless you mention a different source explicitely. You're talking Perl. I'm talking Java. That's where I read the message, in clj.programmer, and that's what I googled. There are a lot of "registries" for code in Java; we're not limited to just one.
Sorry us Java folks aren't always hep to the Perl culture.
-- Lew
Lew - 24 Mar 2007 05:14 GMT > Well, sorry, but when you are talking Perl, then CPAN is the one and only > repository for modules. There may be others, there may even be commercial > ones. But CPAN is just plain the default and nobody would assume otherwise > unless you mention a different source explicitely. Well, sorry, but you're the one talking Perl. I'm talking Java. I read the post in clj.programmer, and that's what informed my googling. In the Java world we're not limited to only one repository for useful code. I guess us Java folks aren't hep to Perl culture.
-- Lew
Jürgen Exner - 24 Mar 2007 05:37 GMT >> Well, sorry, but when you are talking Perl, then CPAN is the one and >> only repository for modules. There may be others, there may even be >> commercial ones. But CPAN is just plain the default and nobody would >> assume otherwise unless you mention a different source explicitely. > > Well, sorry, but you're the one talking Perl. I'm talking Java. The OP posted to comp.lang.PERL.misc. So of course he gets a Perl answer...
jue
Lew - 24 Mar 2007 16:33 GMT >>> Well, sorry, but when you are talking Perl, then CPAN is the one and >>> only repository for modules. There may be others, there may even be [quoted text clipped - 3 lines] > > The OP posted to comp.lang.PERL.misc. So of course he gets a Perl answer... The OP cross-posted to a Perl group and a Java group, so /of course/ we have an obligation to speak to both audiences.
-- Lew
Lew - 24 Mar 2007 16:40 GMT >>> Well, sorry, but when you are talking Perl, then CPAN is the one and >>> only repository for modules. There may be others, there may even be [quoted text clipped - 5 lines] > > jue The OP cross-posted to a Perl group and a Java group, so /of course/ we have an obligation to speak to both audiences.
Furthermore, the OP stated that his implementation was in Java, so /of course/ a Perl answer was irrelevant.
/Of course/ based on that I expected answers to focus on Java solutions and /of course/ went looking in that space for references.
-- Lew
Sherm Pendley - 24 Mar 2007 18:45 GMT > Furthermore, the OP stated that his implementation was in Java, so /of > course/ a Perl answer was irrelevant. If he didn't want a Perl answer, he shouldn't have posted to a Perl group.
sherm--
 Signature Web Hosting by West Virginians, for West Virginians: http://wv-www.net Cocoa programming in Perl: http://camelbones.sourceforge.net
Tad McClellan - 24 Mar 2007 14:30 GMT ["Followup-To:" header set to comp.lang.perl.misc.]
> Jürgen Exner wrote: >> Well, sorry, but when you are talking Perl, then CPAN is the one and only >> repository for modules.
> Well, sorry, but you're the one talking Perl. That can happen in articles posted to the Perl newsgroup...
 Signature Tad McClellan SGML consulting tadmc@augustmail.com Perl programming Fort Worth, Texas
Alex - 26 Mar 2007 08:29 GMT > There were so many that came up when I googled that I just picked one at > random, on the basis that it was eight out of the first ten sites that popped > up in my search.
>> Of course I am referring to the module Text::CSV (or one of its cousins): >> http://search.cpan.org/search?query=text%3A%3Acsv&mode=all
> Of course! I should have /known/ that it was one of the other 68790 hits, > instead of the one that was eight out of the first ten. If you search for "Text::CSV" on Google, the very first hit you get is a link to the CPAN-module. The hxtt.com-link you send doesn't even contain the search string and does not appear in the first hundred hits. So yes, you should have /known/ that you should at least check your spelling.
 Signature Alex e-mail: Domain is iki dot fi. Local-part is alext. local-part at domain
Lew - 26 Mar 2007 12:36 GMT >> There were so many that came up when I googled that I just picked one at >> random, on the basis that it was eight out of the first ten sites that popped [quoted text clipped - 10 lines] > the search string and does not appear in the first hundred hits. So yes, > you should have /known/ that you should at least check your spelling. I searched on "Java Text::CSV" and got the hits I said I got - none of the first ten hits were CPAN, eight of them were HXTT. Those were my results. I said what I said based on what I experienced. You can make all the claims you want on what I would get if I searched; I told you what I did get when I searched.
-- Lew
Alex - 26 Mar 2007 13:22 GMT > I searched on "Java Text::CSV" and got the hits I said I got - none of the > first ten hits were CPAN, eight of them were HXTT. Those were my results. I > said what I said based on what I experienced. You can make all the claims you > want on what I would get if I searched; I told you what I did get when I searched. Yes, but you didn't tell us /what/ you searched for. Unless stated otherwise, it is reasonable to expect that the search string is "Text::CSV" and nothing else. Your touchy reaction to "being expected to know" what link to follow is irrelevant unless we know what you're searching for. Ah well... Peace!
 Signature Alex e-mail: Domain is iki dot fi. Local-part is alext. local-part at domain
Lew - 26 Mar 2007 22:16 GMT >> I searched on "Java Text::CSV" and got the hits I said I got - none of the >> first ten hits were CPAN, eight of them were HXTT. Those were my results. I [quoted text clipped - 6 lines] > know" what link to follow is irrelevant unless we know what you're > searching for. Ah well... Peace! And given that the OP asked for a Java solution it is reasonable to assume that that was the space of the answer. /Of course/ I didn't think a Perl solution was intended in that context.
I was surprised by the supercilious and contemptuous, not to say parochial attitude of the Perl respondents to that request.
-- Lew
Charlton Wilbur - 26 Mar 2007 22:38 GMT >>>>> "L" == Lew <lew@nospam.lewscanon.com> writes: L> I was surprised by the supercilious and contemptuous, not to L> say parochial attitude of the Perl respondents to that request.
The Perl newsgroup and other support fora are inundated with requests from people who think that "Perl" and "regular expression" are synonymous, and participants in those fora grow tired of debunking that canard and explaining parsing to people who want a quick answer. (No, you *can't* reliably parse HTML with a single regular expression. Not even if you beg.)
Charlton
 Signature Charlton Wilbur cwilbur@chromatico.net
Alex - 27 Mar 2007 08:07 GMT > And given that the OP asked for a Java solution it is reasonable to assume > that that was the space of the answer. /Of course/ I didn't think a Perl > solution was intended in that context. And why? The OP was crossposted to a Perl ng, which implies - unless stated otherwise - that a solution in either language is acceptable. Otherwise, why crosspost?
> I was surprised by the supercilious and contemptuous, not to say parochial > attitude of the Perl respondents to that request. Bah! Pot, kettle, black. You excluded non-Java solutions without having the courtesy of mentioning it and got all offended when informed that your scope was too narrow.
 Signature Alex e-mail: Domain is iki dot fi. Local-part is alext. local-part at domain
Lew - 27 Mar 2007 14:37 GMT >> And given that the OP asked for a Java solution it is reasonable to assume >> that that was the space of the answer. /Of course/ I didn't think a Perl [quoted text clipped - 10 lines] > the courtesy of mentioning it and got all offended when informed that > your scope was too narrow. If characterizing the conversation that way makes you feel better ...
-- Lew
Patricia Shanahan - 27 Mar 2007 22:16 GMT >> And given that the OP asked for a Java solution it is reasonable to assume >> that that was the space of the answer. /Of course/ I didn't think a Perl [quoted text clipped - 3 lines] > stated otherwise - that a solution in either language is acceptable. > Otherwise, why crosspost? In this particular case, it was stated otherwise. From the base message of the thread: "I am cross posting this in the Perl and Java groups because, my implementation is in Java, but Perl users use regexp far more frequently."
Patricia
Tad McClellan - 27 Mar 2007 23:05 GMT ["Followup-To:" header set to comp.lang.perl.misc.]
>>> And given that the OP asked for a Java solution it is reasonable to assume >>> that that was the space of the answer. /Of course/ I didn't think a Perl [quoted text clipped - 8 lines] > because, my implementation is in Java, but Perl users use regexp far > more frequently." I noted that in the OP and took the appropriate action some time later.
 Signature Tad McClellan SGML consulting tadmc@augustmail.com Perl programming Fort Worth, Texas
Alex - 28 Mar 2007 07:07 GMT > In this particular case, it was stated otherwise. From the base message > of the thread: "I am cross posting this in the Perl and Java groups > because, my implementation is in Java, but Perl users use regexp far > more frequently." I know what the OP said, and he's just stating that his implementation is in Java. He's not saying that all the other solutions are useless. Many of us are proficient in many languages and he might be able to translate that solution or indeed use the language of the proposed solution. In short: if you post in a Perl ng, you're going to get a Perl answer. It's the same with other languages.
I'm done with this thread.
 Signature Alex e-mail: Domain is iki dot fi. Local-part is alext. local-part at domain
Patricia Shanahan - 24 Mar 2007 05:31 GMT >>>> I need to parse either a CSV or a Tab delimited file, BUT I need to >>>> keep the delimiting token -- I am parsing these files as generated [quoted text clipped - 6 lines] > Of course I am referring to the module Text::CSV (or one of its cousins): > http://search.cpan.org/search?query=text%3A%3Acsv&mode=all There is some confusion here because of the cross posting between Java and Perl newsgroups.
The original message says "I am cross posting this in the Perl and Java groups because, my implementation is in Java, but Perl users use regexp far more frequently."
I don't think that was a really good idea because of the confusion, because it appears to assume a regular expression solution, and because in any case many Java programmers know regular expressions.
Text::CSV is a Perl module, which I don't think will help given the question. On the other hand, I think the principle of looking for an existing implementation is a good one. Parsing CSV is hardly a unique requirement.
Google for 'java csv' got several promising looking hits - I think the OP should do the search and compare the results to the requirements.
Patricia
Stefan Ram - 24 Mar 2007 06:07 GMT >I need to parse either a CSV or a Tab delimited file, BUT I need to Whatever you want to parse, you need a specification of the syntax - which usually is given in EBNF.
>generated from excel This is not such a specification.
>none are correct Correct accordint to which syntax specification?
>1) continuous string or with space What is a »continous string«? Is »continuous string or with space« supposed to be English?
You do not seem to be able to describe what you need to parse in English, nor in ENBF. So how are you ever going to implement it?
>That's my understanding of the cases. My (German) version of Excel does not use the comma as a separator at all, but the semicolon. So, a CSV file generated by it for two cells containing »1« and »2«, respectively, is:
1;2
I suggest the following reading:
http://secretgeek.net/csv_trouble.asp
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|