Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / March 2007

Tip: Looking for answers? Try searching our database.

parsing a tab delimited or CSV, but keep the delimiter

Thread view: 
Sideswipe - 22 Mar 2007 21:04 GMT
I know this question has been asked before, and believe me I checked
the newsgroup and web extensively before asking, but I think my needs
are slightly different.

I need to parse either a CSV or a Tab delimited file, BUT I need to
keep the delimiting token -- I am parsing these files as generated
from excel and the user expects them to process EXACTLY as it appears
in the spreadsheet.

I am cross posting this in the Perl and Java groups because, my
implementation is in Java, but Perl users use regexp far more
frequently.

Here are the 3 different REGEX expressions I have found /created but
none are correct. The only certainty I can get is to get rid of all
the delimiters. I have to maintain the delimiters because the
information I am accessing is column based (and thus fixed)

private final Pattern COLUMN_PATTERN = Pattern.compile("(\"[^\"]*\",,|
[^,]+)"); // I think this close
private final Pattern COLUMN_PATTERN = Pattern.compile("([^\",]*|\"([^
\"]|\"\")+\")(,)");
private final Pattern COLUMN_PATTERN = Pattern.compile(",(?=(?:[^\\\"]*
\\\"[^\\\"]*\\\")*(?![^\\\"]*\\\"))");

So, you have the cases of:

1) continuous string or with space   -> single ',' (comma) separated
2) String has a comma in it, and is "" -> it is followed by a ",,"
double comma token. So the string in "" is a token and the  double
comma is also a token
3) blank cells are just a single comma ,

That's my understanding of the cases. The logic should be IDENTICAL
for tab delimited and simply substitute characters
Sideswipe - 22 Mar 2007 21:26 GMT
It also looks like double comma before a ""  like this ,,"" -- the ,,
counts as 1 token
Chris Uppal - 22 Mar 2007 22:20 GMT
> I need to parse either a CSV or a Tab delimited file, BUT I need to
> keep the delimiting token -- I am parsing these files as generated
[quoted text clipped - 4 lines]
> implementation is in Java, but Perl users use regexp far more
> frequently.

I do not believe that regular expressions (even the weird "enhanced" regular
expressions that Java has apparently borrowed from Perl) are a suitable tool
for this job.  Even if you could find some way of expressing the complex logic
of CSV parsing using the extended regexps, your resulting code would be totally
incomprehensible.

A useful test case for your parser is that /any/ CSV file can (treated as
a single string) be used as the contents of a field within another CSV.
Embedded newlines, arbitrarily deep nesting, etc, are all allowed.  That power
hardly makes it any harder to parse CVS /unless/ you've started out by relying
on an inadequate tool like regexps.

Also, don't forget to test your code with files generated from Excel in places
where the numeric thousands separator is not ','.

   -- chris

P.S.  Although I don't much like Java's regexps (I think they are /grossly/
overused), it seems that I like them rather better than OE's spellchecker,
which suggests that replace "regexp" with "rage" and "regexps" with "creeps".
True AI is clearly only just around the corner ;-)
Daniel Pitts - 22 Mar 2007 22:30 GMT
> I know this question has been asked before, and believe me I checked
> the newsgroup and web extensively before asking, but I think my needs
[quoted text clipped - 8 lines]
> implementation is in Java, but Perl users use regexp far more
> frequently.
Cross posting is fine, but you should add a follow up header.  Which I
have (to comp.lang.java.programmer)

> Here are the 3 different REGEX expressions I have found /created but
> none are correct. The only certainty I can get is to get rid of all
[quoted text clipped - 18 lines]
> That's my understanding of the cases. The logic should be IDENTICAL
> for tab delimited and simply substitute characters

I'm not sure that a regex is good enough to do everything...

Anyway, here are the cases that I can think of, ignoring the
delimiters.

Field value:
Field value: ,
Field value: "
Field value: a,b
Field value: "a and b"
Field value: 6"3

What are the encodings of this?

I'm guessing that

Field value:
Encoded value:
Field value: ,
Encoded value: ","
Field value: "
Encoded value: ""
Field value: a,b
Encoded value: "a,b"
Field value: "a and b"
Encoded value: ""a and b""
Field value: 6"3
Encoded value: 6""3

You can verify these cases in excel.
If those ARE the correct cases, then this would work:

import java.util.List;
import java.util.ArrayList;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class ParseCSV {
   final static String quoted = "\"(:?[^\"]|\"\")+\"";
   public static List<String> parseCSV(String csv, String delim) {
       final  Pattern NEXT_COLUMN = nextColumnRegex(delim);
       final List<String> strings = new ArrayList<String>();
       final Matcher matcher = NEXT_COLUMN.matcher(csv);
       while (!matcher.hitEnd() && matcher.find()) {
           String match = matcher.group(1);
           if (match.matches(quoted)) {
               match = match.substring(1, match.length() - 1);
           }
           match = match.replaceAll("\"\"", "\"");
           strings.add(match);
       }
       return strings;
   }

   private static Pattern nextColumnRegex(String comma) {
       String unquoted = "(:?[^\"" + comma + "]|\"\")*";
       String ending = "(:?" + comma +"|$)";
       return Pattern.compile('(' + quoted + '|' + unquoted + ')' +
ending);
   }

   public static void main(String[] args) {
       String csv = ",\",\",\"\",\"a,b\",\"\"a and b\"\",6\"\"3";
       List<String> result = parseCSV(csv, ",");
       for (String col : result) {
           System.out.println("Field value:" + col);
       }
   }

}
Jürgen Exner - 22 Mar 2007 22:34 GMT
> I need to parse either a CSV or a Tab delimited file, BUT I need to
> keep the delimiting token -- I am parsing these files as generated
> from excel and the user expects them to process EXACTLY as it appears
> in the spreadsheet.

Text::CSV should do the job quite nicely for regular CSV files.

> Here are the 3 different REGEX expressions I have found /created but
> none are correct. [...]

That doesn't suprise me. While REs are quite powerful indeed, there are
many, many tasks, where you need a more complex logic than REs can provide.
Even if you have custom requirements and you need a custom version of a CSV
parser I would still start with one of the Text::CSV modules and customize
that instead of rolling my own code from scratch and repeating all the
mistakes that have been fixed in the ready-made modules a long time ago.

jue
Lew - 23 Mar 2007 22:28 GMT
>> I need to parse either a CSV or a Tab delimited file, BUT I need to
>> keep the delimiting token -- I am parsing these files as generated
>> from excel and the user expects them to process EXACTLY as it appears
>> in the spreadsheet.
>
> Text::CSV should do the job quite nicely for regular CSV files.

Are you referring to <http://www.hxtt.com/text.html>?

-- Lew
Jürgen Exner - 23 Mar 2007 23:19 GMT
>>> I need to parse either a CSV or a Tab delimited file, BUT I need to
>>> keep the delimiting token -- I am parsing these files as generated
[quoted text clipped - 4 lines]
>
> Are you referring to <http://www.hxtt.com/text.html>?

No, why would I?
Of course I am referring to the module Text::CSV (or one of its cousins):
http://search.cpan.org/search?query=text%3A%3Acsv&mode=all

jue
Lew - 23 Mar 2007 23:32 GMT
>>>> I need to parse either a CSV or a Tab delimited file, BUT I need to
>>>> keep the delimiting token -- I am parsing these files as generated
[quoted text clipped - 4 lines]
>
> No, why would I?

There were so many that came up when I googled that I just picked one at
random, on the basis that it was eight out of the first ten sites that popped
up in my search.

> Of course I am referring to the module Text::CSV (or one of its cousins):
> http://search.cpan.org/search?query=text%3A%3Acsv&mode=all

Of course! I should have /known/ that it was one of the other 68790 hits,
instead of the one that was eight out of the first ten.

-- Lew
Jürgen Exner - 23 Mar 2007 23:38 GMT
>>>>> I need to parse either a CSV or a Tab delimited file, BUT I need
>>>>> to keep the delimiting token -- I am parsing these files as
[quoted text clipped - 14 lines]
> Of course! I should have /known/ that it was one of the other 68790
> hits, instead of the one that was eight out of the first ten.

Well, sorry, but when you are talking Perl, then CPAN is the one and only
repository for modules. There may be others, there may even be commercial
ones. But CPAN is just plain the default and nobody would assume otherwise
unless you mention a different source explicitely.

jue
Lew - 24 Mar 2007 05:10 GMT
> Well, sorry, but when you are talking Perl, then CPAN is the one and only
> repository for modules. There may be others, there may even be commercial
> ones. But CPAN is just plain the default and nobody would assume otherwise
> unless you mention a different source explicitely.

You're talking Perl. I'm talking Java. That's where I read the message, in
clj.programmer, and that's what I googled. There are a lot of "registries" for
code in Java; we're not limited to just one.

Sorry us Java folks aren't always hep to the Perl culture.

-- Lew
Lew - 24 Mar 2007 05:14 GMT
> Well, sorry, but when you are talking Perl, then CPAN is the one and only
> repository for modules. There may be others, there may even be commercial
> ones. But CPAN is just plain the default and nobody would assume otherwise
> unless you mention a different source explicitely.

Well, sorry, but you're the one talking Perl. I'm talking Java. I read the
post in clj.programmer, and that's what informed my googling. In the Java
world we're not limited to only one repository for useful code. I guess us
Java folks aren't hep to Perl culture.

-- Lew
Jürgen Exner - 24 Mar 2007 05:37 GMT
>> Well, sorry, but when you are talking Perl, then CPAN is the one and
>> only repository for modules. There may be others, there may even be
>> commercial ones. But CPAN is just plain the default and nobody would
>> assume otherwise unless you mention a different source explicitely.
>
> Well, sorry, but you're the one talking Perl. I'm talking Java.

The OP posted to comp.lang.PERL.misc. So of course he gets a Perl answer...

jue
Lew - 24 Mar 2007 16:33 GMT
>>> Well, sorry, but when you are talking Perl, then CPAN is the one and
>>> only repository for modules. There may be others, there may even be
[quoted text clipped - 3 lines]
>
> The OP posted to comp.lang.PERL.misc. So of course he gets a Perl answer...

The OP cross-posted to a Perl group and a Java group, so /of course/ we have
an obligation to speak to both audiences.

-- Lew
Lew - 24 Mar 2007 16:40 GMT
>>> Well, sorry, but when you are talking Perl, then CPAN is the one and
>>> only repository for modules. There may be others, there may even be
[quoted text clipped - 5 lines]
>
> jue

The OP cross-posted to a Perl group and a Java group, so /of course/ we have
an obligation to speak to both audiences.

Furthermore, the OP stated that his implementation was in Java, so /of course/
a Perl answer was irrelevant.

/Of course/ based on that I expected answers to focus on Java solutions and
/of course/ went looking in that space for references.

-- Lew
Sherm Pendley - 24 Mar 2007 18:45 GMT
> Furthermore, the OP stated that his implementation was in Java, so /of
> course/ a Perl answer was irrelevant.

If he didn't want a Perl answer, he shouldn't have posted to a Perl group.

sherm--

Signature

Web Hosting by West Virginians, for West Virginians: http://wv-www.net
Cocoa programming in Perl: http://camelbones.sourceforge.net

Tad McClellan - 24 Mar 2007 14:30 GMT
["Followup-To:" header set to comp.lang.perl.misc.]

> Jürgen Exner wrote:
>> Well, sorry, but when you are talking Perl, then CPAN is the one and only
>> repository for modules.

> Well, sorry, but you're the one talking Perl.

That can happen in articles posted to the Perl newsgroup...

Signature

   Tad McClellan                          SGML consulting
   tadmc@augustmail.com                   Perl programming
   Fort Worth, Texas

Alex - 26 Mar 2007 08:29 GMT
> There were so many that came up when I googled that I just picked one at
> random, on the basis that it was eight out of the first ten sites that popped
> up in my search.

>> Of course I am referring to the module Text::CSV (or one of its cousins):
>> http://search.cpan.org/search?query=text%3A%3Acsv&mode=all

> Of course! I should have /known/ that it was one of the other 68790 hits,
> instead of the one that was eight out of the first ten.

If you search for "Text::CSV" on Google, the very first hit you get is a
link to the CPAN-module. The hxtt.com-link you send doesn't even contain
the search string and does not appear in the first hundred hits. So yes,
you should have /known/ that you should at least check your spelling.

Signature

Alex
e-mail: Domain is iki dot fi. Local-part is alext.
       local-part at domain

Lew - 26 Mar 2007 12:36 GMT
>> There were so many that came up when I googled that I just picked one at
>> random, on the basis that it was eight out of the first ten sites that popped
[quoted text clipped - 10 lines]
> the search string and does not appear in the first hundred hits. So yes,
> you should have /known/ that you should at least check your spelling.

I searched on "Java Text::CSV" and got the hits I said I got - none of the
first ten hits were CPAN, eight of them were HXTT. Those were my results. I
said what I said based on what I experienced. You can make all the claims you
want on what I would get if I searched; I told you what I did get when I searched.

-- Lew
Alex - 26 Mar 2007 13:22 GMT
> I searched on "Java Text::CSV" and got the hits I said I got - none of the
> first ten hits were CPAN, eight of them were HXTT. Those were my results. I
> said what I said based on what I experienced. You can make all the claims you
> want on what I would get if I searched; I told you what I did get when I searched.

Yes, but you didn't tell us /what/ you searched for. Unless stated
otherwise, it is reasonable to expect that the search string is
"Text::CSV" and nothing else. Your touchy reaction to "being expected to
know" what link to follow is irrelevant unless we know what you're
searching for. Ah well... Peace!

Signature

Alex
e-mail: Domain is iki dot fi. Local-part is alext.
       local-part at domain

Lew - 26 Mar 2007 22:16 GMT
>> I searched on "Java Text::CSV" and got the hits I said I got - none of the
>> first ten hits were CPAN, eight of them were HXTT. Those were my results. I
[quoted text clipped - 6 lines]
> know" what link to follow is irrelevant unless we know what you're
> searching for. Ah well... Peace!

And given that the OP asked for a Java solution it is reasonable to assume
that that was the space of the answer. /Of course/ I didn't think a Perl
solution was intended in that context.

I was surprised by the supercilious and contemptuous, not to say parochial
attitude of the Perl respondents to that request.

-- Lew
Charlton Wilbur - 26 Mar 2007 22:38 GMT
>>>>> "L" == Lew  <lew@nospam.lewscanon.com> writes:

   L> I was surprised by the supercilious and contemptuous, not to
   L> say parochial attitude of the Perl respondents to that request.

The Perl newsgroup and other support fora are inundated with requests
from people who think that "Perl" and "regular expression" are
synonymous, and participants in those fora grow tired of debunking
that canard and explaining parsing to people who want a quick answer.
(No, you *can't* reliably parse HTML with a single regular expression.
Not even if you beg.)

Charlton

Signature

Charlton Wilbur
cwilbur@chromatico.net

Alex - 27 Mar 2007 08:07 GMT
> And given that the OP asked for a Java solution it is reasonable to assume
> that that was the space of the answer. /Of course/ I didn't think a Perl
> solution was intended in that context.

And why? The OP was crossposted to a Perl ng, which implies - unless
stated otherwise - that a solution in either language is acceptable.
Otherwise, why crosspost?

> I was surprised by the supercilious and contemptuous, not to say parochial
> attitude of the Perl respondents to that request.

Bah! Pot, kettle, black. You excluded non-Java solutions without having
the courtesy of mentioning it and got all offended when informed that
your scope was too narrow.

Signature

Alex
e-mail: Domain is iki dot fi. Local-part is alext.
       local-part at domain

Lew - 27 Mar 2007 14:37 GMT
>> And given that the OP asked for a Java solution it is reasonable to assume
>> that that was the space of the answer. /Of course/ I didn't think a Perl
[quoted text clipped - 10 lines]
> the courtesy of mentioning it and got all offended when informed that
> your scope was too narrow.

If characterizing the conversation that way makes you feel better ...

-- Lew
Patricia Shanahan - 27 Mar 2007 22:16 GMT
>> And given that the OP asked for a Java solution it is reasonable to assume
>> that that was the space of the answer. /Of course/ I didn't think a Perl
[quoted text clipped - 3 lines]
> stated otherwise - that a solution in either language is acceptable.
> Otherwise, why crosspost?

In this particular case, it was stated otherwise. From the base message
of the thread: "I am cross posting this in the Perl and Java groups
because, my implementation is in Java, but Perl users use regexp far
more frequently."

Patricia
Tad McClellan - 27 Mar 2007 23:05 GMT
["Followup-To:" header set to comp.lang.perl.misc.]
>>> And given that the OP asked for a Java solution it is reasonable to assume
>>> that that was the space of the answer. /Of course/ I didn't think a Perl
[quoted text clipped - 8 lines]
> because, my implementation is in Java, but Perl users use regexp far
> more frequently."

I noted that in the OP and took the appropriate action some time later.

Signature

   Tad McClellan                          SGML consulting
   tadmc@augustmail.com                   Perl programming
   Fort Worth, Texas

Alex - 28 Mar 2007 07:07 GMT
> In this particular case, it was stated otherwise. From the base message
> of the thread: "I am cross posting this in the Perl and Java groups
> because, my implementation is in Java, but Perl users use regexp far
> more frequently."

I know what the OP said, and he's just stating that his implementation
is in Java. He's not saying that all the other solutions are useless.
Many of us are proficient in many languages and he might be able to
translate that solution or indeed use the language of the proposed
solution. In short: if you post in a Perl ng, you're going to get a Perl
answer. It's the same with other languages.

I'm done with this thread.

Signature

Alex
e-mail: Domain is iki dot fi. Local-part is alext.
       local-part at domain

Patricia Shanahan - 24 Mar 2007 05:31 GMT
>>>> I need to parse either a CSV or a Tab delimited file, BUT I need to
>>>> keep the delimiting token -- I am parsing these files as generated
[quoted text clipped - 6 lines]
> Of course I am referring to the module Text::CSV (or one of its cousins):
> http://search.cpan.org/search?query=text%3A%3Acsv&mode=all

There is some confusion here because of the cross posting between Java
and Perl newsgroups.

The original message says "I am cross posting this in the Perl and Java
groups because, my implementation is in Java, but Perl users use regexp
far more frequently."

I don't think that was a really good idea because of the confusion,
because it appears to assume a regular expression solution, and because
in any case many Java programmers know regular expressions.

Text::CSV is a Perl module, which I don't think will help given the
question. On the other hand, I think the principle of looking for an
existing implementation is a good one. Parsing CSV is hardly a unique
requirement.

Google for 'java csv' got several promising looking hits - I think the
OP should do the search and compare the results to the requirements.

Patricia
Stefan Ram - 24 Mar 2007 06:07 GMT
>I need to parse either a CSV or a Tab delimited file, BUT I need to

 Whatever you want to parse, you need a specification of
 the syntax - which usually is given in EBNF.

>generated from excel

 This is not such a specification.

>none are correct

 Correct accordint to which syntax specification?

>1) continuous string or with space

 What is a »continous string«? Is »continuous string or with
 space« supposed to be English?

 You do not seem to be able to describe what you need to parse
 in English, nor in ENBF. So how are you ever going to implement it?

>That's my understanding of the cases.

 My (German) version of Excel does not use the comma as a
 separator at all, but the semicolon. So, a CSV file generated
 by it for two cells containing »1« and »2«, respectively, is:

1;2

 I suggest the following reading:

http://secretgeek.net/csv_trouble.asp


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.