Hi there.
I need some help with CSV files. I have an application where the data
to be manipulated is in CSV files. While these files have a small number
of entries, there is no problem in searching sequentially. But these
files tend to get bigger, slowing down the app's performance. Can someone
give me any pointers on how to improve this? Database is not an option.
Any links to CSV manipulating libraries would be welcome too.
It's a web application, running on Tomcat 4.1.31.
Thanks
First, don't multipost.
> I need some help with CSV files. I have an application where the data
> to be manipulated is in CSV files. While these files have a small number
> of entries, there is no problem in searching sequentially. But these
> files tend to get bigger, slowing down the app's performance. Can someone
> give me any pointers on how to improve this?
Without getting rid of the sequential search? And for a large amount of
data? Almost impossible.
The large amount of data may prevent you from loading the whole file into
memory at once, which could somewhat speed up repeated sequential
searches. However, even then a sequential search will (slightly skewed
because of record size) take n/2 attempts on average to find a random
entry, n being the total number of records.
> Database is not an option.
That's bad. Some database technology could help here, e.g. indexing the
file in one run, and then using the index for searching. But then you
would have to keep the file and the index in synch. Something a database
is much better at than some handcrafted index mechanism.
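A minimal sketch of the "index the file in one run" idea, for illustration only. It assumes ';'-separated ASCII records whose first field is a unique key, keeps the index in memory rather than on disk, and uses modern Java syntax. The class name CsvOffsetIndex and its methods are hypothetical, not taken from any library. It does nothing to keep the index in sync when the file changes, which is exactly the hard part described above.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps the first field of each record to its byte offset.
class CsvOffsetIndex {

    // One sequential pass over the file; remembers where each record starts.
    static Map<String, Long> buildIndex(File csv) {
        Map<String, Long> index = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(csv, "r")) {
            long offset = raf.getFilePointer();
            String line;
            // RandomAccessFile.readLine is unbuffered, so this pass is slow,
            // but it only has to run once per index build.
            while ((line = raf.readLine()) != null) {
                int sep = line.indexOf(';');
                String key = sep < 0 ? line : line.substring(0, sep);
                index.put(key, offset);
                offset = raf.getFilePointer();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }

    // Jumps straight to a record using an offset taken from the index.
    static String readLineAt(File csv, long offset) {
        try (RandomAccessFile raf = new RandomAccessFile(csv, "r")) {
            raf.seek(offset);
            return raf.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a small temporary sample file so the sketch can run on its own.
    static File writeSample(String... lines) {
        try {
            File f = File.createTempFile("users", ".csv");
            f.deleteOnExit();
            Files.write(f.toPath(), Arrays.asList(lines), StandardCharsets.US_ASCII);
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File csv = writeSample("alice;secret1", "bob;secret2", "carol;secret3");
        Map<String, Long> index = buildIndex(csv);
        System.out.println(readLineAt(csv, index.get("bob")));
    }
}
```

After the single indexing pass, each lookup costs one map access and one seek instead of a scan over n/2 records on average.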
> Any links to CSV manipulating libraries would be welcome too.
Optimised libraries can only do so much. If you don't want to get rid of
the sequential search you will again hit the wall pretty fast. Just
assume you find a library which is twice as fast as your current
library, then you will be back to square one if the file size doubles.
Not a real breakthrough. And, assuming you have a good implementation,
finding a library twice as fast might be impossible.
> It's a web application, running on Tomcat 4.1.31.
It doesn't matter. The application's design is faulty. It's not the
tools, it's the design.
/Thomas

Signature
The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
marcosm - 12 Jan 2006 13:28 GMT
I do want to get rid of the sequential search. I just don't know the
best way to do it without using a database system. I know I'd have to
use binary files and implement some hashing methods. Is this correct?
Can you give me some pointers in this direction?
Sorry for the multipost. Living and learning.
Thanks
Ingo R. Homann - 12 Jan 2006 14:01 GMT
Hi,
> I do want to get rid of the sequential search. I just don't know the
> best way to do it without using a database system. I know I'd have to
> use binary files and implement some hashing methods. Is this correct?
> Can you give me some pointers in this direction?
>
> Sorry for the multipost. Living and learning.
As Thomas said: If you do not want to read the whole file to memory and
if you do not want to store an index to disk (which is one step in the
direction of implementing a database yourself) and - especially - if you
do not want to do several searches, an index won't help (because creating
an index needs a full sequential pass over the file).
Can you explain a little more what you are trying to do? Do you need
several searches, or only one? Are you "allowed" to store an index to
disk? Can the file be read into memory completely, or is it too large?
Why is a database not an option? What are the contents of the file -
could the data be stored in a better data structure (e.g. HashMap)? Do
you need full text search? ...?
Ciao,
Ingo
marcosm - 12 Jan 2006 14:30 GMT
Thanks for your answer.
Yes, I'm allowed to store an index to disk. (I think this is the idea;
I'm just not sure how to do this.)
Right now the file can be read into memory completely, but it's going
to grow. And when it does, it could easily not fit in memory. It could
contain thousands of entries (around 9,000).
I can't use a database (I know this is the best option) because of
project-specific reasons (not technical ones, of course).
The contents of the file could be stored in a HashMap, but to do that
wouldn't I still have to read the entries from the file?
I don't think I need full-text search. All I have to do is, given,
say, a username, find the entry in the CSV file associated with that
username, and then find the devices associated with that username in
another CSV file.
Thanks for your kind help so far.
Ingo R. Homann - 12 Jan 2006 15:00 GMT
Hi,
Is it necessary that the file format is CSV? If yes, why? Because other
programs will write to the file and/or the file should also be edited
manually? If yes, then you will have to face the problem that the index
file may not be up to date, which may corrupt your whole data.
If it does not have to be csv, then perhaps storing everything to a
HashMap may be the best solution. 9000 entries should be no problem at
all. Then, the Map has to be read only once at startup.
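For the second half of the lookup described earlier in the thread (finding the devices that belong to a username), the same read-once idea could look roughly like the sketch below. It assumes each line of the devices file has the form "username;device"; that layout, the class name DeviceLookup, and the sample values are assumptions, since the thread does not show the real files. It uses modern Java syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups device names by the username that owns them.
class DeviceLookup {

    // Each line is assumed to look like "username;device".
    static Map<String, List<String>> groupByUser(List<String> lines) {
        Map<String, List<String>> devicesByUser = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(";", -1);
            if (fields.length < 2) {
                continue; // skip malformed lines in this sketch
            }
            devicesByUser
                .computeIfAbsent(fields[0], k -> new ArrayList<>())
                .add(fields[1]);
        }
        return devicesByUser;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "marcosm;printer", "ingo;scanner", "marcosm;phone");
        Map<String, List<String>> devices = groupByUser(lines);
        // Users without devices get an empty list instead of null.
        System.out.println(devices.getOrDefault("marcosm", Collections.emptyList()));
    }
}
```

With around 9,000 entries, both maps would be built once at startup and every later lookup is a single hash access.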
On the other hand, as you yourself mentioned, a database would be the
best solution. If the requirement is that it must be a standalone
application and no additional database may be installed, you might
think about a Java-integrated database. There are some (although I have
not worked with any so far and cannot give you any recommendations).
Ciao,
Ingo
marcosm - 12 Jan 2006 15:13 GMT
Ingo
It doesn't necessarily have to be in CSV format. Although if it isn't,
I'm going to need an extra step to convert from CSV to another format.
Can you give some more details about the HashMap approach? I'm somewhat
familiar with the HashMap class, but I'm not sure how I'd populate it in
this case. What would the file have to look like in case I decide to
go that way?
Thanks again
Ciao
Ingo R. Homann - 13 Jan 2006 10:26 GMT
Hi,
> Ingo
>
[quoted text clipped - 6 lines]
> What would the file have to look like in case I'd decide to
> go that way?
The file can be CSV (if it is OK to read it at program startup, which
you cannot avoid without the problems mentioned in my last posting).
At program start, you read the file line by line (or perhaps by using
the csvjdbc-driver jlp mentioned) and fill the map:
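    // 'in' is assumed to be a BufferedReader on the CSV file; User is your own class.
    Map<String, User> map = new HashMap<String, User>();
    String line;
    while ((line = in.readLine()) != null) {
        // Fields are assumed to be separated by ';'.
        StringTokenizer st = new StringTokenizer(line, ";");
        String name = st.nextToken();
        String password = st.nextToken();
        ...
        User user = new User(name, password, ...);
        map.put(name, user);
    }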
Afterwards, accessing it is quite fast:
    User u = map.get("marcosm");
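For a version that compiles on its own, the loop above could be filled in roughly as follows. This is only a sketch: the User class with just two fields, the class name UserCache, and the sample data are assumptions made for illustration, and it reads from a StringReader so it runs without a file. In the real application the Reader would be a FileReader on the CSV file. It uses modern Java syntax.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical wrapper around the read-once-into-a-HashMap approach.
class UserCache {

    // Hypothetical record type with only the two fields named in the thread.
    static final class User {
        final String name;
        final String password;

        User(String name, String password) {
            this.name = name;
            this.password = password;
        }
    }

    // Reads every line once and keys each User by its name.
    static Map<String, User> load(Reader source) {
        Map<String, User> map = new HashMap<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(line, ";");
                if (st.countTokens() < 2) {
                    continue; // skip lines that lack a name or password
                }
                String name = st.nextToken();
                String password = st.nextToken();
                map.put(name, new User(name, password));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, User> users =
            load(new StringReader("marcosm;secret\ningo;hunter2\n"));
        User u = users.get("marcosm");
        System.out.println(u.name + " -> " + u.password);
    }
}
```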
Hth,
Ingo
marcosm - 13 Jan 2006 12:10 GMT
Hi,
Is there any significant difference between using StringTokenizer and
the split method of the String class?
Thanks again
marcosm
marcosm - 13 Jan 2006 12:25 GMT
Hi all
Never mind my last post; I found the answer to that question in the Java
API. Thanks all for now.
Best regards
marcosm
Roedy Green - 13 Jan 2006 20:19 GMT
>Is there any significant difference between using StringTokenizer and
>the split method of the String class?
split lets you split on more complicated patterns, not just single
separator letters.
split is easier to code.
StringTokenizer is faster.
split is only available in JDK 1.4+
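One more difference that can matter for CSV data is how the two treat empty fields: StringTokenizer skips over consecutive separators, while split with a negative limit keeps the empty strings between them. A small sketch to show this (the class name and the sample line are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class comparing the two approaches on the same line.
class SplitVsTokenizer {

    // Collects tokens the StringTokenizer way; empty fields are dropped.
    static List<String> withTokenizer(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ";");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // A negative limit makes split keep empty fields, including trailing ones.
    static List<String> withSplit(String line) {
        return Arrays.asList(line.split(";", -1));
    }

    public static void main(String[] args) {
        String line = "marcosm;;admin"; // the middle field is empty
        System.out.println(withTokenizer(line)); // prints [marcosm, admin]
        System.out.println(withSplit(line));     // prints [marcosm, , admin]
    }
}
```

If column positions matter, the tokenizer result shifts every field after an empty one, so split is usually the safer choice for records that may contain blanks.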

Signature
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
jlp - 12 Jan 2006 15:15 GMT
Look at a CSV JDBC driver at:
https://xlsql.dev.java.net/
or
http://csvjdbc.sourceforge.net/
> Thanks for your answer.
>
[quoted text clipped - 17 lines]
>
> Thanks for your kind help so far.
Boris Stumm - 12 Jan 2006 15:18 GMT
> Yes, I'm allowed to store an index to disk. (I think this is the idea;
> I'm just not sure how to do this.)
[...]
> I can't use a database (I know this is the best option) because of
> project specific reasons (not technical ones of course).
Maybe HSQLDB can help you. It is an embedded database which can also
operate on text files. According to the guide it also allows indexes
on these "text tables".
No one would see that you use a database system ;-)
Have a look:
http://hsqldb.org/
http://hsqldb.org/doc/guide/ch06.html
I myself have no experience with hsqldb, but the documentation looked
promising.
Greetings,
Boris Stumm