
auto-detecting the character set encoding of a text file

martin.gerner@gmail.com - 28 Mar 2006 23:35 GMT
Hi,

I just wanted to say that I'm new here, so I apologize in advance in
case I make any mistakes :)

My problem is that I have a bunch of text files with various
character-set encodings, and I would need a method for detecting what
encoding a certain file uses. (so that I can later open that file and
begin reading from it, using the correct encoding)

Is there some way I can do this? Some of the encodings I suspect I will
come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
rule out that others might be present.

/Martin Gerner
Roedy Green - 29 Mar 2006 03:03 GMT
>Is there some way I can do this? Some of the encodings I suspect I will
>come across are UTF-8, windows-1252 and ISO-8859-15, although I do not
>know that no others might be present.

Nothing simple like an encoding field. See
http://mindprod.com/projects/encodingidentification.html
for some approaches.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Martin Gerner - 29 Mar 2006 14:13 GMT
>>Is there some way I can do this? Some of the encodings I suspect I will
>>come across are UTF-8, windows-1252 and ISO-8859-15, although I do not
[quoted text clipped - 3 lines]
> http://mindprod.com/projects/encodingidentification.html
> for some approaches.

Unfortunately, this didn't help me much. So I take it that there is no
nifty little class I can download that will do this detection for me?

To clarify, the files I will be working with are _not_ HTML or XML files,
but rather standard-text log files from IM clients.

/Martin Gerner
Roedy Green - 29 Mar 2006 19:05 GMT
On Wed, 29 Mar 2006 13:13:40 +0000 (UTC), Martin Gerner
<martin.gerner@nospam.com> wrote, quoted or indirectly quoted someone
who said :

>Unfortunately, this didn't help me much. So I take it that there is no
>nifty little class I can download that will do this detection for me?

Exactly. It is a messy problem.

Roedy Green - 29 Mar 2006 19:24 GMT
On Wed, 29 Mar 2006 13:13:40 +0000 (UTC), Martin Gerner
<martin.gerner@nospam.com> wrote, quoted or indirectly quoted someone
who said :

>To clarify, the files I will be working with are _not_ HTML or XML files,
>but rather standard-text log files from IM clients.

If you have control over the creation of these files, you could put
the encoding name at the front of the file, followed by a \n. That would
make your job much easier. Or you could tell everyone to use UTF-8,
which would make the problem disappear.
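
A minimal sketch of the header idea in Java, assuming a made-up
convention in which the first line of the file is the charset name in
plain ASCII, ended by \n. The class name EncodingHeader and both method
names are hypothetical, and this only helps if you control the program
that writes the files.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingHeader {
    // Assumed convention (not from any standard): "<charset name>\n" in ASCII,
    // followed by the text encoded in that charset.
    static byte[] encode(String text, Charset charset) {
        byte[] header = (charset.name() + "\n").getBytes(StandardCharsets.US_ASCII);
        byte[] body = text.getBytes(charset);
        byte[] out = new byte[header.length + body.length];
        System.arraycopy(header, 0, out, 0, header.length);
        System.arraycopy(body, 0, out, header.length, body.length);
        return out;
    }

    // Reads the header line, looks up the charset, and decodes the rest.
    static String decode(byte[] data) {
        int newline = 0;
        while (newline < data.length && data[newline] != '\n') {
            newline++;
        }
        if (newline == data.length) {
            throw new IllegalArgumentException("no encoding header line found");
        }
        String name = new String(data, 0, newline, StandardCharsets.US_ASCII).trim();
        // Charset.forName throws if the name is illegal or not supported by this JVM.
        Charset charset = Charset.forName(name);
        return new String(data, newline + 1, data.length - newline - 1, charset);
    }
}
```

For real log files you would read the bytes with
java.nio.file.Files.readAllBytes and pass them to decode.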

You might also do it by tracking the source of the file. You figure
out manually which encoding each source uses over which date range.
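
The bookkeeping for this could be as small as a lookup table. The
sketch below is only an illustration: the class SourceEncodingTable,
its methods, and the idea of keying on a source name plus an inclusive
date range are assumptions, and the entries have to be worked out by
hand.

```java
import java.nio.charset.Charset;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class SourceEncodingTable {
    // One manually determined fact: this source used this charset in this period.
    private static final class Rule {
        final String source;
        final LocalDate from;
        final LocalDate to;
        final Charset charset;

        Rule(String source, LocalDate from, LocalDate to, Charset charset) {
            this.source = source;
            this.from = from;
            this.to = to;
            this.charset = charset;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Both dates are inclusive. Charset.forName throws for unsupported names.
    void add(String source, LocalDate from, LocalDate to, String charsetName) {
        rules.add(new Rule(source, from, to, Charset.forName(charsetName)));
    }

    // Returns the first matching rule's charset, or empty if nothing is known.
    Optional<Charset> lookup(String source, LocalDate date) {
        for (Rule r : rules) {
            if (r.source.equals(source) && !date.isBefore(r.from) && !date.isAfter(r.to)) {
                return Optional.of(r.charset);
            }
        }
        return Optional.empty();
    }
}
```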

The habit of not recording the encoding goes way back. The idea was
that documents were local and all encoded the same way. You did not
exchange documents with others, or if you did, you exchanged a whole
tape full, all encoded the same way, so again the problem of
identification did not come up.

Thomas Weidenfeller - 29 Mar 2006 14:32 GMT
> My problem is that I have a bunch of text files with various
> character-set encodings, and I would need a method for detecting what
[quoted text clipped - 4 lines]
> come across are UTF-8, windows-1252 and ISO-8859-15, although I do not
> know that no others might be present.

You can't, in general. To be sure, you have to know the encodings.
You can apply some heuristics to guess an encoding, but it will remain a guess.
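
One such heuristic, sketched in Java for the three encodings named in
the question. It is a guess, not a detector: the class EncodingGuesser
and its method are hypothetical, and the rules are assumptions. Bytes
that decode as strict UTF-8 are taken to be UTF-8. Otherwise, any byte
in the range 0x80-0x9F suggests windows-1252, because ISO-8859-15 only
has rarely used control characters there. Everything else falls back to
ISO-8859-15, although such data could equally be windows-1252.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingGuesser {
    // Charset.forName throws if this JVM does not provide the charset.
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");
    private static final Charset ISO_8859_15 = Charset.forName("ISO-8859-15");

    static Charset guess(byte[] data) {
        if (isValidUtf8(data)) {
            // Pure ASCII also ends up here, which is harmless: ASCII bytes
            // decode identically in all three candidate encodings.
            return StandardCharsets.UTF_8;
        }
        for (byte b : data) {
            int unsigned = b & 0xFF;
            if (unsigned >= 0x80 && unsigned <= 0x9F) {
                // Printable in windows-1252 (curly quotes, euro sign, ...),
                // C1 control characters in ISO-8859-15.
                return WINDOWS_1252;
            }
        }
        return ISO_8859_15; // fallback only; not distinguishable by this test
    }

    // Strict decoding: report malformed input instead of silently replacing it.
    private static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The returned Charset can then be passed to an InputStreamReader or to
new String(bytes, charset) when opening the file for real.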

/Thomas
Signature

The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/


