Java Forum / General / October 2007


determining character encoding format of a file

Alan - 06 Oct 2007 21:51 GMT
Is there any easy way to determine what character encoding format
(e.g., UTF-8) a text file uses?

Thanks, Alan
RedGrittyBrick - 07 Oct 2007 01:00 GMT
>     Is there any easy way to determine what character encoding format
> (e.g., UTF-8) a text file uses?

Easy? Not in general.

<http://developers.sun.com/global/technology/standards/reference/faqs/determining-file-encoding.html>
<http://codesnipers.com/?q=node/68>
Arne Vajhøj - 07 Oct 2007 01:32 GMT
>     Is there any easy way to determine what character encoding format
> (e.g., UTF-8) a text file uses?

Not in general.

For ISO-8859-1 versus UTF-8 in a Western language, you can make
an educated guess.

See the attached code as a starting point (note that the
code is designed to identify text in Danish).

Arne

=============================

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CharSetGuesser {
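    // Guesses between ISO-8859-1 and UTF-8 by counting byte values
    // that are typical of Danish text in each encoding.
    public static String guess(String filename) throws IOException {
        int[] freq = new int[256];
        // try-with-resources (Java 7+) closes the stream even if read() throws
        try (InputStream is = new FileInputStream(filename)) {
            int c;
            while((c = is.read()) >= 0) {
                freq[c]++;
            }
        }
        // First group: single-byte ISO-8859-1 codes of Å Æ È É Ë Ø å æ è é ë ø.
        // Second group: the bytes the same letters produce in UTF-8, i.e. the
        // continuation bytes 133..184 plus the shared lead byte 195 (0xC3).
        if((freq[197] + freq[198] + freq[200] +
            freq[201] + freq[203] + freq[216] +
            freq[229] + freq[230] + freq[232] +
            freq[233] + freq[235] + freq[248]) >
           (freq[133] + freq[134] + freq[136] +
            freq[137] + freq[139] + freq[152] +
            freq[165] + freq[166] + freq[168] +
            freq[169] + freq[171] + freq[184] +
            freq[195])) {
            return "ISO-8859-1";
        } else {
            return "UTF-8";
        }
    }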
    public static void main(String[] args) throws Exception {
        System.out.println(guess("C:\\iso-8859-1.txt"));
        System.out.println(guess("C:\\utf-8.txt"));
    }
}
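
A complementary check to the frequency heuristic above, offered only as a sketch and not as part of the original post: ask a strict UTF-8 decoder whether the bytes are well-formed. ISO-8859-1 text containing accented letters will usually fail this test, while pure ASCII passes under both encodings. The class and method names (Utf8Check, isValidUtf8) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the bytes decode as UTF-8 with no malformed sequence.
    public static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A Danish word written with escapes so the source file encoding does not matter
        String word = "bl\u00e5b\u00e6rgr\u00f8d";
        System.out.println(isValidUtf8(word.getBytes(StandardCharsets.UTF_8)));      // true
        System.out.println(isValidUtf8(word.getBytes(StandardCharsets.ISO_8859_1))); // false
    }
}
```

A file that passes is not guaranteed to be UTF-8, but a file that fails is almost certainly something else.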
Alan - 07 Oct 2007 01:49 GMT
Thank you.  Actually, my interest is in Arabic.
Arne Vajhøj - 07 Oct 2007 04:32 GMT
>    Actually, my interest is in Arabic.

:-)

Try taking a relevant text, storing it in the relevant encodings,
and then doing some statistics on the bytes to see if there are
some simple rules that can identify the encoding.

Arne
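
The experiment suggested above might be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: the candidate encodings (UTF-8, windows-1256, ISO-8859-6) are a guess at what is relevant for Arabic, the sample string is a single short word, and the names EncodingStats and byteCounts are hypothetical. A real test would use a much longer text.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.TreeMap;

public class EncodingStats {
    // Counts how often each byte value (0..255) occurs when the text
    // is encoded with the given charset.
    public static Map<Integer, Integer> byteCounts(String text, Charset cs) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (byte b : text.getBytes(cs)) {
            counts.merge(b & 0xFF, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A short Arabic word, written with escapes (assumed sample text)
        String sample = "\u0633\u0644\u0627\u0645";
        // Candidate encodings are an assumption; adjust to the ones you expect
        String[] candidates = {"UTF-8", "windows-1256", "ISO-8859-6"};
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported by this JVM");
                continue;
            }
            System.out.println(name + ": " + byteCounts(sample, Charset.forName(name)));
        }
    }
}
```

Comparing the printed histograms should show which byte values dominate in each encoding, and those are the values a guesser like CharSetGuesser would count.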
Mike Schilling - 07 Oct 2007 08:09 GMT
>    Is there any easy way to determine what character encoding format
> (e.g., UTF-8) a text file uses?

Some UTF-8 files (esp. on Microsoft OSs) start with the Byte Order Mark (BOM),
which is the Unicode character U+FEFF, encoded in UTF-8.  Other than that,
no.
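
As a rough sketch of the BOM check: U+FEFF encoded in UTF-8 is the three bytes EF BB BF, so it is enough to look at the start of the file. The names BomCheck, startsWithUtf8Bom and hasUtf8Bom are hypothetical, and a missing BOM says nothing either way, since many UTF-8 files are written without one.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BomCheck {
    // True if the buffer begins with EF BB BF, the UTF-8 encoding of U+FEFF.
    public static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
    }

    // Reads up to the first three bytes of the file and tests them for the BOM.
    public static boolean hasUtf8Bom(String filename) throws IOException {
        try (InputStream in = new FileInputStream(filename)) {
            byte[] head = new byte[3];
            int n = 0;
            while (n < 3) {
                int r = in.read(head, n, 3 - n);
                if (r < 0) {
                    break; // file is shorter than three bytes
                }
                n += r;
            }
            return n == 3 && startsWithUtf8Bom(head);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the file to inspect as the first argument
        System.out.println(hasUtf8Bom(args[0]));
    }
}
```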

