Java Forum / General / October 2005

Recognising file type (ascii/binary)

Bruce Lee - 27 Oct 2005 16:05 GMT
Is there any easy way to get Java to determine whether a file is a binary
file or plain text ascii file?

Matt Humphrey - 27 Oct 2005 23:01 GMT
> Is there any easy way to get Java to determine whether a file is a binary
> file or plain text ascii file?

Files are simply sequences of (binary) bytes--there's no way to tell whether
it's supposed to contain only bytes that represent printable ascii (or
unicode) or any particular binary pattern.  You can read the file to find
out--if you find values that signify unlikely or non-printable characters
you can deem the file binary or corrupt.  Similarly, there are heuristics
(based on convention) for guessing the "type" of the file based on the first
few bytes, but there's no guarantee these are correct either. (And files
with 2-byte UNICODE characters can really confuse things.)
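
The "first few bytes" heuristic can be sketched roughly as follows. This is only an illustration: the class name `MagicSniffer`, the method `guessType` and the short list of signatures are assumptions made for the example, not a complete registry, and a match is still only a guess about what the file really is.

```java
// Hypothetical helper: guesses a file type from well-known leading bytes.
final class MagicSniffer {
    // A few widely documented signatures ("magic numbers").
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04}; // also .jar files
    private static final byte[] JAVA_CLASS =
            {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};

    /** Returns a short type label, or "unknown" if no signature matches. */
    static String guessType(byte[] head) {
        if (startsWith(head, PNG)) return "png";
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, ZIP)) return "zip";
        if (startsWith(head, JAVA_CLASS)) return "class";
        return "unknown"; // no guarantee either way
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

The caller would read the first handful of bytes of the file and pass them in; "unknown" says nothing about whether the content is text.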

Of course, you could require that text files end in "txt" or something--it's
no worse than any of the above and significantly easier.

What are you trying to do?

Cheers,

Oliver Wong - 28 Oct 2005 21:15 GMT
>> Is there any easy way to get Java to determine whether a file is a binary
>> file or plain text ascii file?
[quoted text clipped - 10 lines]
> Of course, you could require that text files end in "txt" or
> something--it's no worse than any of the above and significantly easier.

   Matt Humphrey is completely correct. However, as an additional check on
the heuristic of looking for unprintable characters, another trick is to
check whether the newline string is consistent. It should always be either
"\n" (for UNIX-like systems), "\r" (for classic Mac OS) or "\r\n" (for
Windows-like systems). If the file switches between these, it probably
isn't a valid ASCII text file on any of the above three platforms.
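
A minimal sketch of the newline-consistency check, assuming the data has already been read into a byte array. The class and method names are made up for the example.

```java
// Hypothetical helper: checks that only one line-ending style is used.
final class NewlineCheck {
    /** Returns true if the data uses at most one of "\n", "\r", "\r\n". */
    static boolean hasConsistentNewlines(byte[] data) {
        int lf = 0, cr = 0, crlf = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\r') {
                if (i + 1 < data.length && data[i + 1] == '\n') {
                    crlf++;
                    i++; // consume the '\n' that belongs to this pair
                } else {
                    cr++;
                }
            } else if (data[i] == '\n') {
                lf++;
            }
        }
        int styles = (lf > 0 ? 1 : 0) + (cr > 0 ? 1 : 0) + (crlf > 0 ? 1 : 0);
        return styles <= 1;
    }
}
```

One caveat: if only the start of a file is examined and the buffer happens to end between the "\r" and the "\n" of a pair, the trailing "\r" is counted as a lone carriage return, so a Windows-style file could be reported as mixed.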

   You could also disregard 2-byte UNICODE characters as being "non-ASCII",
and lump them in with the category of "binary files".

   - Oliver

Bruce Lee - 30 Oct 2005 05:24 GMT
> > Is there any easy way to get Java to determine whether a file is a binary
> > file or plain text ascii file?
[quoted text clipped - 14 lines]
>
> Cheers,

I want to tell whether the content at a URL is binary or not, without relying
on the header.

I'm using something like this:

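// needs: import java.io.IOException;
//        import java.io.InputStream;
//        import java.net.URL;

protected boolean isBinary(String url) {

    boolean isbin = false;
    InputStream in = null;

    try {
        in = new URL(url).openStream();

        // Do a peek. Read raw bytes rather than going through a Reader, so
        // the platform's default charset doesn't change what gets counted.
        byte[] buf = new byte[255];
        int n = in.read(buf, 0, buf.length); // may be fewer than 255, or -1

        int prob_bin = 0;
        for (int i = 0; i < n; i++) {
            int j = buf[i] & 0xFF; // treat the byte as unsigned

            // Chinese and other non-ASCII text will probably be flagged as
            // binary too; ideally this needs another check. Note that tab,
            // CR and LF are below 32 and are counted as well.
            if (j < 32 || j > 127) {
                prob_bin++;
            }
        }

        // Divide by the number of bytes actually read, not the buffer size.
        if (n > 0) {
            double pb = (double) prob_bin / n;
            if (pb > 0.5) {
                // System.out.println("probably binary at " + pb);
                isbin = true;
            }
        }

    } catch (IOException ee) {
        System.out.println("WARN! Couldn't read content for isBinary(): " + url);
        isbin = false; // error - likely broken link - so return false
    } finally {
        // Close exactly once, and only if the stream was actually opened.
        try {
            if (in != null) in.close();
        } catch (IOException ignored) {}
    }

    System.out.println("url isBinary():" + url + ":" + isbin);
    return isbin;
}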

I read somewhere that finding \n's might work as well.

Also, are ASCII 7bit and binary 8bit or something? Is there a way to find
this out - like analyse a byte?

Oliver Wong - 31 Oct 2005 19:26 GMT
> Also, are ASCII 7bit and binary 8bit or something?

   There is no "bit length" associated with the concept of "binary". The
question is equivalent to "Is decimal 5 digits long or 7 digits long?" A
number written in decimal can be any number of digits long.
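
   On the ASCII half of the question: ASCII defines code points 0 to 127, which fit in 7 bits, so a byte with its top bit set cannot be an ASCII character. A rough sketch of "analysing a byte" in that sense (the class and method names are invented for the example):

```java
// Hypothetical helper: tests whether bytes fit in the 7-bit ASCII range.
final class AsciiBits {
    /** True if the top bit is clear, i.e. the value is in 0..127. */
    static boolean isSevenBit(byte b) {
        return (b & 0x80) == 0;
    }

    /** True if every byte in the array is in 0..127. */
    static boolean allSevenBit(byte[] data) {
        for (byte b : data) {
            if (!isSevenBit(b)) return false;
        }
        return true;
    }
}
```

   This only shows that the bytes could be ASCII. It says nothing about whether they were meant to be text, and control characters such as 0x00 pass the test too.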

> Is there a way to find
> this out - like analyse a byte?

   This is reminiscent of a discussion Roedy and I had about ASCII versus
binary formats. My position was that all data stored on a computer is stored
in binary (i.e. they are stored using bits), and one form of binary encoding
is called "ASCII". It was a poor choice of wording to use "binary" to
mean "non-ASCII".

   I'm assuming you don't directly care whether a given bitstream is ASCII
or non-ASCII; rather, you want this information so that you can solve
another problem. What is the real problem you are trying to solve? Perhaps
we can offer you solutions that don't involve distinguishing between ASCII
and non-ASCII bitstreams.

   - Oliver

* The reason you may want to avoid distinguishing ASCII and non-ASCII
bitstreams is that in general, it is completely impossible. There may exist
binary file formats out there which, given appropriate data to represent,
yield bits which can legally be decoded into only printable characters using
the ASCII table, even though the semantic information in the file was never
meant to be text.

Roedy Green - 30 Oct 2005 06:53 GMT
On Thu, 27 Oct 2005 15:05:19 GMT, "Bruce Lee"
<blah@blahbllbllahblah.com> wrote, quoted or indirectly quoted someone
who said :

>Is there any easy way to get Java to determine whether a file is a binary
>file or plain text ascii file?

A practical test is to scan the first N bytes for a 0.  If you find
one it is a binary, if not text.
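
A minimal sketch of that zero-byte scan, assuming the caller has already read the start of the file into a byte array. The names `NulScan`, `looksBinary` and `limit` are invented for the example.

```java
// Hypothetical helper: scans the first bytes of a buffer for a zero byte.
final class NulScan {
    /**
     * Returns true if any of the first 'limit' bytes is 0.
     * If the buffer is shorter than 'limit', the whole buffer is scanned.
     */
    static boolean looksBinary(byte[] head, int limit) {
        int n = Math.min(limit, head.length);
        for (int i = 0; i < n; i++) {
            if (head[i] == 0) return true;
        }
        return false;
    }
}
```

Note that this test classifies UTF-16 text as binary, because ASCII-range characters encoded in UTF-16 carry a zero byte each.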

It actually becomes a judgment call.

Let us say you define a text file as containing only 7-bit ASCII, no
control chars but \t space \n \r.

Then you find an 0x01 char somewhere in the file.  Does that make it a
binary format?
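
That strict definition could be sketched as below. Rather than returning a yes/no verdict, this version reports where the first offending byte is and leaves the judgment call to the caller. The class and method names are assumptions made for the example.

```java
// Hypothetical helper: applies the strict "7-bit ASCII, no control chars
// except tab, LF and CR" definition of a text file.
final class StrictAscii {
    /** True for printable ASCII (space through '~') and for \t, \n, \r. */
    static boolean isAllowed(int b) {
        return (b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\n' || b == '\r';
    }

    /** Returns the index of the first disallowed byte, or -1 if none. */
    static int firstViolation(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if (!isAllowed(data[i] & 0xFF)) return i;
        }
        return -1;
    }
}
```

A caller might tolerate one or two violations in a large file and still call it text, or reject the file on the first one; the code cannot decide that.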

Unfortunately not all OS's track the format/MIME etc of each file.
There is no universal scheme of embedded id signatures. It is a mess.
You have to do something seat of the pants yourself.

You can't even tell which encoding is used for a pure text file.
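
A small illustration of the encoding problem: the same two bytes are valid in both UTF-8 and ISO-8859-1 but decode to different text, and nothing in the bytes says which was intended. The class name is invented for the example, and `StandardCharsets` requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes the same bytes under two different encodings.
final class EncodingAmbiguity {
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

For the bytes 0xC3 0xA9, `asUtf8` gives the single character U+00E9 (e with acute accent), while `asLatin1` gives the two characters U+00C3 and U+00A9.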

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.


