Java Forum / General / June 2006
Stream and Encoding Confusion
Rhino - 05 Jun 2006 00:28 GMT A friend and I are having a friendly competition that is causing me some conceptual confusion. I am hoping someone can help me clarify things a little.
We are each writing programs to read an input file and count the number of each distinct character in the file; he is writing his program in Perl and I am writing mine in Java. The main output of the program will be a simple list that says the program found so many of each character; we want to report the letters of the alphabet as well as accented letters, punctuation, and whitespace characters, including carriage returns and linefeeds. We have two input files at the moment, a text file and an MP3 file. There is no money or serious rivalry involved; we are simply curious about how each will look if properly written. We also wonder how the performance will compare, although that is quite unimportant to both of us.
I have a couple of areas of confusion: a. character streams vs. byte streams b. the issue of encoding.
Since I'd like to be able to read any type of file in any language, including text files, MP3s, and many others, should I always be treating the input file as a character stream or do I need to somehow detect which ones are best read as character streams and which are best read as byte streams? If I need to treat the two types differently, how do I detect which type the input file is? I would rather not rely on the user knowing whether a file that he wants to give the program is best suited to being treated as a character stream or a byte stream. I've read the conceptual information about this in the Java Tutorial and find that it really doesn't address this issue clearly.
I'm also somewhat concerned about encoding. I honestly don't understand exactly how encoding works and apologize if this is a dumb question, but this seemed like a good place to get someone to point me to a proper discussion of this issue. Do I need to know how a file is encoded before I open it and decide which kind of stream it is? Or is there some way to determine what encoding the file is using by simply examining the file? Again, I want to be able to read a file and count the characters without the provider of the file having to tell me what encoding it uses since the provider, quite likely, wouldn't know.
-- Rhino
Matt Humphrey - 05 Jun 2006 02:30 GMT >A friend and I are having a friendly competition that is causing me some >conceptual confusion. I am hoping someone can help me clarify things a [quoted text clipped - 36 lines] > the provider of the file having to tell me what encoding it uses since the > provider, quite likely, wouldn't know. This issue was addressed not long ago here:
http://groups.google.com/group/comp.lang.java.programmer/browse_thread/thread/1d2a1d6bb48b681/08095f861a95f75a?lnk=st&q=%22recognising+file+type%22+group%3Acomp.lang.java.programmer&rnum=1&hl=en#08095f861a95f75a
You can find more about character encoding here:
http://mindprod.com/jgloss/encoding.html
In summary, there is no way to perfectly distinguish between character and non-character data and you must be able to distinguish them in order to use the right kind of stream. All data is binary data. By convention (common agreement) some binary patterns are used to represent text, characters (including digits), numbers of all types (not digits), application data structures, etc. In particular, the conventions for characters are given names that identify the encoding--the mapping of byte values or code point numbers to specific logical characters. You can always read binary data, but you must know the encoding in order to make any sense out of it.
What makes this problem troublesome is that the identity of the encoding is not in the data itself. Well, it is for some (e.g. XML has an encoding attribute and some kinds of application data files like GIF start with a specific 4-byte signature), but because it's not there for all of them there is no reliable way to distinguish whether you have (for example) an MP3 file or some other unknown type of data file. Often you're better off just checking the file extension, although that won't tell you the text encoding.
But with some smart decisions, you can often guess reasonably (if imperfectly) at the format. Check out the links above to see how this might work.
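To make the "guess from a signature" idea concrete, here is a minimal sketch in Java. It is not the code from the linked thread; the class name SignatureSniffer and the returned labels are made up for illustration. It only checks a few well-known leading byte patterns (the "GIF8" prefix shared by GIF87a and GIF89a, the "ID3" prefix of an ID3v2 tag, and the UTF-8 and UTF-16 byte order marks), and every answer it gives is only a guess.

```java
final class SignatureSniffer {
    /** Returns true if data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Guesses a format from the first few bytes of a file. The labels are illustrative only. */
    static String guess(byte[] head) {
        if (startsWith(head, 'G', 'I', 'F', '8')) return "probably a GIF image";
        if (startsWith(head, 'I', 'D', '3')) return "probably an MP3 with an ID3v2 tag";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "probably UTF-8 text with a BOM";
        if (startsWith(head, 0xFE, 0xFF)) return "probably UTF-16 big-endian text";
        if (startsWith(head, 0xFF, 0xFE)) return "probably UTF-16 little-endian text";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(guess(new byte[] {'G', 'I', 'F', '8', '9', 'a'}));
        System.out.println(guess(new byte[] {'h', 'e', 'l', 'l', 'o'}));
    }
}
```

Note how easily this is fooled: a binary file that merely happens to start with the bytes FF FE would be reported as UTF-16 text, which is exactly the imperfection described above.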
The bottom line is that you have to know the encoding in order to read the file. To simplify your problem, you could, for example, limit yourselves to one of the standard encodings (UTF-8, UTF-16) and work just with text. Or designate one or two kinds of specific file types, e.g. MP3. No one has a general-purpose interpreter that will give the correct answer to this question for every file everywhere. And if they did I can make it give the wrong answer by cooking up a data file for any new format that happens to correspond to any existing format.
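As a small illustration of why the encoding has to be known, the sketch below decodes the same two bytes under three different encodings and prints the resulting code units. The class and method names are invented for this example; only the standard java.nio.charset classes are assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SameBytesDifferentText {
    /** Decodes the bytes with the given charset and lists the resulting UTF-16 code units. */
    static String describe(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (char c : new String(bytes, charset).toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The same two bytes, interpreted three ways.
        byte[] data = {(byte) 0xC3, (byte) 0xA9};
        Charset[] candidates = {
            StandardCharsets.ISO_8859_1, // two characters: U+00C3 U+00A9
            StandardCharsets.UTF_8,      // one character: U+00E9 (e with acute accent)
            StandardCharsets.UTF_16BE    // one character: U+C3A9
        };
        for (Charset cs : candidates) {
            System.out.println(cs.name() + ": " + describe(data, cs));
        }
    }
}
```

None of the three readings is "wrong" as far as the bytes are concerned; only knowledge of how the file was written can pick the right one.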
Cheers, Matt Humphrey matth@ivizNOSPAM.com http://www.iviz.com/
Chris Smith - 05 Jun 2006 02:58 GMT > A friend and I are having a friendly competition that is causing me some > conceptual confusion. I am hoping someone can help me clarify things a [quoted text clipped - 3 lines] > each distinct character in the in the file; he is writing his program in > Perl and I am writing mine in Java. So clearly, the first thing you need to do is discover which character encoding your friend and you are agreeing to read. Some probable answers include:
1. ASCII, which is "US-ASCII" as a Java encoding string. However, this encoding has no accented characters.
2. ISO 8859-1, which amounts to just the lower 256 characters from the Unicode character set, adopted as a trivial single-byte encoding. Easy enough.
3. The platform default, which is what you get if you create an InputStreamReader without an encoding parameter. This is a bad idea for files, so one would hope that it's not the choice. But hey, the world is an imperfect place.
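For what it's worth, here is a rough sketch of counting characters with the encoding named explicitly instead of relying on the platform default. The class name CharCounter is hypothetical, and the sample input is an in-memory byte array standing in for a file. It counts UTF-16 code units, so characters outside the Basic Multilingual Plane would show up as two surrogates.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.Map;
import java.util.TreeMap;

final class CharCounter {
    /** Counts each char obtained by decoding the stream with the named encoding. */
    static Map<Character, Integer> count(InputStream in, String encodingName) {
        Map<Character, Integer> counts = new TreeMap<>();
        // Passing the charset explicitly avoids the platform-default behaviour.
        try (Reader reader = new InputStreamReader(in, Charset.forName(encodingName))) {
            int c;
            while ((c = reader.read()) != -1) {
                counts.merge((char) c, 1, Integer::sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "cafe" with an accented final letter, as encoded in ISO 8859-1.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println(count(new ByteArrayInputStream(latin1), "ISO-8859-1"));
        // Under US-ASCII the byte 0xE9 is not valid, so the reader
        // substitutes the replacement character U+FFFD for it.
        System.out.println(count(new ByteArrayInputStream(latin1), "US-ASCII"));
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by a FileInputStream; the encoding name would still have to come from somewhere, such as a command-line argument.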
> We have two input files at the moment, a text file and an MP3 > file. Note that it is inherently nonsensical to talk about counting the number of characters, letters, accented letters, etc in an MP3 file. There is simply no meaning to those words; sorta like me telling you that I'm going to count the number of rocks that exist in the conceptual idea of peace. The MP3 file contains only bytes, and they cannot be validly interpreted as characters. The fact that this is part of your friendly competition does not bode well.
You need to find out what is meant by this. Perhaps what is meant is "if I were to pretend that the MP3 file were text, and open it using <insert some editor>, how many accented characters would I see?" In that case, you'd need to experiment with that editor and see what encoding it assumes when opening a text file. If your encoding does not cover all possible byte sequences (for example, UTF-8 will fail to decode certain combinations of trailing bytes at the end of a string), you should find out what the application is supposed to do.
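To show what "fail to decode" can look like in practice, here is a minimal sketch that asks a UTF-8 decoder to report malformed input rather than silently replace it. The class and method names are invented for this example; what the real program should do on failure is still a requirements question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    /** Returns true only if the whole array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for malformed sequences, including a multi-byte
            // sequence that is cut off at the end of the input.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC3, (byte) 0xA9})); // complete sequence
        System.out.println(isValidUtf8(new byte[] {'a', (byte) 0xC3}));         // truncated sequence
    }
}
```

An MP3 file would almost certainly fail this check somewhere, whereas ISO 8859-1 as implemented in Java accepts every byte value and so never reports an error.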
I can almost guarantee, of course, that you'll discover that your friend hasn't thought about any of these things. Nevertheless, you need to know them in order to write the desired code.
> Since I'd like to be able to read any type of file in any language, > including text files, MP3s, and many others, should I always be treating the > input file as a character stream or do I need to somehow detect which ones > are best read as character streams and which are best read as byte streams? Matt posted a link to a very simple piece of code for estimating a rough score that correlates with the likelihood that a file is encoded in ASCII (or, really, any ASCII superset). There's no real correct answer here, though.
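This is not the code Matt linked to, but a scoring heuristic of that general kind might look roughly like the sketch below. The class name and the choice of which bytes count as "text-like" are assumptions made for illustration.

```java
final class AsciiScore {
    /** Fraction (0.0 to 1.0) of bytes that are printable ASCII, tab, CR or LF. */
    static double score(byte[] data) {
        if (data.length == 0) {
            return 0.0; // nothing to judge
        }
        int textLike = 0;
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            if ((v >= 0x20 && v <= 0x7E) || v == '\t' || v == '\n' || v == '\r') {
                textLike++;
            }
        }
        return (double) textLike / data.length;
    }

    public static void main(String[] args) {
        System.out.println(score(new byte[] {'h', 'e', 'l', 'l', 'o', '\n'}));
        System.out.println(score(new byte[] {0, 1, 2, 'a'}));
    }
}
```

A score near 1.0 suggests ASCII-compatible text and a low score suggests binary data, but where to draw the line is a judgment call, and text in UTF-16 would score poorly despite being perfectly good text.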
> Do I need to know how a file is encoded before I open it and > decide which kind of stream it is? Yes. Again, you could scan through the file and try to estimate, but you'd need some complex statistical methods and results from simple linguistics of various human languages, for example, to tell the difference between the various ISO 8859 encodings. It won't be easy. You are supposed to know in advance.
 Signature Chris Smith - Lead Software Developer / Technical Trainer MindIQ Corporation
Chris Smith - 05 Jun 2006 03:07 GMT > So clearly, the first thing you need to do is discover which character > encoding your friend and you are agreeing to read. It's worth noting, in case it wasn't clear, that this is not an issue with the Java programming language or API. It is the requirements, not the implementation, that are unclear. If some other API fails to make it clear that this decision must be made, then the fault lies in that other API.
 Signature Chris Smith - Lead Software Developer / Technical Trainer MindIQ Corporation
Matt Humphrey - 05 Jun 2006 15:59 GMT >> A friend and I are having a friendly competition that is causing me some >> conceptual confusion. I am hoping someone can help me clarify things a [quoted text clipped - 4 lines] >> each distinct character in the in the file; he is writing his program in >> Perl and I am writing mine in Java. <snip some bits about encodings>
>> We have two input files at the moment, a text file and an MP3 >> file. [quoted text clipped - 6 lines] > interpreted as characters. The fact that this is part of your friendly > competition does not bode well. I certainly agree that it's nonsensical to interpret an arbitrary file (of unknown type) as characters. Do you mean to say that MP3 contains no character data at all? I would expect the title and artist strings to be in there somewhere. Plenty of specific data structures (e.g. Excel, Word, GIFs) have parts that can be legitimately decoded as character data. Of course, to count those characters you would have to know in advance the file structure, where to find the strings and what their encoding is, which is where this whole thing started.
Cheers, Matt Humphrey matth@ivizNOSPAM.com http://www.iviz.com/
Chris Smith - 05 Jun 2006 16:31 GMT > I certainly agree that it's nonsensical to interpret an arbitrary file (of > unknown type) as characters. Do you mean to say that MP3 contains no [quoted text clipped - 4 lines] > structure, where to find the strings and what their encoding is, which is > where this whole thing started. Oddly enough, I stayed up most of the night thinking about that paragraph that I wrote! :) I no longer agree with it.
First and foremost, the meaning was intended to be more like: it is inherently non-sensical to interpret an MP3 file as if the file itself were a stream of characters. There is potentially some character data in an MP3, though I don't personally know whether an MP3 includes the title or author or not. That character data may well be compressed, of course, so you are perhaps unlikely to get to it merely by reading the file as if it were ASCII or something like that.
Going further, though, it's not necessarily *inherently* non-sensical to do so. It's merely that the necessary character encodings to do so -- in a way that would be sensical -- would not reside within the standard set recognized or implemented by the Java programming language. Reading an MP3 file in ISO 8859-1 or the like is non-sensical, but only because ANY encoding exhibits sense only when the software that created the file was aware of the same encoding. I would be shocked to find that there's any reasonable piece of knowledge that can be gained from knowing how many characters belong to the set of accented letters in the ISO 8859-1 interpretation of an MP3 file. However, it is of course possible to construct a meaningful textual representation of the data contained within an MP3 file, and for at least certain straight-forward ways of doing so (in fact, I tentatively believe until someone provides a good reason to the contrary, for all ways of doing so), the result may be reasonably described as a complex kind of character encoding.
In other words, I fear that I overstated the uniqueness of textual data. In fact, there's nothing special about text at all; it's just yet another in the infinite list of semantic interpretations that can be assigned to any binary file. It just happens to be common enough that Java provides implementations for some types of character encodings.
So, to be entirely clear, there is NOTHING special about text. However, you probably don't want to read an MP3 file as if it were text; and if you do, the Java standard API doesn't provide the tools to do so in particularly useful ways.
 Signature Chris Smith - Lead Software Developer / Technical Trainer MindIQ Corporation
Oliver Wong - 05 Jun 2006 17:53 GMT >>> We are each writing programs to read an input file and count the number >>> of >>> each distinct character in the in the file [...]
>>> We have two input files at the moment, a text file and an MP3 >>> file. [quoted text clipped - 15 lines] > file structure, where to find the strings and what their encoding is, > which is where this whole thing started. My first interpretation of the requirements was that the entire mp3 file be read in as a stream of bytes, and then somehow decoded into a sequence of characters. It's certainly "do-able", but it's also nonsensical, as Chris pointed out.
Of course, your interpretation, Matt, of decoding the ID3 data is feasible as well. Rhino might want to get some clarification (or clarify for us, if he already knows) on this point.
- Oliver
Chris Uppal - 05 Jun 2006 11:10 GMT > We are each writing programs to read an input file and count the number of > each distinct character in the in the file; he is writing his program in [quoted text clipped - 5 lines] > file. There is no money or serious rivalry invoved; we are simply curious > about how each will look if properly written. Expanding a bit on the earlier replies...
I suggest you change the challenge a little. As it stands -- and as Chris Smith has explained -- it simply isn't a coherent task. As such it can't have a "properly written" solution in Java; the nearest you can get is a badly designed program which doesn't do what it might look (to the naive) as if it's doing.
Exactly the same problem applies to the Perl program. I don't know enough about modern Perl to know whether it is even /possible/ to solve it correctly in Perl. I'm pretty sure it was impossible when I last looked at Perl (and shuddered and looked away again quick), but Perl has changed a lot since then.
As I said, I suggest you change the terms of the challenge. Maybe the following (which /is/ well-posed) would suit you and your friend (or rival ;-).
0) Given a file, produce a list of how often each byte value occurs in it. I.e. interpret it as binary. (You may have to agree in advance on whether you treat bytes as signed or unsigned).
1) Given a file, /and/ an assumption about its character encoding, produce a simple list [..etc...] Presumably the encoding would be specified on the command-line along with the file name. I suggest that you make UTF-16 the default (which may help you to keep the difference between the binary bytes in the file, and their interpretation as characters, clear in your mind).
2) (For extra credit ;-) Given a file, attempt to guess what encoding it is in, using whatever heuristics come to mind.
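To sketch what task 0 might look like in Java: the class name ByteHistogram is made up for this example, and it assumes the two of you agree to treat bytes as unsigned (0 to 255). The sample input is an in-memory array standing in for a file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ByteHistogram {
    /** Returns a 256-entry table; entry v holds how often unsigned byte value v occurred. */
    static long[] count(InputStream in) {
        long[] counts = new long[256];
        byte[] buffer = new byte[8192];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                for (int i = 0; i < n; i++) {
                    // Java bytes are signed; masking maps -128..127 onto 0..255.
                    counts[buffer[i] & 0xFF]++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, (byte) 0xFF, (byte) 0xFF, 'A'};
        long[] counts = count(new ByteArrayInputStream(sample));
        for (int v = 0; v < counts.length; v++) {
            if (counts[v] > 0) {
                System.out.printf("0x%02X: %d%n", v, counts[v]);
            }
        }
    }
}
```

Because no decoding is involved, this works identically on a text file and an MP3, which is what makes task 0 well-posed. Task 1 differs only in wrapping the stream in a Reader with the agreed encoding before counting.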
-- chris
Dale King - 10 Jun 2006 19:16 GMT > Exactly the same problem applies to the Perl program. I don't know enough > about modern Perl to know whether it is even /possible/ to solve it correctly > in Perl. I'm pretty sure it was impossible when I last looked at Perl (and > shuddered and looked away again quick), but Perl has changed a lot since then. I don't know for certain with Perl, but when I was doing a Linux install this week I noticed that one of the packages installed was perl-unicode, which reminded me of this thread. So it would seem that Perl has some form of support for Unicode.
 Signature Dale King
Rhino - 05 Jun 2006 15:11 GMT >A friend and I are having a friendly competition that is causing me some >conceptual confusion. I am hoping someone can help me clarify things a >little. [snip]
Thank you all for your very valuable and helpful replies to my question. I concur completely with Chris Smith that this is entirely a problem of defining the requirements of the program and does not demonstrate any inadequacy with the Java language or the API.
My friend and I will discuss this and figure out how to make the problem solvable. This really is a friendly challenge and neither of us is looking to make this into a major consumer of our time. I expect that we will either restrict the files to particular formats or insist that the type of file be supplied as an input parameter.
-- Rhino