Java Forum / General / April 2006

Strings and binary data

aaronfude@gmail.com - 20 Apr 2006 02:39 GMT
Hi,

Are strings designed to hold binary data? For example, if I read an
arbitrary file into a String and then print (or, I guess, write?) the
String to another file, are those files guaranteed to be identical?

On a somewhat related subject, what is an "encoding"? Meaning, when
does it enter into consideration? I always thought of files as just
collections of bytes.

Thanks!

Aaron Fude
David Wahler - 20 Apr 2006 04:18 GMT
> Are strings designed to hold binary data? For example, can I read an
> arbitrary file into a String and then print (or, I guess, write?) the
[quoted text clipped - 3 lines]
> does it enter into consideration? I always thought of files as just
> collections of bytes.

In Java, a string is a sequence of characters while a file is a
sequence of bytes. Make sure you know what type of data you're dealing
with at any given time, because characters and bytes do not have a
one-to-one correspondence; when saving textual data to a file, you have
to choose an encoding to translate characters to bytes, and likewise
for reading from a file. The InputStreamReader and OutputStreamWriter
classes can generally handle the encoding and decoding for you.
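
A minimal sketch of that idea, assuming Java 7 or later (StandardCharsets is newer than this thread) and using in-memory streams so it runs anywhere. The class and method names are made up for illustration; for real files you would wrap a FileOutputStream or FileInputStream instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not from the original post.
class StreamEncodingSketch {

    // Characters -> bytes: the OutputStreamWriter applies the chosen encoding.
    static byte[] charsToBytes(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and so flushed) by this point.
        return sink.toByteArray();
    }

    // Bytes -> characters: the InputStreamReader applies the chosen decoding.
    static String bytesToChars(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String word = "Sn\u00F8rk"; // five characters, one of them non-ASCII
        byte[] encoded = charsToBytes(word, StandardCharsets.UTF_8);
        System.out.println("characters: " + word.length() + ", bytes: " + encoded.length);
        System.out.println("round trip ok: " + word.equals(bytesToChars(encoded, StandardCharsets.UTF_8)));
    }
}
```

The text survives the round trip only because the same charset is named on both sides; the number of bytes and the number of characters need not match.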

Joel Spolsky's article on Unicode is a good introduction to these
concepts:
<http://www.joelonsoftware.com/articles/Unicode.html>

-- David
Shin - 20 Apr 2006 04:46 GMT
> Hi,
>
[quoted text clipped - 5 lines]
> does it enter into consideration? I always thought of files as just
> collections of bytes.

Everything boils down to 1s and 0s in a computer, so in that sense strings do
hold "binary data".  But I guess you are asking a different question.
If you read the Java API documentation for the classes "String" and
"Character", you will see that whatever a "String" holds is in the so-called
UTF-16 encoding.  That is to say, most characters in the string take one
16-bit "char" to represent.  Other characters, those outside the Basic
Multilingual Plane such as some rare Chinese ideographs, need two 16-bit
"char" values (a surrogate pair).  Other encoding schemes use different
bit patterns to represent characters; for example, in the common ASCII
encoding, an English letter is stored in 8 bits (strictly, 7 bits).

When you normally read a file, it's read as a so-called "byte stream",
i.e., a 0 bit in the file will be read in as a 0 bit.  Usually the data is
read in larger units, for example one "byte" or one "int" at a time.
You can read all the data in the file faithfully into a byte array or an
int array, whichever you find convenient to use later.  If the file
is a text file, you might want to read it into Strings.  In that case
it's read as a so-called "character stream", and you have to tell Java
what encoding scheme the file uses, i.e., how characters are represented
as bits in the file.  (If you don't, Java assumes the platform default
encoding; it does not detect the encoding for you.)  Since a Java "String"
always uses UTF-16, a conversion happens automatically in the process.
See the Java API doc for "InputStreamReader".
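
For the byte-stream case, a rough sketch of a faithful copy might look like the following. The class and method names are invented for illustration, and the in-memory streams in main stand in for a FileInputStream and FileOutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical helper class, not from the original post.
class ByteCopySketch {

    // Copies every byte unchanged; no charset is involved at any point.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Arbitrary binary data, including values that are not valid text in many encodings.
        byte[] original = {0x00, (byte) 0xFF, 0x10, (byte) 0x80, 0x7F};
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(original), sink);
        System.out.println(copied + " bytes copied, identical: "
                + Arrays.equals(original, sink.toByteArray()));
    }
}
```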

Hope this answers some of your questions.

-Shin
Roedy Green - 20 Apr 2006 06:37 GMT
>Are strings designed to hold binary data

no.  See http://mindprod.com/jgloss/encoding.html
also http://mindprod.com/jgloss/binary.html

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Chris Uppal - 20 Apr 2006 13:05 GMT
> Are strings designed to hold binary data? For example, can I read an
> arbitrary file into a String and then print (or, I guess, write?) the
> String to another file, are those files guaranteed to be identical?

No.  Strings are designed to hold textual data, and that /always/ is subject to
some form of transformation (possibly an identity transformation) when it is
converted to or from binary.

You can use instances of java.lang.String (or char[]) to hold arbitrary
unsigned 16-bit data, but you should be careful not to let the "system" try to
interpret it as text (which it will do if you try to write a String out).  In
general, unless you happen to have a specific need for unsigned 16-bit
quantities, it's better to stick with a pure binary representation (such as a
byte[] array) in such cases.
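
To make the danger concrete, here is a small sketch (hypothetical class and method names, Java 7+ assumed for StandardCharsets) that pushes a few arbitrary bytes through a String using UTF-8 and checks whether they come back unchanged. The sample bytes are simply chosen to be invalid as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demonstration class, not from the original post.
class BinaryThroughStringSketch {

    // Decode bytes to a String and immediately encode them again.
    static byte[] roundTripViaString(byte[] data, Charset charset) {
        String asText = new String(data, charset); // bytes interpreted as text
        return asText.getBytes(charset);           // text turned back into bytes
    }

    public static void main(String[] args) {
        // Byte values that do not form valid UTF-8 sequences.
        byte[] binary = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        byte[] back = roundTripViaString(binary, StandardCharsets.UTF_8);
        // Invalid sequences are replaced during decoding, so the data is damaged.
        System.out.println("identical after round trip: " + Arrays.equals(binary, back));
    }
}
```

Keeping the data in a byte[] from start to finish avoids the problem entirely.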

> On a somewhat related subject, what is an "encoding"? Meaning, when
> does it enter into consideration? I always thought of files as just
> collections of bytes.

Files /are/ just collections of bytes ;-)

But Strings, char[] arrays, and even char values, are not.  They are
representations of textual data.  The textual data has a meaning above and
beyond what is in its representation.  Think of it like this, say we start with
a word:
   Snark

If we want to manipulate that in a Java program, then we'll probably use a
String object:
   "Snark"

(Internally that is represented in the computer's memory by a sequence of
unsigned 16-bit integer values:
   0x0053 0x006E 0x0061 0x0072 0x006B
but that is not important to you for most purposes -- what matters are the
characters in the string, not how they are represented physically.)

Now suppose we want to put that word, Snark, into a file.  It's an abstraction,
not something that's made of bits and bytes, so we have to /represent/ it as
such before we can put those bytes into a file.  This applies especially in
Java (which distinguishes between String and binary data better than many
languages).  When you put the word into a file you have to choose a
representation -- that's to say a mapping from abstract characters (or
whatever) to actual bytes.  One widely used representation is ASCII which
assigns byte values to a small set of the characters used in English.
Another, which covers roughly the same range, but uses different numerical
values is EBCDIC.  These mappings are called character encodings, character
sets (charsets), or sometimes "code pages".  The big daddy of character
encodings is Unicode (of which more below).

Let's say that we want to represent the word Snark as ASCII in a file.  The
corresponding bytes would be:
   0x53 0x6E 0x61 0x72 0x6B
If we wanted to use EBCDIC then the bytes would be different (I can't be
bothered to look up what they would be).  If we wanted to use the Unicode format
called UTF-16, then we have two variants, one using Intel byte order
(little-endian):
   0x53 0x00 0x6E 0x00 0x61 0x00 0x72 0x00 0x6B 0x00
the other using "network byte order" (big-endian)
   0x00 0x53 0x00 0x6E 0x00 0x61 0x00 0x72 0x00 0x6B
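
These byte sequences can be reproduced with String.getBytes, as in the sketch below. The class name and the hex helper are made up for illustration, and StandardCharsets assumes Java 7 or later. EBCDIC is left out because no EBCDIC charset is guaranteed to be present on every Java installation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class, not from the original post.
class SnarkBytesSketch {

    // Formats bytes as "0x53 0x6E ..." to match the notation used in this post.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "Snark";
        System.out.println("US-ASCII: " + hex(word.getBytes(StandardCharsets.US_ASCII)));
        // The explicit LE/BE charsets write no byte-order mark.
        System.out.println("UTF-16LE: " + hex(word.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("UTF-16BE: " + hex(word.getBytes(StandardCharsets.UTF_16BE)));
    }
}
```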

The "Snark" example doesn't really show what's going on.  So let's rename
Snarks:
   Snørk

(On this machine that's using an o-with-a-slash-through-it instead of the 'a'.
I hope that's how it looks where you are reading it, if not then just pretend
it does...)

The corresponding Java String object would contain the integer values:
   0x0053 0x006E 0x00F8 0x0072 0x006B

Now if I want to write our new word to a file, then I may have a problem.  I
can't use the ASCII representation, because ASCII doesn't have a mapping for
the slashed-o character!  So I have to use a different mapping.  My machine is
set up, as it happens, to use a mapping called 'windows-1252' (which is one of
the Microsoft code pages), it is almost identical to the ISO charset called
ISO-8859-1.  In either of those our word would be represented as:
   0x53 0x6E 0xF8 0x72 0x6B

(The similarity to Java's internal representation is mostly just coincidental.)
But if I were using a different machine, one which used a different
representation by default, then the word would be represented by different
bytes.  For instance, if this machine were set up to expect me to be writing
Polish (Windows code page 1250, or ISO-8859-2) then I'd be in trouble again,
because neither of those code pages has a representation of that character
(they use the number 0xF8 to represent the letter r-with-a-caron instead).  So
if someone in Poland attempted to read the file that I wrote in code page
windows-1252, then they wouldn't see the right characters if their machine was
using the Windows-1250 encoding.
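
A short sketch of that mismatch: the same five bytes decoded under two different charsets. The class and method names are hypothetical. ISO-8859-1 is guaranteed to exist in every Java runtime; ISO-8859-2 is assumed to be installed (it usually is), so the code checks before using it.

```java
import java.nio.charset.Charset;

// Hypothetical demonstration class, not from the original post.
class CodePageMismatchSketch {

    // Interprets the same bytes as text under the named charset.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The bytes written on a machine using ISO-8859-1 (or windows-1252).
        byte[] fileBytes = {0x53, 0x6E, (byte) 0xF8, 0x72, 0x6B};

        System.out.println("as ISO-8859-1: " + decode(fileBytes, "ISO-8859-1"));
        if (Charset.isSupported("ISO-8859-2")) {
            // Same bytes, different mapping: 0xF8 becomes r-with-caron.
            System.out.println("as ISO-8859-2: " + decode(fileBytes, "ISO-8859-2"));
        }
    }
}
```

Nothing in the bytes themselves says which interpretation is the right one; that knowledge has to come from outside the file.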

So there are two problems with code pages and charsets generally.  One is that
they don't contain the same characters, and the other is that they may not map
the same characters to the same numbers.  That's why you always have to specify
a charset when you are converting between binary data and textual data in Java
(even if you don't specify one explicitly, the system will be using a
default -- which might or might not be correct).

This is where Unicode comes in.  It provides a fixed mapping that is supposed
to be complete (for some given meaning of "complete") and universal.  So the
problems of knowing which charset to use just go away.  But there are still two
problems: one is that not everybody uses Unicode, so you will almost certainly
have to deal with text files containing ISO-8859-1 data (for instance), as well
as nice reliable Unicode -- in fact they are so common that Unicode can't even
be made the default :-(   The second problem is that there are a /lot/ of
Unicode characters defined, too many to fit into 8-bits, or even into 16.  So
Unicode defines several physical representations of the abstract numbers, which
have various tradeoffs between complexity and space.  In the physical encoding
known as "UTF-8", for instance, which attempts to provide a compact
representation of mostly English text, our word would be written to file as:
   0x53 0x6E 0xC3 0xB8 0x72 0x6B

In the encoding known as UTF-16, which is optimised for text which mostly needs
16-bits per character, there are two variants, big-endian and little-endian.
The little endian representation is:
   0x53 0x00 0x6E 0x00 0xF8 0x00 0x72 0x00 0x6B 0x00
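
The sketch below reproduces the two Unicode byte sequences above and also shows what happens with ASCII, which has no mapping for the slashed o. The class name and helper are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class, not from the original post.
class SnorkBytesSketch {

    // Formats bytes as "0x53 0x6E ..." to match the notation used in this post.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "Sn\u00F8rk"; // \u00F8 is the o-with-a-slash character

        // UTF-8 needs two bytes for the slashed o, one for each of the others.
        System.out.println("UTF-8:    " + hex(word.getBytes(StandardCharsets.UTF_8)));
        // UTF-16LE uses two bytes per character here, low byte first.
        System.out.println("UTF-16LE: " + hex(word.getBytes(StandardCharsets.UTF_16LE)));

        // ASCII cannot represent the character at all.
        boolean fits = StandardCharsets.US_ASCII.newEncoder().canEncode(word);
        System.out.println("encodable as US-ASCII: " + fits);
        // getBytes silently substitutes '?' (0x3F) for what it cannot encode.
        System.out.println("US-ASCII: " + hex(word.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The silent substitution in the last line is a common way for text to get damaged without any exception being thrown.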

(BTW.  The variations I have shown are all quite similar -- that's because most
character encodings tend to be similar for English characters.  The further
away from English you get, the more the various encodings diverge.)

Finally we come to the bottom line.  Even with Unicode, you always have to have
a mapping between text and binary.  If you get the mapping wrong then you are
in trouble.  Don't, if you can possibly help it, manipulate binary data as
text, or textual data as binary.

   -- chris

