Java Forum / Databases / February 2008
Encoding conversion problem
Andrea - 11 Feb 2008 12:03 GMT Hi, I have a J2EE application which connects to a DB2 configured with code set IBM-850. The application works with encoding ISO-8859-1. If I save characters outside the range supported by IBM-850 (i.e. the euro currency character EURO) then I read garbage...
I tried encoding conversions with InputStreamReader and OutputStreamWriter: ... BufferedReader reader = new BufferedReader(new InputStreamReader(source, "IBM850")); BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(output, "ISO-8859-1")); ...
but that didn't work... My JVM Charset.availableCharsets() includes IBM850.
What can I do?
Thanks, in advance, Andrea
Lothar Kimmeringer - 12 Feb 2008 07:13 GMT > I have a J2EE application which connects to a DB2 configured with code > set IBM-850. The application works with encoding ISO-8859-1. In general the JDBC-driver is aware of the encoding, the database is using and is doing the conversion already if you access the column by getString(columnName/index).
> I tried encoding conversions with InputStreamReader and > OutputStreamWriter: > ... > BufferedReader reader = new BufferedReader(new > InputStreamReader(source, "IBM850")); What is source? How do you create that from the JDBC- resultset?
> BufferedWriter writer = new BufferedWriter(new > OutputStreamWriter(output, "ISO-8859-1")); That looks OK.
Regards, Lothar
 Signature Lothar Kimmeringer E-Mail: spamfang@kimmeringer.de PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)
Always remember: The answer is forty-two, there can only be wrong questions!
Andrea - 12 Feb 2008 08:25 GMT Hi Lothar,
> > I have a J2EE application which connects to a DB2 configured with code > > set IBM-850. The application works with encoding ISO-8859-1. > > In general the JDBC-driver is aware of the encoding, the database > is using and is doing the conversion already if you access the > column by getString(columnName/index). Yes I fetch the string with Resultset.getString(index). I use DB2 Universal Driver with a type 4 connection.
> > I tried encoding conversions with InputStreamReader and OutputStreamWriter: > > ... [quoted text clipped - 3 lines] > What is source? How do you create that from the JDBC- > resultset? I tried: InputStream source = new ByteArrayInputStream(stringFetchedFromDB.getBytes());
Thanks, Andrea
Lothar Kimmeringer - 14 Feb 2008 19:11 GMT >> What is source? How do you create that from the JDBC- >> resultset? > > I tried: > InputStream source = new > ByteArrayInputStream(stringFetchedFromDB.getBytes()); getBytes() uses the system-encoding for generating the byte-array. Why do you generate an InputStream anyway?
What you want to do is OutputStreamWriter osw = new OutputStreamWriter( output, "8859_1"); osw.write(resultset.getString("mycolumn"));
BTW: The Euro is not part of ISO-8859-1, so it will get lost that way anyway.
Regards, Lothar
 Signature Lothar Kimmeringer E-Mail: spamfang@kimmeringer.de PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)
Always remember: The answer is forty-two, there can only be wrong questions!
Sabine Dinis Blochberger - 12 Feb 2008 09:33 GMT > Hi, > I have a J2EE application which connects to a DB2 configured with code > set IBM-850. The application works with encoding ISO-8859-1. > If I save characters outside the range supported by IBM-850 (i.e. the > euro currency character EURO) then I read garbage... Yes, the Euro symbol is not part of the encodings, so your database can't contain it. If you need it, you would have to change the databases encoding (ISO-8859-15 includes the Euro symbol).
Otherwise, you have to take care not to try to write unsupported character into string/character fields.
One solution could be to parse all strings and replace the symbol with the shorthand "EUR", but it might not be acceptable to your client.
 Signature Sabine Dinis Blochberger
Op3racional www.op3racional.eu
Andrea - 12 Feb 2008 11:22 GMT > > ... > > If I save characters outside the range supported by IBM-850 (i.e. the > > euro currency character EURO) then I read garbage... > > Yes, the Euro symbol is not part of the encodings, so your database > can't contain it. I've found a strange thing: C and COBOL application can write and read (using embedded SQL) characters outside the accepted range without problems... So the database can contain those characters without loosing any information, but I can't understand how...
> If you need it, you would have to change the databases > encoding (ISO-8859-15 includes the Euro symbol). [quoted text clipped - 3 lines] > One solution could be to parse all strings and replace the symbol with > the shorthand "EUR", but it might not be acceptable to your client. Actually the EURO character is just an example, I have more complex strings to handle (and I can't change the encoding of the database). If my problem has no solution at all then I'd like to understand why other languages don't have this problem...
Thanks, Andrea
Sabine Dinis Blochberger - 12 Feb 2008 13:02 GMT > > > ... > > > If I save characters outside the range supported by IBM-850 (i.e. the [quoted text clipped - 6 lines] > problems... So the database can contain those characters without > loosing any information, but I can't understand how... Yes, in theory you can store any value (0 - 255 in case of one byte strings) in a string, but how that is interpreted (i.e. encoding) is where it gets hairy. Also, multibyte characters would break the interpretation.
> > If you need it, you would have to change the databases > > encoding (ISO-8859-15 includes the Euro symbol). [quoted text clipped - 7 lines] > If my problem has no solution at all then I'd like to understand why > other languages don't have this problem... Ah, there is always hacks around limitations. But they aren't usually pretty. The problem is to funnel a string with these "unsupported" characters through the JDBC driver (both ways).
You might get around it by using typeless fields (you can put any byte sequence there), like BLOBS maybe...
Or you write a parser that substitutes the impossible characters with acceptable replacements. Of course, this is most likele not feasable.
But the customer has to be aware that a database with encoding X can only hold strings encoded in X. If they need UTF-8 for example now, they will eventually have to change their database. And it would be better to migrate to a suitable encoding than to hack around it and in a few years, have to do all over again (and then some), when they finally do want to change the database encoding.
On other languages not having the problem, in C, you can treat a string just like an array of bytes and use those for whatever you like, the compiler won't complain. Even interpreting them as memory addresses is possible, adding and subtracting etc...
> Thanks, > Andrea
 Signature Sabine Dinis Blochberger
Op3racional www.op3racional.eu
Andrea - 12 Feb 2008 14:33 GMT Hi Sabine, thank you for your explanation, now the overall situation is much more clear to me.
Thanks, Andrea
Roedy Green - 12 Feb 2008 18:07 GMT On Mon, 11 Feb 2008 04:03:47 -0800 (PST), Andrea <tol7481@iperbole.bologna.it> wrote, quoted or indirectly quoted someone who said :
>I have a J2EE application which connects to a DB2 configured with code >set IBM-850. The application works with encoding ISO-8859-1. >If I save characters outside the range supported by IBM-850 (i.e. the >euro currency character EURO) then I read garbage... First, make sure the data are truly encoded in IBM-850. See http://mindprod.com/applet/encodingrecogniser.html
If there are characters int that file outside the range of IBM-850, then by definition the file is not encoded in IBM-850 and you SHOULD expect garbage.
You can write your own translate program to handle the excess chars.
see http://mindprod.com/jgloss/encoding.html
I don't know how to hook it in as an official encoding, but that is not necessary. --
Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Roedy Green - 12 Feb 2008 18:10 GMT On Mon, 11 Feb 2008 04:03:47 -0800 (PST), Andrea <tol7481@iperbole.bologna.it> wrote, quoted or indirectly quoted someone who said :
>BufferedReader reader = new BufferedReader(new >InputStreamReader(source, "IBM850")); >BufferedWriter writer = new BufferedWriter(new >OutputStreamWriter(output, "ISO-8859-1")); Your first task is to find out just what you are being handed before you start fooling around with translations.
Unicode, IBM850, ISO-8859-1, something else? --
Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Andrea - 13 Feb 2008 11:22 GMT Hi Roedy, the database (DB2) has this configuration: ... Database territory = US Database code page = 850 Database code set = IBM-850 ...
I've exported to a file the content of a table with a CHAR(N) field containing the EURO currency character, then I've opened the file with EncodingRecognizer: if I choose IBM850 I see a strange character (like a small X), if I choose ISO-8859-1 I see a square.
I tried a translation with:
String problematicString = rs.getString(index); problematicString = new String(problematicString, "IBM850"); // Am I correct?
but I still get garbage :-(
Thanks, Andrea
Silvio Bierman - 13 Feb 2008 11:36 GMT > Hi Roedy, > the database (DB2) has this configuration: [quoted text clipped - 19 lines] > Thanks, > Andrea Those are quite interesting database configuration parameters. Sounds like a pre-Unicode setup to me...
Silvio Bierman
Silvio Bierman - 13 Feb 2008 11:48 GMT > Hi Roedy, > the database (DB2) has this configuration: [quoted text clipped - 19 lines] > Thanks, > Andrea You should look at the numeric byte values in problematicString. That could give you an idea of what you are dealing with although it will only disclose what your JDBC driver has made of it. It might have already done an incorrect interpretation of a byte sequence. Things could be even worse, the data could already have been mutilated during insertion in the database when some program (possibly the same program + JDBC driver?) put the data in. If the database encoding does not support all characters that where in the original data then that is what most likely happened.
Really, as I said in my previous post you should consider (a) going to a different database that supports Unicode (b) refrain from using/supporting non ASCII characters in your application or (c) do what others have suggested and do your own translation from Unicode -> ASCII -> DB -> ASCII -> Unicode. The latter option is only realistic if you have wrapped all your JDBC code in some generic wrappers (which is usually a good idea) so you can handle this locally.
Good luck,
Silvio
Andrea - 13 Feb 2008 14:22 GMT Hi Silvio, the settings are taken from the DB2 instance of a customer (and I can't change them). The very same code works, of course, without problems with a DB2 instance configured with ISO-8859-1.
The problem arises also when a C program stores to DB a string with non-IBM850 valid characters: another C program can read the string without problems while Java can't; so the string is not corrupted when saved to DB but someone (JDBC driver? Java I/O?) looses something when I read the field with Resultset.getString(int index) and I can't convert it correctly (or I haven't found the right way to do it yet, if it exists...).
BTW: a test has been made on DB2 with a table with a field declared CHAR(n) FOR BIT DATA and Java code works without problems reading and writing non-IBM850 characters.
Having read your feedback (THANK YOU EVERYONE!) I would say that there's no way to read back those characters in Java in my application.
Thanks again, Andrea
Silvio Bierman - 13 Feb 2008 15:39 GMT Hello Andrea,
> Hi Silvio, > the settings are taken from the DB2 instance of a customer (and I > can't change them). The very same code works, of course, without > problems with a DB2 instance configured with ISO-8859-1. I already expected you could not change this but it was worth suggesting...
I do not really understand why a Euro sign would work with 8859-1 since that does not contain that character as far as I am aware of.
> The problem arises also when a C program stores to DB a string with > non-IBM850 valid characters: another C program can read the string [quoted text clipped - 3 lines] > convert it correctly (or I haven't found the right way to do it yet, > if it exists...). It will probably depend on how you access the DB from C (which I have never done with DB2) but it does not sound surprising that the C binding would just pass through byte sequences. As long as you stay inside 8-bit character encoding and always interpret them in the same codepage then this usually works (which is why many programmers are totally unaware of character encoding issues).
Since Java programs handle strings as sequences of characters from the Unicode character set all interfaces with external character storage needs to be done encoding aware. This means that when doing input Java String objects are created from byte sequences ALWAYS assuming an encoding, if not explicitly specified defaulting to the platform default. The other way around String objects are converted to byte sequences when doing output as well ALWAYS using an encoding, again defaulting to the platform default.
> BTW: a test has been made on DB2 with a table with a field declared > CHAR(n) FOR BIT DATA and Java code works without problems reading and > writing non-IBM850 characters. That makes sense. What I said before means that if you do not specify different encodings the same default will be used for input and output. If the storage medium leaves the intermediate bytes alone (which any binary database column type would do, just like a binary file would) then output will again match input, again allowing encoding unaware programmers to get away with it.
> Having read your feedback (THANK YOU EVERYONE!) I would say that > there's no way to read back those characters in Java in my > application. The problem is probably that the default encoding in Java turns your Unicode character into something the database is unwilling to store as is in this encoding and therefore mutilates trying to fix something. When queried the database then returns a byte sequence different from what was put in. After that modification the JDBC driver has no chance of restoring the original value.
> Thanks again, > Andrea Best regards,
Silvio
Roedy Green - 13 Feb 2008 16:38 GMT On Wed, 13 Feb 2008 16:39:59 +0100, Silvio Bierman <sbierman@jambo-software.com> wrote, quoted or indirectly quoted someone who said :
>I do not really understand why a Euro sign would work with 8859-1 since >that does not contain that character as far as I am aware of. You could do an experiment. Try feeding your database all possible unicode chars in a set of 1-char records, and see which ones come back unmangled. This is a kludge, but you could preconvert your Euro to one of those invariant unused chars. --
Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Andrea - 14 Feb 2008 16:00 GMT Hi everyone, sorry for my previous double-post (a mistake).
>Is is possible to ask the database driver to do the conversions for >you? Perhaps internally it is Unicode or some other encoding that can >deal with Euros. I've checked the properties of the JDBC driver I use (http:// publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/ com.ibm.db2.udb.doc/ad/rjvdsprp.htm) but there's nothing concerning encoding conversions.
>We have the clue that C++ programs seem to store euro s and get them back out. Yes we have C and COBOL programs that can store and write non-IBM850 chars without problems too. As pointed out by Sabine in her post the reason may be that C programs work with the pure sequences of bytes, without performing any encoding conversion.
>>I do not really understand why a Euro sign would work with 8859-1 since >>that does not contain that character as far as I am aware of. SORRY SORRY SORRY SORRY SORRY I tried to insert (through JDBC) the EURO character in a DB2 configured with ... Database territory = C Database code page = 819 Database code set = ISO8859-1 ... and I can't neither write nor read in Java the EURO character correctly :-( A COBOL program works instead correctly.
Then I tried the same thing on a SQL-Server 2000 instance with collation compatibility_51_409_30003 (correponding to a 1252 codepage, i.e. Latin 1) and I can store and read the EURO character via Java&JDBC.
That doesn't work in Java with Oracle 10g configured with ... NLS_LANGUAGE = AMERICAN NLS_TERRITORY = AMERICA NLS_CHARACTERSET = US7ASCII NLS_LENGTH_SEMANTICS = BYTE ... store&read through COBOL is ok, and in Java I can even write&read accented vowels... even if those characters are outside USASCII7...
>You could do an experiment. Try feeding your database all possible >unicode chars in a set of 1-char records, and see which ones come back >unmangled. This is a kludge, but you could preconvert your Euro to >one of those invariant unused chars. The EURO character is just an example and part of the problem, I can't use this type of kludges. The specific problem is much more complex: a password is crypted and stored to DB with a C program but the crypted chars fall outside IBM850 range and in Java I'm unable to read and decrypt back the string... this works if the database is ISO-8859-1 (that's why I though I were able to write another 'weird' char, the euro char, on an ISO-8859-1 DB, sorry...). I've also the more general problem of data entry: I don't know wich characters users will insert so I can't substitute chars. I've found a workaround for my crypting problem but I'm just trying to understand the reason of the problem.
Now it's clear to me that with a CHAR field Java performs an encoding conversion using the encodings of the JVM and of the DBMS: if some characters fall outside the destination encoding then they are lost (i.e. converted in something completely different). The only 'mysterious' thing for me now is the behavior on Oracle (JDBC can read&write accented vowels even if they are outside ascii7)... any idea? Maybe the Oracle driver is smarter than the DB2 Universal Driver...
Thanks everyone, Andrea
Silvio Bierman - 14 Feb 2008 20:44 GMT > Hi everyone, > sorry for my previous double-post (a mistake). [quoted text clipped - 72 lines] > Thanks everyone, > Andrea Hello Andrea,
Even if you set a database encoding to ASCII it is very unlikely that the DB will strip non-ASCII characters. Actually, most databases treat every byte-size (ie 8-bit) encoding almost identically internally. They may sometimes have different default collations but that is about it. The codepage attribute is mostly important for programs interfacing with the DB. As most of those (especially older ones) are encoding unaware also bytes pass in and out inharmed. In the end all 8-bit encodings are equal until actually interpreted to represent characters, aren't they?
I have seen application running on cp-1252 platforms using 8859-1 encoded databases for years without anyone noticing. Same for cp-1257 on a cp-1252 database. Nobody realy cares when the same data that was put in comes out again.
This is not unlike SMTP which is supposed to be 7-bit only but since the transport encoding passes 8-bit characters freely people are used to sending non-ascii characters in plain-text emails although this is not supported. This all works great until someone from Lithuania sends me an email (I am in the Netherlands).
Regards,
Silvio
Andrea - 15 Feb 2008 11:26 GMT Hi Lothar,
>> I tried: >> InputStream source = new >> ByteArrayInputStream(stringFetchedFromDB.getBytes()); >getBytes() uses the system-encoding for generating the >byte-array. Why do you generate an InputStream anyway? I was just a "desperate" attempt... :-) I tried almost everything...
>What you want to do is >OutputStreamWriter osw = new OutputStreamWriter(output, "8859_1"); >osw.write(resultset.getString("mycolumn")); doesn't work...
Hi Silvio,
>Hello Andrea, >Even if you set a database encoding to ASCII it is very unlikely that >the DB will strip non-ASCII characters. > .... Yes now this is clear to me, thanks!
I was thinking only about the DB encoding while the problem is mainly in the JVM encoding (now it's clear to me that Java can't handle characters outside the encoding of the JVM, I wasn't thinking about it, sorry...).
I've made another test: I've exported the content of the table with the crypted password and I've found that the password I can't decrypt back contains characters between 0x80 and 0x9F, which are control characters in ISO-8859-1 and Java - reads garbage with DB2 configured with IBM850 and JVM ISO-8859-1 - reads correctly with both DB2 and JVM configured with ISO-8859-1
I understand the first behavior but the last point is strange... Java (with some "magic" :-) is doing the proper conversion if the db is iso-8859-1 but I can't understand how... I will test it again and let you know if I find something.
Thanks again everyone!
Andrea
Sabine Dinis Blochberger - 15 Feb 2008 12:02 GMT > I've made another test: I've exported the content of the table with > the crypted password and I've found that the password I can't decrypt [quoted text clipped - 11 lines] > > Andrea I'm guessing, but maybe if the databse tells the JDBC driver it's ISO-8859-1 *and* your application tells it the same encoding, it won't bother trying to transform anything...
 Signature Sabine Dinis Blochberger
Op3racional www.op3racional.eu
Lew - 15 Feb 2008 13:10 GMT > I was thinking only about the DB encoding while the problem is mainly > in the JVM encoding (now it's clear to me that Java can't handle > characters outside the encoding of the JVM, I wasn't thinking about > it, sorry...). "The encoding of the JVM" is Unicode-16 with surrogate pairs; every Unicode character is representable in the JVM, including the Euro character. There is no Unicode character that the JVM cannot represent.
 Signature Lew
Andrea - 15 Feb 2008 14:51 GMT Hi Sabine,
>I'm guessing, but maybe if the databse tells the JDBC driver it's >ISO-8859-1 *and* your application tells it the same encoding, it won't >bother trying to transform anything... Yes that's what I was thinking too... but I tried to change the encoding of the JVM (tried Cp850, ...) but it keeps on working...
Hi Lew,
>> I was thinking only about the DB encoding while the problem is mainly >> in the JVM encoding (now it's clear to me that Java can't handle [quoted text clipped - 3 lines] >character is representable in the JVM, including the Euro character. There is >no Unicode character that the JVM cannot represent. With "encoding of the JVM" I was referring to the file.encoding property used by the JVM. If the JVM runs with: - ISO-8859-1 then I can't read or write the EURO character to DB (it becomes garbage) and ISO-8859-1 doesn't include that character; - Cp1252 then I can read and write the EURO character to DB and Cp1252 includes that character.
Andrea
Lew - 15 Feb 2008 15:00 GMT > Hi Sabine, >> I'm guessing, but maybe if the databse tells the JDBC driver it's [quoted text clipped - 17 lines] > - Cp1252 then I can read and write the EURO character to DB and Cp1252 > includes that character. I uinderstand my confusion now - it stemmed from the phrase "the encoding of the JVM". The JVM itself only uses one encoding; it translates to and from other encoding on I/O. So to make sure I understood you correctly, were you referring to the encoding specified by the I/O call?
Generally if the encoding you specify for I/O is different from the encoding in your data store, it will cause trouble. This is not limited to Java. Over in the Postgres newsgroups one finds people have trouble with character encoding from all sorts of platforms, mostly stemming from trying to store characters in a column that are not part of the specified character encoding for the DB. If such things don't match, then problems will hatch.
 Signature Lew
Andrea - 15 Feb 2008 17:02 GMT Hi Lew,
> I uinderstand my confusion now - it stemmed from the phrase "the encoding of > the JVM". The JVM itself only uses one encoding; it translates to and from > other encoding on I/O. So to make sure I understood you correctly, were you > referring to the encoding specified by the I/O call? Maybe I don't understand :-( In my posts I tried to specify the encoding of the DBMS and the JVM encoding (i.e. the system property file.encoding) in the different cases and, as you stated, the JVM performs the necessary translations. In my JDBC calls I don't specify/force any encoding.
> Generally if the encoding you specify for I/O is different from the encoding > in your data store, it will cause trouble. This is not limited to Java. Over > in the Postgres newsgroups one finds people have trouble with character > encoding from all sorts of platforms, mostly stemming from trying to store > characters in a column that are not part of the specified character encoding > for the DB. If such things don't match, then problems will hatch. I disagree... For our application we keep the DBMS with a fixed encoding and the application performs the necessary conversions. For instance for polish installations we use an ISO-8859-1 database and an application server configured with ISO-8859-2 where we store polish characters without problems.
Andrea
Silvio Bierman - 15 Feb 2008 17:16 GMT > Hi Lew, > [quoted text clipped - 22 lines] > > Andrea Hello Andrea,
Using an incomplete database encoding that does not match the application is a dangerous practice that only works by a coincidence. It is just because the encodings you mention are 8-bit complete that you can pass in and out 8-bit values untouched, even though they really can mean something different inside the DB from how you interpret them in your application. With little exception it is never a good idea to use a database encoding other then a complete encoding like UTF-8 (to be honest you should always use this in a DB if you are left the choice). If the applications are Java based you will have a perfect match that way and only when you do stuff like generating emails, plain-text files or Office documents (shivers) where incomplete encodings (code pages) come into the game you have to take extra measures. Using the platform encoding then is usually a good idea unless it is a server application in which case you have to make an educated guess.
Regards,
Silvio
Roedy Green - 13 Feb 2008 16:36 GMT On Wed, 13 Feb 2008 06:22:22 -0800 (PST), Andrea <tol7481@iperbole.bologna.it> wrote, quoted or indirectly quoted someone who said :
>Hi Silvio, >the settings are taken from the DB2 instance of a customer (and I >can't change them). The very same code works, of course, without >problems with a DB2 instance configured with ISO-8859-1. Is is possible to ask the database driver to do the conversions for you? Perhaps internally it is Unicode or some other encoding that can deal with Euros. We have the clue that C++ programs seem to store euro s and get them back out.
I am puzzled. I thought JDBC always talked to you in Unicode. --
Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Andrea - 13 Feb 2008 15:28 GMT Hi Silvio, the settings are taken from the DB2 instance of a customer (and I can't change them). The very same code works, of course, without problems with a DB2 instance configured with ISO-8859-1.
The problem arises also when a C program stores to DB a string with non-IBM850 valid characters: another C program can read the string without problems while Java can't; so the string is not corrupted when saved to DB but someone (JDBC driver? Java I/O?) looses something when I read the field with Resultset.getString(int index) and I can't convert it correctly (or I haven't found the right way to do it yet, if it exists...).
A test has been made on DB2 with a table with a field declared CHAR(n) FOR BIT DATA and Java code works without problems reading and writing non-IBM850 characters.
Having read your feedback (THANK YOU EVERYONE!) I would say that there's no way to read back those characters in Java in my application.
Thanks again, Andrea
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|