Hello,
I am having a problem when inputting very long strings into a database.
The application I am writing can use different databases (thanks to
the wonders of JDBC) so this issue has been causing problems on both
Oracle and SQL Server.
Because one of the design objectives was to support any JDBC-compatible
database, a concern was raised about text widths. It was therefore
decided that the maximum column width for a VARCHAR would be a
configurable value. We theoretically knew that data could be more than
a single line so we introduced a sequence number to allow multiple
rows. (Don't ask me why we didn't use CLOBs instead; this is the
schema I'm stuck with.)
We now need to store base64 data in the same fields. The problem is
that, in one example, a string of 4000 characters as counted by the
Java String object has a physical size of approximately 4430. This
seems to be because of the amount of mark-up involved, either in the
base64 data or possibly in the text between.
It occurs to me that while a non-ASCII value may be only a single
character in a Unicode string, it is 6 characters in UTF-8. Therefore
I'm looking for a way of calculating the absolute length, rather than a
count of characters.
Is this possible or will I have to change the schema?
Hybris - 09 Jan 2007 13:36 GMT
On Tue, 09 Jan 2007 04:34:45 -0800, james.w.appleby wrote:
> I'm looking for a way of calculating the absolute length, rather than a
> count of characters.
See the String method getBytes().
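As a minimal sketch of that suggestion: the class and method names below are made up for illustration, and UTF-8 is only an assumption about how the database stores the column. Passing the charset explicitly avoids depending on the platform default encoding.

```java
import java.nio.charset.StandardCharsets;

class ByteLength {
    // Number of bytes the string occupies once encoded as UTF-8.
    // UTF-8 is an assumption here; use the charset your database actually uses.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";          // 5 chars, all ASCII
        String accented = "h\u00e9llo";  // 5 chars, one non-ASCII character (e-acute)
        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " bytes");
        System.out.println(accented.length() + " chars, " + utf8Length(accented) + " bytes");
    }
}
```

Both strings report a length() of 5, but the second one needs 6 bytes in UTF-8.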
Ian Wilson - 10 Jan 2007 10:38 GMT
> It occurs to me that while a non-ASCII value may be only a single
> character in a Unicode string,
I think you mean the opposite, that an ASCII (not non-ASCII) character
will be represented in UTF-8 using a single *byte*.
> it is 6 characters in UTF-8.
No it isn't. UTF-8 uses a *variable* number of *bytes* for one Unicode
character.
> Therefore
> I'm looking for a way of calculating the absolute length, rather than a
> count of characters.
String has a getBytes() method for this purpose.
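For the multi-row schema described in the question, one possible approach is to split on a byte budget instead of a character count. The sketch below is only an illustration: ByteChunker and split are hypothetical names, the column is assumed to be measured in UTF-8 bytes, and the budget must be at least 4 so that any single code point fits.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ByteChunker {
    // Splits text into pieces whose UTF-8 encoding is at most maxBytes each,
    // without cutting a code point (or a surrogate pair) in half.
    static List<String> split(String text, int maxBytes) {
        if (maxBytes < 4) {
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentBytes = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // Encoded size of this one code point, measured with getBytes().
            int cpBytes = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8).length;
            if (currentBytes + cpBytes > maxBytes) {
                // The next code point would overflow the budget: start a new row.
                chunks.add(current.toString());
                current.setLength(0);
                currentBytes = 0;
            }
            current.appendCodePoint(cp);
            currentBytes += cpBytes;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Each returned piece could then be stored as one row with its sequence number. Base64 output is pure ASCII, so for that data alone the byte count and the character count should match.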
Oliver Wong - 10 Jan 2007 22:01 GMT
>> It occurs to me that while a non-ASCII value may be only a single
>> character in a Unicode string,
[quoted text clipped - 6 lines]
> No it isn't. UTF-8 uses a *variable* number of *bytes* for one Unicode
> character.
And even then, UTF-8 only ranges from 1 to 4 octets. The values start
at 0x000000 and go to 0x10FFFF.
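A quick way to check the 1-to-4 range is to encode a few sample code points and count the bytes. This is just an illustrative sketch; Utf8Widths and utf8Bytes are invented names and the sample code points are arbitrary.

```java
import java.nio.charset.StandardCharsets;

class Utf8Widths {
    // UTF-8 byte count for a single Unicode code point.
    static int utf8Bytes(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One sample from each width class, plus the highest code point.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Bytes(cp));
        }
    }
}
```

The samples come out as 1, 2, 3, 4 and 4 bytes respectively.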
- Oliver
John W. Kennedy - 11 Jan 2007 00:07 GMT
>>> It occurs to me that while a non-ASCII value may be only a single
>>> character in a Unicode string,
[quoted text clipped - 7 lines]
> And even then, UTF-8 only ranges from 1 to 4 octets. The values start
> at 0x000000 and go to 0x10FFFF.
CESU-8 and Java's "Modified UTF-8" use as many as six, because they
first encode characters above U+FFFF as UTF-16, and then UTF-8 encode
the result. "UTF-8", albeit wrongly, is often taken to include one or
both of those schemes, so the incorrect figure of 6 is often encountered.
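The difference can be observed from Java itself, since DataOutputStream.writeUTF writes Modified UTF-8 preceded by a two-byte length. The following is a rough sketch with made-up class and method names, comparing that payload size with standard UTF-8 for one character above U+FFFF.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    // Size of the Modified UTF-8 payload produced by writeUTF,
    // excluding the two-byte length prefix that writeUTF emits first.
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream and short strings.
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2;
    }

    // Size of the same string in standard UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above U+FFFF,
        // so Java stores it as a surrogate pair of two chars.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println("UTF-8:          " + utf8Length(clef) + " bytes");
        System.out.println("Modified UTF-8: " + modifiedUtf8Length(clef) + " bytes");
    }
}
```

Each surrogate is encoded separately as three bytes, which is where the figure of 6 comes from; standard UTF-8 needs 4 bytes for the same character.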

John W. Kennedy
"The blind rulers of Logres
Nourished the land on a fallacy of rational virtue."
-- Charles Williams. "Taliessin through Logres: Prelude"
Manfred Rosenboom - 10 Jan 2007 15:37 GMT
Hi James,
Maybe the following Sun Tech Tip is worth reading:
Tech Tip #1: How long is your String object?
http://java.sun.com/mailers/techtips/corejava/2006/tt0822.html#1
Best,
Manfred