I am using a Java program to read HTML files in their declared encoding.
One of the HTML files contains the following, declared as UTF-8:
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
..........................
<tr>Should â€˜I'm Sorryâ€™ Let Officials off the Hook?</tr>
When I read the file using UTF-8 encoding in Java, I get back the same:
Should â€˜I'm Sorryâ€™ Let Officials off the Hook?
Why isn't it converted to
Should ‘I'm Sorry’ Let Officials off the Hook?
as displayed in the browser?
Thanks for any help rendered.
learnyourabc wrote:
> I am using a Java program to read HTML files in their declared encoding.
> One of the HTML files contains the following, declared as UTF-8:
[quoted text clipped - 8 lines]
> Why isn't it converted to
> Should ‘I'm Sorry’ Let Officials off the Hook? as displayed in browser
The â€˜ are interpreted as UTF-8, that is, as themselves. It is
*before* you get the file you are trying to read, where things go wrong.
I.e., though the file claims to be in UTF-8, it actually isn’t. Or
rather it is, but a bad editor was used, so that the wrong symbols were
written out when the file was saved. Check where the file comes from.
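A minimal sketch of how this kind of damage can arise, assuming the bytes were at some point decoded as windows-1252 (Cp1252). That assumption, the class name MojibakeDemo and the use of StandardCharsets (Java 7 or later) are illustrative and not something stated in this thread. It encodes U+2018 as UTF-8 and then decodes the same bytes with the wrong charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode text as UTF-8, then (wrongly) decode those bytes as windows-1252.
    static String garble(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String garbled = garble("\u2018"); // U+2018 LEFT SINGLE QUOTATION MARK
        // Print code points instead of characters so the console encoding does not matter.
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("U+%04X%n", (int) garbled.charAt(i));
        }
    }
}
```

The three code points printed are U+00E2, U+20AC and U+02DC, which are the three characters seen in the file. If such a string is then saved as UTF-8, the file is valid UTF-8 but contains the wrong characters.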
H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
rossum - 14 Jun 2007 12:38 GMT
>The â€˜ are interpreted as UTF-8, that is, as themselves. It is
>*before* you get the file you are trying to read, where things go wrong.
[quoted text clipped - 3 lines]
>
>H.
I have seen this before in MS Word documents that use "smart-quotes".
Given the positioning of the strange characters, I would suspect that
what is claimed to be UTF-8 is actually MS Word not-quite-UTF-8. As
Hendrik says, you probably need to talk to the person who created the
original.
rossum
Lew - 14 Jun 2007 15:01 GMT
>> The â€˜ are interpreted as UTF-8, that is, as themselves. It is
>> *before* you get the file you are trying to read, where things go wrong.
[quoted text clipped - 8 lines]
> Hendrik says, you probably need to talk to the person who created the
> original.
If those symbols really represented quote marks they wouldn't show up as â€˜
on the newsgroup. The fact that they show up as â€˜ is evidence that they are
valid UTF-8 characters, given that your message was encoded that way to the
group. This supports the conclusion the others have reached.

--
Lew
On Wed, 13 Jun 2007 22:04:45 -0700, learnyourabc
<learnyourabc@yahoo.com> wrote, quoted or indirectly quoted someone
who said :
>I am using a Java program to read HTML files in their declared encoding.
>One of the HTML files contains the following, declared as UTF-8:
[quoted text clipped - 9 lines]
>Should ‘I'm Sorry’ Let Officials off the Hook? as displayed in browser
>Thanks for any help rendered.
First, examine the file with a hex editor to see if it is indeed
UTF-8.
See http://mindprod.com/jgloss/encoding.html
http://mindprod.com/jgloss/hex.html
http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/bom.html
You might write a one-shot Java program to produce that same message in
UTF-8 and compare it byte for byte.
If you don't know how to write such a program, ask the File IO
Amanuensis to generate the code for you. See
http://mindprod.com/applets/fileio.html
It will get it 90% there. You will have to do a few little adjustments
to select your encoding and text message.
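Such a one-shot program might look like the sketch below. It only prints the UTF-8 bytes of the expected headline in hex, so they can be compared by eye with what the hex editor shows for the file. The class name Utf8HexDump is made up for illustration, and StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

final class Utf8HexDump {
    // Render bytes as space-separated upper-case hex, like one row of a hex editor.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u2018 and \u2019 are the left and right single quotation marks.
        String expected = "Should \u2018I'm Sorry\u2019 Let Officials off the Hook?";
        System.out.println(toHex(expected.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In correct UTF-8 the opening quote shows up as the three bytes E2 80 98. If the file instead holds C3 A2 E2 82 AC CB 9C at that position, the garbled characters themselves were saved as UTF-8.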
View the generated file in your browser to make sure your browser is
smart enough to understand UTF-8 encoding. I use Opera, which does.
See http://mindprod.com/jgloss/opera.html
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Christian - 15 Jun 2007 11:27 GMT
Roedy Green wrote:
> On Wed, 13 Jun 2007 22:04:45 -0700, learnyourabc
> <learnyourabc@yahoo.com> wrote, quoted or indirectly quoted someone
> who said :
> http://mindprod.com/applets/fileio.html
> It will get it 90% there. You will have to do a few little adjustments
> to select your encoding and text message.
That's a very bad example for his case.
DataInputStream does not provide UTF-8; it only has Java's modified UTF-8,
which is not UTF-8, just something you can use between Java apps if you
want to support international characters. Otherwise, IMHO, Java's modified
UTF-8 is as useless as §)($&/()§grr.
If you want UTF-8 in Java, you either have to use the String constructor
or use the CharsetDecoder/CharsetEncoder from java.nio.charset.
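The two options mentioned above could be sketched as follows. The class and method names are hypothetical, and StandardCharsets assumes Java 7 or later (on an older JDK, Charset.forName("UTF-8") would take its place). The decoder variant is configured to throw on malformed input, whereas the String constructor silently substitutes a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Decode {
    // Option 1: the String constructor with an explicit charset.
    static String viaString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Option 2: a CharsetDecoder that reports malformed input instead of replacing it.
    static String viaDecoder(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] leftQuote = { (byte) 0xE2, (byte) 0x80, (byte) 0x98 };
        System.out.printf("U+%04X%n", (int) viaString(leftQuote).charAt(0));
        System.out.printf("U+%04X%n", (int) viaDecoder(leftQuote).charAt(0));
    }
}
```

Both lines print U+2018 for these three bytes. The strict decoder is the more useful one when checking whether a file really is UTF-8.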
Christian
Roedy Green - 16 Jun 2007 00:48 GMT
>DataInputStream does not provide utf-8 it only has java's modified utf-8
>which is not utf-8 just something you can use between java apps if you
[quoted text clipped - 3 lines]
>if you want utf-8 in java.. you either have to use the string constructor
>or use the CharsetDecoder/Encoder from java.nio.*
You can also use a Reader with UTF-8. Here is the code from the File
I/O Amanuensis.
// Read Locale-encoded chars ( usually 8 bit ) from a buffered sequential file.
// WARNING! unsigned Applets may not read the local hard disk.
// To copy/download files see http://mindprod.com/products.html#FILETRANSFER.
// import java.io.*;
// O P E N
FileInputStream fis = new FileInputStream( "C:/temp/temp.in" );
InputStreamReader eisr = new InputStreamReader( fis, "UTF-8" );
// usually UTF-8 for Unicode 8-bit giving the full Unicode set, no BOMs.
// Cp437 is IBM OEM.
// See "encoding" in the Java glossary for alternatives
// such as UTF-16BE for 16-bit, big-endian Unicode characters.
BufferedReader br = new BufferedReader( eisr, 4096 /* buffsize */ );
// R E A D
char[] ca = new char[1024];
// -1 means eof.
// You don't necessarily get all you ask for in one read.
// You get what's immediately available.
int charsRead = br.read( ca );
String line;
// File being read need not have a terminal \n.
// File being read may safely use any mixture of \r\n, \r or \n line terminators.
line = br.readLine();
// line == null means EOF
int aChar;
aChar = br.read();
// aChar == -1 means EOF
// C L O S E
br.close();
--------------------------------------
// Write Locale-encoded chars ( usually 8 bit ) into a buffered sequential file.
// WARNING! unsigned Applets may not write to the local hard disk.
// To copy/download files see http://mindprod.com/products.html#FILETRANSFER.
// import java.io.*;
// O P E N
FileOutputStream fos = new FileOutputStream( "C:/temp/temp.out" );
OutputStreamWriter eosw = new OutputStreamWriter( fos, "UTF-8" );
// usually UTF-8 for Unicode 8-bit giving the full Unicode set, no BOMs.
// Cp437 is IBM OEM.
// See "encoding" in the Java glossary for alternatives
// such as UTF-16BE for 16-bit, big-endian Unicode characters.
BufferedWriter bw = new BufferedWriter( eosw, 4096 /* buffsize */ );
PrintWriter prw = new PrintWriter( bw, false /* auto flush on println */ );
// W R I T E
prw.write( "platypus" );
prw.println( 149 );
prw.flush();
// C L O S E
prw.close();
learnyourabc - 21 Jun 2007 15:01 GMT
Finally found the information at
http://home.tiscali.nl/t876506/utf8tbl.html
Searching for 226.128.152 gives:
226.128.152 E28098 â€˜ LEFT SINGLE QUOTATION MARK
The three bytes â€˜ do refer to the LEFT SINGLE QUOTATION MARK.
Why is it that, although I'm using UTF-8 encoding in the String class, it
does not convert to the LEFT SINGLE QUOTATION MARK?
Lew - 21 Jun 2007 15:36 GMT
> Finally found the information at
> http://home.tiscali.nl/t876506/utf8tbl.html
[quoted text clipped - 4 lines]
> Why is it that I'm using UTF-8 encoding in the String class that it
> does not convert to the LEFT SINGLE QUOTATION MARK ???
Maybe because the characters do not comprise three bytes at all, but, what,
seven in UTF-8? Five? I don't remember the rules exactly and I feel too lazy
to look them up right this moment. I'm sure you could do that just as well as
anyone. Regardless, the bytes in your source data do not correspond to the
Unicode code point you seem to think they should.
'â' == '\u00E2'
'€' == '\u20AC' requires more than one byte
'˜' == '\u02DC' requires more than one byte
AFAIR 0xE2 is not a UTF-8 byte value that indicates a three-byte encoded value.
Two questions:
What does a hex editor show you for those values in your source data?
What is the Unicode code point of "LEFT SINGLE QUOTATION MARK"?
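Both questions can be checked with a few lines of Java. This is only a sketch: QuoteBytes is a hypothetical class name and StandardCharsets assumes Java 7 or later. For what it is worth, 0xE2 is 11100010 in binary, and a lead byte of the form 1110xxxx does introduce a three-byte UTF-8 sequence.

```java
import java.nio.charset.StandardCharsets;

final class QuoteBytes {
    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String leftQuote = "\u2018";             // LEFT SINGLE QUOTATION MARK
        String garbled = "\u00E2\u20AC\u02DC";   // a-circumflex, euro sign, small tilde
        System.out.printf("U+%04X%n", leftQuote.codePointAt(0)); // U+2018
        System.out.println(utf8Length(leftQuote));               // 3
        System.out.println(utf8Length(garbled));                 // 7
    }
}
```

So the quotation mark itself is U+2018 and takes three bytes (E2 80 98), while the three garbled characters, taken as characters in their own right, take seven bytes in UTF-8.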

--
Lew
Roedy Green - 22 Jun 2007 11:28 GMT
> Regardless, the bytes in your source data do not correspond to the
>Unicode code point you seem to think they should.
>
>'â' == '\u00E2'
>'€' == '\u20AC' requires more than one byte
>'˜' == '\u02DC' requires more than one byte
See http://mindprod.com/jgloss/utf.html
It includes a description of the UTF-8 algorithm and code to encode/decode.
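As a rough illustration of the three-byte case of that algorithm (this is not the code from the linked page; ManualUtf8 and its methods are hypothetical names), the bit layout 1110xxxx 10xxxxxx 10xxxxxx can be applied by hand:

```java
final class ManualUtf8 {
    // Encode one code point in U+0800..U+FFFF (excluding surrogates) as three UTF-8 bytes.
    static byte[] encode3(int codePoint) {
        boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint < 0x0800 || codePoint > 0xFFFF || surrogate) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),          // 1110xxxx: top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)),  // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))          // 10xxxxxx: low 6 bits
        };
    }

    // Reverse of encode3: rebuild the code point from the payload bits of each byte.
    static int decode3(byte b0, byte b1, byte b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        byte[] bytes = encode3(0x2018);
        System.out.printf("%02X %02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF);
        System.out.printf("U+%04X%n", decode3(bytes[0], bytes[1], bytes[2]));
    }
}
```

For U+2018 this prints E2 80 98 and then U+2018 again, matching the table entry quoted earlier in the thread.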
Joshua Cranmer - 21 Jun 2007 17:25 GMT
> Finally found the information at
> http://home.tiscali.nl/t876506/utf8tbl.html Search for 226.128.152 get
> 226.128.152 E28098 â€˜ LEFT SINGLE QUOTATION MARK The three bytes â€˜
> does refers to the LEFT SINGLE QUOTATION MARK.
For the last time, â€˜ != LEFT SINGLE QUOTATION MARK. The â€˜ is the
Cp1252 misreading of the properly UTF-8 encoded bytes, so something
somewhere is decoding the UTF-8 as Cp1252. The only possible way to get
â€˜ to refer to the LEFT SINGLE QUOTATION MARK is to force something to
treat those bytes not as Cp1252 but as UTF-8.
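If the garbled characters are already in a Java String, one way to "force" that reinterpretation is to turn them back into bytes with windows-1252 and decode those bytes as UTF-8. This is a sketch under the assumption that windows-1252 really was the charset used by mistake; MojibakeRepair is a hypothetical name. Bytes that have no windows-1252 mapping may not survive the round trip, so fixing the source of the file is the better cure.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Undo "UTF-8 bytes decoded as windows-1252" by reversing the wrong step.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(Charset.forName("windows-1252"));
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // a-circumflex, euro sign, small tilde: the garbled form of U+2018.
        String fixed = repair("\u00E2\u20AC\u02DC");
        System.out.printf("U+%04X%n", fixed.codePointAt(0)); // U+2018
    }
}
```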
> Why is it that I'm using UTF-8 encoding in the String class that it does
> not convert to the LEFT SINGLE QUOTATION MARK ???
Chances are that the String class is by default using the Cp1252 encoding
to convert to/from bytes; try manually using the UTF-8 encoding.
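A small sketch to check this on a given machine, with hypothetical names (CharsetReadDemo, readAll) and Java 7+ syntax. It prints the platform default charset and shows the same three bytes decoded once as windows-1252 and once as UTF-8. Note that from JDK 18 onwards the default charset is UTF-8, so the implicit-default problem mostly concerns older runtimes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetReadDemo {
    // Read the whole stream as text using an explicitly named charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("default charset: " + Charset.defaultCharset());
        byte[] leftQuote = { (byte) 0xE2, (byte) 0x80, (byte) 0x98 };
        String wrong = readAll(new ByteArrayInputStream(leftQuote), Charset.forName("windows-1252"));
        String right = readAll(new ByteArrayInputStream(leftQuote), StandardCharsets.UTF_8);
        System.out.println(wrong.length()); // 3 characters: the garbled form
        System.out.println(right.length()); // 1 character: U+2018
    }
}
```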