Java Forum / General / January 2008

SAX and invalid chars

Christian - 10 Jan 2008 01:23 GMT
Hello

My problem is that I have to parse an XML file that contains some invalid
characters (e.g. 0x0E or 0x1E).

Parsing the file as-is fails.
The easy solution I can think of is to pipe the input through a stream
that filters out the low control bytes.
The problem is that if my XML is not in windows-1252 but in some other
character encoding, I might break the encoding by doing this.

Is there a standard solution to this problem?

Christian
Arne Vajhøj - 10 Jan 2008 01:31 GMT
> My problem is that I have to parse an XML file that contains some invalid
> characters (e.g. 0x0E or 0x1E)
[quoted text clipped - 6 lines]
>
> Is there a standard solution to this problem?

Do the same as the XML parser.

Read the XML header and get encoding from there.

Arne
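
A minimal sketch of that suggestion, assuming the file is in an ASCII-compatible encoding so the declaration itself can be read byte by byte. The class and method names (XmlDeclSniffer, declaredEncoding) are hypothetical, and the regular expression is a simplification of the real XML declaration grammar, not what any particular parser does internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the encoding name out of the XML declaration.
final class XmlDeclSniffer {
    // Simplified pattern for: <?xml version="1.0" encoding="..." ?>
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Returns the encoding named in the declaration, or "UTF-8" when the
     * declaration is absent or names no encoding (the default in that case,
     * as long as the file is not UTF-16).
     */
    static String declaredEncoding(byte[] head) {
        // Skip a UTF-8 byte order mark if one is present.
        int start = (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) ? 3 : 0;
        int len = Math.min(head.length - start, 256);
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail.
        String prefix = new String(head, start, len, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prefix);
        return m.find() ? m.group(1) : "UTF-8";
    }
}
```

Passing the first few hundred bytes of the file is enough, since the declaration has to be the very first thing in the document.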
Mike Schilling - 10 Jan 2008 06:00 GMT
>> My problem is that I have to parse an XML file that contains some
>> invalid characters (e.g. 0x0E or 0x1E)
[quoted text clipped - 10 lines]
>
> Read the XML header and get encoding from there.

Which is easy if you know that the XML file is in some superset of
ASCII, since the entire XML header will then be in ASCII.  It's
trickier if the XML file might be in any encoding at all (e.g. EBCDIC,
UTF-16, etc.).  In the latter case, look at Appendix F
(http://www.w3.org/TR/REC-xml/#sec-guessing) for some useful tips.
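
The first step of the Appendix F approach can be sketched roughly as below: look at the first four bytes for a byte order mark or for the byte pattern that "<?xm" takes in each encoding family. The class name EncodingGuess and the returned labels are made up for illustration, and the unusual UCS-4 byte orders listed in the appendix are left out. The result only names a family; the declaration still has to be read in that family to learn the exact encoding.

```java
// Hypothetical helper: guesses the encoding family from the first four bytes.
final class EncodingGuess {
    static String guessFamily(byte[] b) {
        if (b.length < 4) return "unknown";
        int b0 = b[0] & 0xFF, b1 = b[1] & 0xFF, b2 = b[2] & 0xFF, b3 = b[3] & 0xFF;

        // Byte order marks. The UCS-4 marks are tested first because
        // FF FE 00 00 also starts with the UTF-16LE mark FF FE.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UCS-4";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UCS-4";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";

        // No byte order mark: match the encoded form of "<?xm" (or of "<" for UCS-4).
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0x00 && b3 == 0x3C) return "UCS-4";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x00 && b3 == 0x00) return "UCS-4";
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";
        if (b0 == 0x3C && b1 == 0x3F && b2 == 0x78 && b3 == 0x6D) return "ASCII-compatible";
        if (b0 == 0x4C && b1 == 0x6F && b2 == 0xA7 && b3 == 0x94) return "EBCDIC";

        return "unknown";
    }
}
```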
Christian - 11 Jan 2008 11:29 GMT
Mike Schilling wrote:
>>> My problem is that I have to parse an XML file that contains some
>>> invalid characters (e.g. 0x0E or 0x1E)
[quoted text clipped - 15 lines]
> UTF-16, etc.)  In the latter case, look at Appendix F
> (http://www.w3.org/TR/REC-xml/#sec-guessing) for some useful tips.

Thanks for the pointers.

That solution seems too heavy for my case, though. As I am only expecting
UTF-8 and windows-1252, I will probably make do with the hack of just
removing the bytes (and I will now search the API for an easy way to throw
an exception if neither of these encodings is used).

thx
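
For reference, the byte-removal hack could look roughly like the sketch below. It drops every byte below 0x20 except tab, line feed and carriage return, which are the only control characters XML 1.0 allows. This is only safe under the assumption stated above: in UTF-8 and windows-1252 a byte in that range is never part of a longer character, which is not true for UTF-16, for example. The names ControlByteFilter and requireSupported are hypothetical, and the encoding check is a plain string comparison, not an existing API for this purpose.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;

// Hypothetical filter: removes control bytes that XML 1.0 forbids.
final class ControlByteFilter extends FilterInputStream {
    ControlByteFilter(InputStream in) {
        super(in);
    }

    // True for 0x00..0x1F except tab (0x09), LF (0x0A) and CR (0x0D).
    static boolean isForbidden(int b) {
        return b >= 0 && b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
    }

    // Rejects every declared encoding other than the two expected ones.
    static void requireSupported(String declared) throws UnsupportedEncodingException {
        if (!declared.equalsIgnoreCase("UTF-8")
                && !declared.equalsIgnoreCase("windows-1252")) {
            throw new UnsupportedEncodingException(declared);
        }
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();      // -1 at end of stream is passed through
        } while (isForbidden(b));
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) return n;
            int kept = 0;
            for (int i = 0; i < n; i++) {
                byte c = buf[off + i];
                if (!isForbidden(c & 0xFF)) {
                    buf[off + kept++] = c;   // compact the kept bytes in place
                }
            }
            if (kept > 0) return kept;
            // Everything in this chunk was filtered out, so read again.
        }
    }
}
```

The filtered stream can then be wrapped in an org.xml.sax.InputSource and handed to the SAX parser as usual. Note that the sketch does not adjust skip() or mark/reset, so those should not be relied on.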

