Java Forum / General / October 2007
Convert HTML to XML
earth_792 - 23 Oct 2007 03:55 GMT Hello, All!
Does anyone have any ideas on how to convert HTML into XML using Java? I know there is a Java API that goes the opposite way (XML to XHTML). Any ideas or links would be helpful. :)
Thank you
Andrew Thompson - 23 Oct 2007 04:27 GMT ...
>Does anyone have any ideas how to convert Html into XML by using >Java? I know there is java API that do opposite way(XML to XHTML). XHTML != HTML
>Any ideas or links would be helpful. :) XSLT (shrugs) <http://www.google.com/search?q=xslt>
>Thank you No worries.
 Signature Andrew Thompson http://www.athompson.info/andrew/
gray - 23 Oct 2007 13:16 GMT fyi. http://java-source.net/open-source/html-parsers
> Hello, All! > [quoted text clipped - 3 lines] > > Thank you
Andy Dingley - 23 Oct 2007 13:48 GMT > Does anyone have any ideas how to convert Html into XML by using > Java? This depends on what you mean by "HTML". If it's guaranteed to be well-formed and valid, then it's a simple matter - use an SGML or HTML parser, then output the DOM as XML.
If it's "typical" HTML "tag soup", then this is fundamentally a much more difficult task. You can't convert with a simple automatic process, at times you have to infer "what the author meant" rather than "what they wrote". I suggest reading up on HTML Tidy, which isn't (AFAIK) ported to Java, but does discuss the problems and their solutions.
If you're trying to embed HTML in RSS (which is usually an XML protocol) or similar, then you don't even need to "convert HTML to XML"; you just need to escape the relevant characters (such as "<" and ">") as entities, or wrap the markup in a CDATA section. That's _much_ easier: you don't even need an HTML parser, just a simple character-by-character scan and replace.
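As a rough illustration of the scan-and-replace approach described above, here is a minimal Java sketch. The class and method names (XmlTextEscaper, escape, wrapInCdata) are made up for this example and do not come from any library. It handles only the markup-significant characters and assumes the input contains no characters that are illegal in XML (such as most control characters).

```java
/** Minimal sketch: make an HTML fragment safe to embed as text inside an XML element. */
final class XmlTextEscaper {
    private XmlTextEscaper() {}

    /** Character-by-character scan that replaces &, <, > and " with XML entity references. */
    static String escape(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Alternative: wrap the fragment in a CDATA section. A literal "]]>" in the
     * input would end the section early, so it is split across two sections.
     */
    static String wrapInCdata(String html) {
        return "<![CDATA[" + html.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        String fragment = "<p>Tom & Jerry</p>";
        System.out.println("<description>" + escape(fragment) + "</description>");
        System.out.println("<description>" + wrapInCdata(fragment) + "</description>");
    }
}
```

Either form lets an XML parser hand the original HTML string back to the feed reader unchanged. Which one to use is mostly a matter of taste.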
On the whole though, I can't imagine many cases when it really is necessary to "convert HTML to XML". Just about the only one is loading legacy web sites into a new XML-based CMS.
If you give us more context, then you might get more relevant advice.
earth_792 - 23 Oct 2007 14:20 GMT > > Does anyone have any ideas how to convert Html into XML by using > > Java? [quoted text clipped - 21 lines] > > If you give us more context, then you might get more relevant advice. ********************** I just want to say "Thank you very much" to all of you for replying to my post. Now I understand what I should do. My initial problem is to convert legacy HTML (not well-formed, not valid) into new HTML (valid, with a new presentation). I don't want to cut and paste content from the legacy pages into the new ones; there are thousands of pages. So I thought I could convert the HTML into XML and then use XSLT to convert it back into new HTML. :))
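The XSLT half of that plan can be sketched with the JDK's built-in javax.xml.transform API. This is only an illustration under several assumptions: the legacy page has already been cleaned up into well-formed markup (for example by Tidy), the elements are not in the XHTML namespace (a namespaced document would need a prefixed match pattern such as x:center), and the stylesheet rule shown here (turning center into a div with a "centered" class) is an invented example of a presentation change. The class name LegacyPageRestyler is hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Minimal sketch: restyle already well-formed markup with an XSLT stylesheet. */
final class LegacyPageRestyler {
    private LegacyPageRestyler() {}

    // Hypothetical stylesheet: copy everything unchanged, except that
    // <center> becomes <div class="centered">. Output is serialized as HTML.
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='center'>"
      + "<div class='centered'><xsl:apply-templates select='node()'/></div>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to a well-formed page and returns the new HTML. */
    static String restyle(String wellFormedPage) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(wellFormedPage)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Thrown for a broken stylesheet or for input that is not well-formed XML.
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><center>Hello</center></body></html>";
        System.out.println(restyle(page));
    }
}
```

For thousands of pages it would probably be worth compiling the stylesheet once with TransformerFactory.newTemplates and reusing it, rather than rebuilding the Transformer for every file.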
Daniel Pitts - 23 Oct 2007 17:45 GMT >>> Does anyone have any ideas how to convert Html into XML by using >>> Java? [quoted text clipped - 29 lines] > I can convert HTML into XML and then use XSLT to convert back to a new > HTML. :)) Look into Tidy, it is a program (there is a Java interface to it too if you don't want to use the command line). It will reformat HTML into well-formed HTML. Modern HTML (aka XHTML) *is* XML. So you don't need to convert it to XML and then back to XHTML.
Hope this helps, Daniel.
 Signature Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
Sherman Pendley - 24 Oct 2007 03:38 GMT > Look into Tidy, it is a program (there is a Java interface to it too > if you don't want to use the command line). It will reformat HTML > into well-formed HTML. Modern HTML (aka XHTML) *is* XML. So you don't > need to convert it to XML and then back to XHTML. Agreed about Tidy.
The final output format should be HTML though, not XHTML. XHTML will not render at all in IE6/7 when served correctly as application/xhtml+xml. IE will render it when served as text/html, but uses its HTML engine to do so. That being the case, it's better to give it valid HTML to work with than to give it XHTML that relies on the HTML engine's error handling to parse correctly.
sherm--
 Signature Web Hosting by West Virginians, for West Virginians: http://wv-www.net Cocoa programming in Perl: http://camelbones.sourceforge.net
Daniel Pitts - 24 Oct 2007 05:08 GMT >> Look into Tidy, it is a program (there is a Java interface to it too >> if you don't want to use the command line). It will reformat HTML [quoted text clipped - 11 lines] > > sherm-- Um, what are you talking about? XHTML *is* valid HTML. If you have to lie about the content type, that's one thing, but XHTML should be used going forward. Non-XML HTML has been deprecated, and the sooner browser writers and content providers realize this, the better the world will be.
 Signature Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
Sherman Pendley - 24 Oct 2007 07:06 GMT >>> Look into Tidy, it is a program (there is a Java interface to it too >>> if you don't want to use the command line). It will reformat HTML [quoted text clipped - 13 lines] >> > Um, what are you talking about? XHTML *is* valid HTML. Not at all. XHTML is an XML application. HTML is an SGML application. The two are not the same. For instance, this is valid XHTML, but not valid HTML:
<img src="foo.jpg" />
> If you have to > lie about the content type, thats one thing, but XHTML should be used > going forward. The fact that you have to lie about the content type is what makes XHTML unusable for the WWW. You're delivering it as HTML, and it will be parsed as such. Name spaces will not be parsed, and short-tag forms such as the img example above will be handled as slightly-broken HTML, not as short form XML tags.
In other words, IE6 & IE7 don't see XHTML - they see HTML with a few funny extra slashes here and there. That being the case, why not simply deliver the HTML correctly, without the XHTML baggage to begin with?
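The XML side of this distinction is easy to check from Java. The sketch below uses the JDK's standard DOM parser to show that the self-closing img form is well-formed XML while the bare HTML-style form is not. It says nothing about HTML validity, which would need an SGML-aware validator. The class name WellFormedCheck is made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Minimal sketch: test whether a markup string is well-formed XML. */
final class WellFormedCheck {
    private WellFormedCheck() {}

    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the input as not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedXml("<p><img src=\"foo.jpg\" /></p>"));
        System.out.println(isWellFormedXml("<p><img src=\"foo.jpg\"></p>"));
    }
}
```

The first call reports true and the second false, because an XML parser requires every element to be closed explicitly.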
> non-XML HTML has been deprecated Nonsense. The W3C's HTML Work Group was resurrected, and the effort to standardize HTML 5 was started earlier this year:
<http://www.w3.org/html/>
As explained in the "why" link, XHTML was a nice idea, but it didn't pan out in practice because of dismal browser support.
>, and the sooner > browser writers and content providers realize this, the better the > world will be. Sometimes the latest hot ticket just doesn't work out. No sense getting religious about it - just move on.
sherm--
 Signature Web Hosting by West Virginians, for West Virginians: http://wv-www.net Cocoa programming in Perl: http://camelbones.sourceforge.net
Daniel Pitts - 24 Oct 2007 17:28 GMT >>>> Look into Tidy, it is a program (there is a Java interface to it too >>>> if you don't want to use the command line). It will reformat HTML [quoted text clipped - 17 lines] > > <img src="foo.jpg" /> Are you sure that's not valid HTML? XML is a subset of SGML, and I would think that <shortForm /> was valid SGML as well. <br /> is valid HTML.
>> If you have to >> lie about the content type, thats one thing, but XHTML should be used [quoted text clipped - 5 lines] > img example above will be handled as slightly-broken HTML, not as short > form XML tags. It's called deprecation. Tell your users that they need the latest browsers to see your site. I know that isn't always possible, but you can say at a certain point that you're no longer supporting Mosaic and Netscape 3 :-)
> In other words, IE6 & IE7 don't see XHTML - they see HTML with a few funny > extra slashes here and there. That being the case, why not simply deliver > the HTML correctly, without the XHTML baggage to begin with? Because you gain so much with using XHTML, including the fact that many popular JavaScript libraries require XHTML-strict to work properly. Next you're going to tell me that you shouldn't use CSS.
>> non-XML HTML has been deprecated > [quoted text clipped - 14 lines] > > sherm--
 Signature Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
Lew - 25 Oct 2007 01:01 GMT Sherman Pendley wrote:
>> <img src="foo.jpg" />
> Are you sure that's not valid HTML? XML is a subset of SGML, and I > would think that <shortForm /> was valid SGML as well. <br /> is valid > HTML. I use the short form for all my img, br, input and similar tags, and it works just fine on every browser I've tried.
 Signature Lew
Sherman Pendley - 25 Oct 2007 04:59 GMT > Sherman Pendley wrote: >>> <img src="foo.jpg" /> [quoted text clipped - 5 lines] > I use the short form for all my img, br, input and similar tags, and > it works just fine on every browser I've tried. The short form is not valid HTML; it is not allowed according to the HTML specifications found at <http://w3c.org>.
If you'll look at the definition of the term "valid" there, I don't think you'll find the phrase "works just fine on every browser Lew tried." :-)
sherm--
 Signature Web Hosting by West Virginians, for West Virginians: http://wv-www.net Cocoa programming in Perl: http://camelbones.sourceforge.net
Lew - 25 Oct 2007 05:38 GMT > If you'll look at the definition of the term "valid" there, I don't think > you'll find the phrase "works just fine on every browser Lew tried." :-) Drat. Now I may have to go and actually learn something.
 Signature Lew
Sherman Pendley - 25 Oct 2007 04:51 GMT >>> Um, what are you talking about? XHTML *is* valid HTML. >> [quoted text clipped - 4 lines] > > Are you sure that's not valid HTML? Certain. Look it up: <http://w3c.org>
> It's called deprecation. Tell your users that they need the latest > browsers to see your site. No. Why would I do such a stupid thing as that?
>> In other words, IE6 & IE7 don't see XHTML - they see HTML with a few funny >> extra slashes here and there. That being the case, why not simply deliver [quoted text clipped - 3 lines] > many popular JavaScript libraries require XHTML-strict to work > properly. Is that meant to be a joke?
> Next you're going to tell me that you shouldn't use CSS. Um - why would I tell you that?
sherm--
 Signature Web Hosting by West Virginians, for West Virginians: http://wv-www.net Cocoa programming in Perl: http://camelbones.sourceforge.net
Daniel Pitts - 25 Oct 2007 05:11 GMT >>>> Um, what are you talking about? XHTML *is* valid HTML. >>> Not at all. XHTML is an XML application. HTML is an SGML application. The [quoted text clipped - 4 lines] > > Certain. Look it up: <http://w3c.org> That's a rather large site to look up the information that says <tag /> is invalid. How about pointing me to at least the right section, eh?
As an aside, I did find this interesting. <http://www.w3.org/QA/2007/10/shorttags.html> Apparently there are some shortcuts available to HTML users that aren't for XML users. For example '<p<a href="/">Some Link</> some text' is supposedly equivalent to <p><a href="/">Some Link</a> some text
>> It's called deprecation. Tell your users that they need the latest >> browsers to see your site. > > No. Why would I do such a stupid thing as that? Same reason people don't write Java 1.2 code anymore. If your content is valuable enough, people will upgrade for it.
>>> In other words, IE6 & IE7 don't see XHTML - they see HTML with a few funny >>> extra slashes here and there. That being the case, why not simply deliver [quoted text clipped - 10 lines] > > sherm--
 Signature Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
Lasse Reichstein Nielsen - 25 Oct 2007 08:22 GMT >> The >> two are not the same. For instance, this is valid XHTML, but not valid HTML: >> <img src="foo.jpg" />
> Are you sure that's not valid HTML? XML is a subset of SGML, and I > would think that <shortForm /> was valid SGML as well. <br /> is > valid HTML. It is (part of) valid HTML, because HTML, as an SGML application, has the SHORTTAG feature enabled (<URL:http://www.w3.org/TR/html401/sgml/sgmldecl.html>, notice "SHORTTAG YES").
However, what it means is probably not what you think it means. Using shorttag notation, these two paragraphs are equivalent: <p>this is a test</p> <p/this is a test/ Writing <p/> means that the closing ">" is not part of the tag, but part of the text content! Luckily no widely used browser understands shorttags on elements. <URL:http://www.w3.org/TR/html401/appendix/notes.html#h-B.3.7>
I'm actually not sure you can use shorttag with "br" elements, as they are empty and can have no end tag. I guess the validator could tell, but I'll just avoid it.
> It's called deprecation. Tell your users that they need the latest > browsers to see your site. I know that isn't always possible, but you > can say at a certain point that you're no longer supporting Mosaic and > Netscape 3 :-) Yes. This problem should be over when we no longer have to support IE 6. Or is it a problem for IE 7 too? (there is a proper solution for IE though, <URL:http://www.w3.org/MarkUp/2004/xhtml-faq#ie>)
> Because you gain so much with using XHTML, including the fact that > many popular JavaScript libraries require XHTML-strict to work > properly. Javascript libraries work with the DOM structure. They don't care about the syntax of the page markup. That's entirely up to the browser to parse. If you were right, and since IE 6 understands XHTML as malformed HTML anyway, then those libraries shouldn't work in IE6 anyway.
If they use Ajax to request further content, then it's fine to send XML, but it doesn't have to be XHTML at all, and probably shouldn't.
A lot of JavaScript won't work with properly sent XHTML, because using an XML parser precludes using the document.write feature.
> Next you're going to tell me that you shouldn't use CSS. Well, IE 6 still doesn't support most of CSS2, which has been a standard since 1998. If you need to support IE6, and you do, then you'll have to make it work with only the supported subset.
/L
 Signature Lasse Reichstein Nielsen - lrn@hotpop.com DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html> 'Faith without judgement merely degrades the spirit divine.'
Martin Gregorie - 23 Oct 2007 19:27 GMT > I suggest reading up on HTML Tidy, which isn't > (AFAIK) ported to Java, but does discuss the problems and their > solutions. I agree HTMLtidy http://tidy.sourceforge.net/ is a great tool. I won't create a web page without using it.
There is a Java port, Jtidy http://sourceforge.net/projects/jtidy I haven't used this version, but I notice that the project provides both the stand-alone Jtidy utility and a DOM parser for HTML.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Hunter Gratzner - 23 Oct 2007 17:11 GMT > Does anyone have any ideas how to convert Html into XML by using > Java? I know there is java API that do opposite way(XML to XHTML). > Any ideas or links would be helpful. :) XHTML is XML. So any conversion HTML -> XHTML fulfills your requirement. Try jtidy.
Roedy Green - 23 Oct 2007 20:34 GMT On Tue, 23 Oct 2007 02:55:25 -0000, earth_792 <mike_nguyen4@hotmail.com> wrote, quoted or indirectly quoted someone who said :
>Does anyone have any ideas how to convert Html into XML by using >Java? I know there is java API that do opposite way(XML to XHTML). >Any ideas or links would be helpful. :) There is a program called HTMLTidy that converts HTML to XHTML.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com