Java Forum / General / September 2005
XML configurable formatter/"Pretty Printer"
Chris - 28 Sep 2005 15:33 GMT
Please set your reader to fixed width for readability's sake...
I'm aware of the plethora of XML formatters (e.g. those that come with Xerces, JDOM, DOM4J), but I have a few special wants:
- To be able to drop attributes down to the next line if there are more than a certain number of them, or if they exceed a certain length.
- To be able to not only wrap text content but indent it at the current indentation level.
- To be able to set wrap and indentation levels for CDATA sections.
For example:
<one two="asdfghjasdfhjk" three="asdfhsajkdl" four="asdfasdfhjkl" five="asfahjkle"
  <six>content</six>
  <seven>content</seven>
  <eight>
    This content is longer than eight characters and will ....
    This content is longer than eight characters and will ....
    This content is longer than eight characters and will ....
  <eight>
  <nine>
    <![CDATA[
      This is CDATA content that I would like to be formatted as well.
      This is CDATA content that I would like to be formatted as well.
      This is CDATA content that I would like to be formatted as well.
    ]]>
  </nine>
<one>
I've googled for a couple of days and, while there are a lot of formatters out there, none seem to be as powerful as I'd like. Ideally there should be a formatter that allows you to set the position of every token (element, attribute, attribute text, text, CDATA etc.) in a nice configurable and extensible way. Maybe even different element formatting depending on its nesting level.
We have to look at a *lot* of XML that is produced elsewhere, and some of it abuses XML design so much it makes my eyes bleed. We already have a custom XML viewer/search tool (using DOM4J's Outputformatter to display), but we still have a lot of XML that is terrible to look at.
So two questions: is there anything Java based that can do this? I'm not expecting there to be, so that leads me to the next question: how would you go about implementing this? Custom SAX parser? Extend an already existing formatter? Use a parser generator? Remember I'm pretty much stuck with a Java solution.
Ideas?
Thanks for your time,
~Chris
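The first requirement in the post above (drop attributes to the next line once there are too many of them or the tag gets too long) can be sketched roughly as follows. This is only an illustration: the class name `AttributeLayout`, the method `startTag` and the thresholds `maxAttrs` and `maxWidth` are hypothetical names chosen for this sketch, and attribute values are assumed to be already escaped.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class AttributeLayout {
    // Returns a string of n spaces.
    static String spaces(int n) {
        char[] c = new char[n];
        Arrays.fill(c, ' ');
        return new String(c);
    }

    // Renders one attribute as name="value" (value assumed already escaped).
    static String attr(Map.Entry<String, String> e) {
        return e.getKey() + "=\"" + e.getValue() + "\"";
    }

    // Renders a start tag on one line if it has at most maxAttrs attributes
    // and fits in maxWidth columns; otherwise puts each attribute on its own
    // line, aligned under the first attribute.
    static String startTag(String name, Map<String, String> attrs,
                           int indent, int maxAttrs, int maxWidth) {
        StringBuilder oneLine = new StringBuilder(spaces(indent)).append('<').append(name);
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            oneLine.append(' ').append(attr(e));
        }
        oneLine.append('>');
        if (attrs.size() <= maxAttrs && oneLine.length() <= maxWidth) {
            return oneLine.toString();
        }

        // Column where the first attribute starts: indent + '<' + name + ' '.
        String attrPad = spaces(indent + name.length() + 2);
        StringBuilder sb = new StringBuilder(spaces(indent)).append('<').append(name);
        boolean first = true;
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sb.append(first ? " " : "\n" + attrPad).append(attr(e));
            first = false;
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<String, String>();
        attrs.put("two", "asdfghjasdfhjk");
        attrs.put("three", "asdfhsajkdl");
        attrs.put("four", "asdfasdfhjkl");
        attrs.put("five", "asfahjkle");
        // Four attributes exceed maxAttrs = 3, so each goes on its own line.
        System.out.println(startTag("one", attrs, 0, 3, 80));
    }
}
```

Keeping the decision in one small method means the two thresholds can later be read from whatever configuration the viewer already has.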
Oliver Wong - 28 Sep 2005 17:51 GMT
> Please set your reader to fixed width for readabilities sake...
> [quoted text clipped - 3 lines]
> - To be able to drop attributes down to the next line if there are more
> than a certain number of them, or if they exceed a certain length.
So far so good...
> - To be able to not only wrap text content but indent it at the current
> indentation level.
> - To be able to set wrap and indentation levels for CDATA sections.
Doesn't this change the content of the XML document, so that the "pretty-printed" document is no longer semantically equivalent to the original document?
- Oliver
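Oliver's point can be checked with a few lines of Java. The sketch below uses only the JDK's built-in DOM parser; the class name `WhitespaceDemo` and the helper `textOf` are made up for the illustration. Without a DTD the parser cannot tell which whitespace is ignorable, so the indentation added by a pretty printer ends up inside the element's text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WhitespaceDemo {
    // Parses the XML string and returns the text content of the root element.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "<eight>some content</eight>";
        String pretty = "<eight>\n    some content\n</eight>";

        // The wrapped and indented version carries extra whitespace in its text.
        System.out.println(textOf(original).equals(textOf(pretty)));        // false
        // Trimming recovers the original here, but only because this text has
        // no significant leading or trailing whitespace of its own.
        System.out.println(textOf(original).equals(textOf(pretty).trim())); // true
    }
}
```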
Chris - 28 Sep 2005 18:38 GMT
Yep, good point.
But as this utility is just formatting for human eyes, it doesn't matter much if whitespace is added. I already have something roughly equivalent to a "view original source" option in my XML viewer, if you wanted to see the exact placement. I guess that feature would be more of an XML renderer than a formatter.
Still, just being able to format how attributes are placed would be handy; I can live without formatting for text and CDATA sections. Some people give me XML that has dozens of attributes (ugh... I know), so placing them intelligently enough to be readable without scrolling would be a godsend.
Thanks,
~Chris
Oliver Wong - 28 Sep 2005 21:09 GMT
> Yep, good point.
> [quoted text clipped - 9 lines]
> placing them intelligently as to be able to read without scrolling
> would be a god send.
Have you considered using an XML viewer with word wrapping support?
If you just want an XML pretty printer which puts every attribute on its own line (instead of measuring the length of the attributes, and then putting them on a newline if they exceed 80 characters, for example), I'd imagine writing such a pretty printer would be "relatively" trivial.
Wouldn't you just be writing a SAX Parser that keeps an "indentation" variable (perhaps as an int), increments it on every "startElement" event, decrements it on every "endElement" event, and does some slightly clever iterating over the attributes within the startElement event?
- Oliver
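A minimal sketch of the approach Oliver describes, using the SAX API that ships with the JDK. The class name `AttrPerLinePrinter`, the two-space indent and the `format` helper are assumptions made for this illustration. It puts every attribute on its own line, trims and re-indents text (so it does change whitespace, as discussed above), and ignores comments and processing instructions; CDATA sections arrive as ordinary character data and are written back escaped.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttrPerLinePrinter extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private final StringBuilder text = new StringBuilder(); // pending character data
    private int indent = 0;                                 // current nesting depth

    // Appends two spaces per nesting level, plus `extra` more spaces.
    private void pad(int extra) {
        for (int i = 0; i < indent * 2 + extra; i++) out.append(' ');
    }

    // Writes any buffered text on its own line at the current depth.
    private void flushText() {
        String t = text.toString().trim();
        text.setLength(0);
        if (!t.isEmpty()) {
            pad(0);
            out.append(escape(t)).append('\n');
        }
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        flushText();
        pad(0);
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append('\n');
            pad(2); // attributes sit one step deeper than the tag itself
            out.append(atts.getQName(i)).append("=\"")
               .append(escape(atts.getValue(i))).append('"');
        }
        out.append(">\n");
        indent++;
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        flushText();
        indent--;
        pad(0);
        out.append("</").append(qName).append(">\n");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per text run
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    // Parses the given XML and returns the re-laid-out text.
    static String format(String xml) {
        try {
            AttrPerLinePrinter handler = new AttrPerLinePrinter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(format("<one two=\"a\" three=\"b\"><six>content</six></one>"));
    }
}
```

Measuring attribute lengths before deciding whether to break the line would go in the loop inside `startElement`; the rest of the handler would stay the same.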
Chris - 29 Sep 2005 05:53 GMT
"Have you considered using a XML Viewer with word wrapping support?"
The application I want to put it in *is* a specialized XML Viewer. :) Choosing a general one would negate several of my very specific requirements.
"Wouldn't you just be writing a SAX Parser that keeps an "indentation" variable..."
That's exactly what I'm doing, but another piece of software means another piece of software to maintain. I thought I'd do my due diligence on software reuse, but since there's nothing like what I'm looking for, it's off to Eclipse I go...
Roedy Green - 28 Sep 2005 19:41 GMT
>So two questions: is there anything Java based that can do this? I'm
>not expecting there to be so that leads me to the next question: how
>would you go about implementing this? Custom Sax parser? Extend an
>already existing formatter? Use a parser generator? Remember I'm
>pretty much stuck with a java solution.
You made a comment that suggested you might need a DTD or equivalent to implement what you consider acceptable formatting rules. Those schemas (is that the correct plural?) could be hard to come by. I think the original idea was that every XML document would have a corresponding DTD, but there are lots of supposedly XML documents with other schemas, many without any at all, and some, I gather, that could not even in theory have a DTD.
So ... it seems to me you have two options. Concoct a set of rules that can be implemented without a DTD, or use a tool such as Altova which, if memory serves, will generate a DTD for you from a document, perhaps with a little manual tweaking. http://www.altova.com/matrix_x.html
Altova Enterprise is very expensive. Perhaps it by itself would be sufficient.
If you are lucky, perhaps the documents you are dealing with all come with a single type of schema.
I enjoy this sort of coding. I could write you such a beast for around $300 US, price agreed in advance. You would give me your spec, and sample documents you had manually formatted to your taste to clarify the meaning of your spec. This would give you non-exclusive rights to the code.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Again taking new Java programming contracts.
Chris Smith - 28 Sep 2005 19:55 GMT
> schemas (is that the correct plural?)
Technically, it's supposed to be "schemata". That sounds silly, though, and everyone I've talked to says "schemas". Some dictionaries are even listing "schemas" as a valid plural now.
> I think the original idea was that every XML document would have a
> corresponding DTD, but there are lots of supposedly XML documents with
> other schemas and many without any at all and some I gather than could
> not even in theory have a DTD.
The only possible set of well-formed XML documents that truly could not have a DTD is one that allows an arbitrary element or attribute name on the document element. Aside from that, a DTD can be devised that includes any possible set of XML documents. For any non-trivial type of data, it would be impossible to precisely describe the set of correct documents using a DTD (or XML Schema, or anything else). Of course, the true challenge is to create a DTD that does not contain *too many* XML documents outside of the set, and/or that excludes certain likely errors.
In all cases, though, you have this relationship:
well-formed >= valid >= correct
That is, the set of well-formed XML documents is a superset of those that are valid in a particular instance, and the set of valid documents is a superset of those that are really correct. In almost all cases, you can replace "superset" with "strict superset" above.
 Signature www.designacourse.com The Easiest Way To Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer MindIQ Corporation
Oliver Wong - 28 Sep 2005 21:16 GMT
> The only possible set of well-formed XML documents that truly could not
> have a DTD is one that allows an arbitrary element or attribute name on
> [quoted text clipped - 5 lines]
> documents outside of the set, and/or that excludes certain likely
> errors.
Assuming one wants a DTD which exactly describes the acceptable set of XML documents (i.e. which contains *ZERO* XML documents outside of the set), I'd assume there's an infinite number of sets of XML documents which could not be described by DTD. I haven't actually done an analysis on the expressive power of DTDs, but I'd imagine that they are either as powerful as context-free grammars, or as powerful as Turing Machines, both of which have limitations on what languages (i.e. sets of documents) they can describe. (E.g. accept only the XML documents which, given some suitable encoding mechanism, represent programs/input pairs which will eventually halt).
If one allows the DTD to describe a set which is a superset of the desired set of legal XML documents, then one could always trivially write whatever DTD's equivalent is to "accept everything and anything".
- Oliver
Chris Smith - 28 Sep 2005 21:26 GMT
> Assuming one wants a DTD which exactly describes the acceptable set of
> XML documents (i.e. which contains *ZERO* XML documents outside of the set),
> I'd assume there's an infinite number of sets of XML documents which could
> not be described by DTD.
Yes, that's what I said.
> If one allows the DTD to describe a set which is a superset of the
> desired set of legal XML documents, then one could always trivially write
> whatever DTD's equivalent is to "accept everything and anything".
Yes, that's what I said, as well... except that there are certain restrictions to what can be said in a DTD. The document (root) element must match a specific tag name in the DTD, so at least one tag in the document -- the root element -- must be described in the DTD in order for the document to validate against the DTD.
Oliver Wong - 28 Sep 2005 21:39 GMT
>> Assuming one wants a DTD which exactly describes the acceptable set
>> of
> [quoted text clipped - 5 lines]
>
> Yes, that's what I said.
Sorry, I didn't read your message carefully enough. In particular, I misread your statement "a DTD can be devised that includes any possible set of XML documents" to mean "a DTD can be devised that describes any possible set of XML documents."
To address the OP's original concern though, I don't think DTDs even enter into the picture for what his/her concerns are, given that (s)he doesn't even care if the pretty-printed document doesn't have the same semantics as the original.
- Oliver
Jaakko Kangasharju - 29 Sep 2005 07:54 GMT
> Assuming one wants a DTD which exactly describes the acceptable
> set of XML documents (i.e. which contains *ZERO* XML documents
> outside of the set), I'd assume there's an infinite number of sets
> of XML documents which could not be described by DTD.
You're right, and this has been known for over a hundred years[1] :)
> I haven't actually done an analysis on the expressive power of DTDs,
> but I'd imagine that they are either as powerful as context-free
> grammars, or as powerful as Turing Machines
DTDs actually form a subset of what is called regular tree grammars. The "regular" doesn't quite mean the same as with strings; regular tree grammars resemble context-free languages more. The paper at http://www.mulberrytech.com/Extreme/Proceedings/html/2001/Murata01/EML2001Murata01.html by Murata et al. classifies different XML schema languages according to their expressive power.
[1] It follows directly from the fact that a set is smaller than its power set.
 Signature Jaakko Kangasharju, Helsinki Institute for Information Technology () ASCII RIBBON CAMPAIGN /\ AGAINST HTML MAIL
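The counting argument behind footnote [1] can be spelled out as below. This is a sketch of the standard cardinality argument, not something taken from the thread; the symbols $D$ (all well-formed XML documents) and $T$ (all DTDs) are introduced here for the illustration.

```latex
Let $D$ be the set of well-formed XML documents and $T$ the set of DTDs.
Both consist of finite strings over a finite alphabet, so both are countable,
and $D$ is infinite. By Cantor's theorem a set is strictly smaller than its
power set, hence
\[
  |T| \le \aleph_0 = |D| < 2^{\aleph_0} = |\mathcal{P}(D)| .
\]
Each DTD validates exactly one subset of $D$, so at most countably many of the
$2^{\aleph_0}$ subsets of $D$ are the valid set of some DTD. The remaining
subsets, which are uncountably many, cannot be described exactly by any DTD.
```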
Roedy Green - 28 Sep 2005 22:40 GMT
>The only possible set of well-formed XML documents that truly could not
>have a DTD is one that allows an arbitrary element or attribute name on
>the document element.
e.g. ant where you can concoct your own tasks.
Chris - 28 Sep 2005 20:03 GMT
Roedy,
I already have XMLSpy, as it's pretty useful for writing schemas, but I'm not sure I get what you are talking about when you suggest DTD.
Yes, DTDs and XML schemas (XSDs) do set the format of an XML document in that they specify the order, type, constraints and plurality of elements and attributes. But they have no bearing whatsoever on how an XML file is textually laid out. Perhaps "format" is the wrong word, but that's why I put "pretty print" in the subject. To be more clear, I want a tool that will take:
<one two="asdfghjasdfhjk" three="asdfhsajkdl" four="asdfasdfhjkl" five="asfahjkle"> <six>content</six> <seven>content</seven> <eight> This content is longer than eight characters and will .... This content is longer than eight characters and will .... This content is longer than eight characters and will .... <eight> <nine> <![CDATA[ This is CDATA content that I would like to be formatted as well. This is CDATA content that I would like to be formatted as well. This is CDATA content that I would like to be formatted as well. ]]> </nine><one>
And display it as laid out in my original post. Both examples would be validated by the same DTD or schema (though I may have missed a bracket when pasting, so they probably aren't valid here), but the difference in readability is striking.
Thanks,
~Chris