...

Signature
Andrew Thompson
http://www.athompson.info/andrew/
Andrew Thompson <u32984@uwe> wrote in <7b19c9fce68f6@uwe>:
>> I want to convert a pdf file to a xml which has not
>> only the text
[quoted text clipped - 4 lines]
> objects *as opposed to* representation/rendering of that
> data - 'how it will look'.
Um, I believe that's incorrect. One of XML's design goals, as
stated in XML 1.0 (Fourth Edition), section 1.1, is:
XML shall support a wide variety of applications.
XML, essentially, is pure syntax with no semantics. What the
semantics of a particular XML application are meant to
express is up to the people who designed that
application. It might be semantic document markup
(DocBook), presentational document markup (XSL-FO),
something that would take a Semantic Web zealot[*] to
describe properly (RDF/XML), a program in a Turing-complete
programming language (XSLT), remote method invocation
(SOAP), vector graphics (SVG) or pretty much anything else
that could be represented as a tree.
> Rendering instructions for the data in XML might be
> encoded into an XSLT File.
You're probably thinking of XSL-FO (which is an XML
application itself). XSLT's problem domain is a bit wider:
it's a fairly powerful document transformation language,
although, indeed, one of the primary use cases considered
when the spec was being developed was transformation of
arbitrary XML documents to XSL-FO. Oh, and XSLT is an XML
application as well. *shrug*
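To illustrate that XSLT is a general transformation language
and not tied to rendering, here is a minimal sketch that runs
a stylesheet from Java using the JDK's javax.xml.transform
API. The stylesheet, the <page> wrapper element and the class
name XsltSketch are assumptions made up for this example; the
output here is plain text, not XSL-FO.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

class XsltSketch {
    // Hypothetical stylesheet: emits "fontsize: text" for every <line> under <page>.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/page'>"
      + "<xsl:for-each select='line'>"
      + "<xsl:value-of select='@fontsize'/><xsl:text>: </xsl:text>"
      + "<xsl:value-of select='normalize-space(.)'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies STYLESHEET to the given XML string and returns the result as text.
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<page>"
                   + "<line fontsize='12'>a line</line>"
                   + "<line fontsize='10'>another line</line>"
                   + "</page>";
        System.out.print(transform(xml));
    }
}
```

Swapping the stylesheet for one that emits fo:block elements
would give the XML-to-XSL-FO use case mentioned above; the
Java side would stay the same.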
> Trying to shove a WYSIWYG document into 'XML' seems
> contrary to what XML is for.
XSL-FO is an XML application that is, in short,
presentational markup for paged media. Apache FOP is a FOSS
XSL-FO processor oft-used to convert XSL-FO documents
(usually generated from something else) to PDF documents. I
haven't heard of anyone doing it the other way 'round
(PDF -> XSL-FO), but Google might have.
[*] meaning no disrespect to Semantic Web zealots

Signature
...also, I submit that we all must honourably commit seppuku
right now rather than serve the Dark Side by producing the
HTML 5 spec.
I want to get the text's layout information to analyse the PDF
file. The XML is just a way of expressing it.
For example, I want to extract the text of a PDF file like this:
<line fontname="..." fontsize=" " startx=" " starty=" " endx
endy ....>
a line.
</line>
Andrew Thompson
> ..
> > I want to convert a pdf file to a xml which has not only the text
[quoted text clipped - 17 lines]
> Message posted via JavaKB.com
> http://www.javakb.com/Uwe/Forums.aspx/java-general/200711/1
Andrew Thompson - 12 Nov 2007 14:14 GMT
>I want ..
Please refrain from top-posting.

Wildemar Wildenburger - 12 Nov 2007 16:05 GMT
> I want to get the text's layout information to analyse the pdf
> file. The xml is just a way of express it.
[quoted text clipped - 3 lines]
> a line.
> </line>
If that is all you want to do, I would suggest you don't bother with the
XML at all. You can do an analysis of the properties of the text just as
well (heck, better?) using plain Java data structures. Right?
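A rough sketch of what that could look like, assuming the
per-line values have already been pulled out of the PDF by
whatever extraction library you end up using (that step is not
shown). The names TextLine, LayoutAnalysis and byFontSize are
invented for this example; the fields mirror the attributes in
the <line> example quoted above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class LayoutAnalysis {
    // Hypothetical value type: one extracted line of text plus its layout data.
    record TextLine(String text, String fontName, double fontSize,
                    double startX, double startY, double endX, double endY) {
        // Horizontal extent of the line in page units.
        double width() {
            return endX - startX;
        }
    }

    // Groups lines by font size (sorted ascending), e.g. to separate headings from body text.
    static Map<Double, List<TextLine>> byFontSize(List<TextLine> lines) {
        return lines.stream()
            .collect(Collectors.groupingBy(TextLine::fontSize,
                                           TreeMap::new,
                                           Collectors.toList()));
    }

    public static void main(String[] args) {
        // Hand-written sample data standing in for real extraction results.
        List<TextLine> lines = List.of(
            new TextLine("A Title", "Helvetica-Bold", 18, 72, 720, 300, 720),
            new TextLine("a line", "Times-Roman", 12, 72, 700, 200, 700),
            new TextLine("another line", "Times-Roman", 12, 72, 686, 260, 686));

        byFontSize(lines).forEach((size, group) ->
            System.out.println(size + "pt: " + group.size() + " line(s)"));
    }
}
```

If the data does need to be stored later, the same TextLine
objects can still be written out as XML (or anything else) in
a separate step.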
/W
> ..
>
[quoted text clipped - 7 lines]
>
>I just want to get the text of a pdf file and the layout information. The xml file is just a way of storing it.
A possible result may be:
<line number="1" fontname="" fontsize="" startx="" ....>
a line
</line>
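For the storage side, a minimal sketch of writing one such
<line> element with the JDK's StAX writer
(javax.xml.stream.XMLStreamWriter), which takes care of
escaping. The class name LineXmlWriter, the method toXml and
the choice of attributes are assumptions for illustration; the
remaining attributes elided in the example above would be
added the same way. How the values are read from the PDF is
not covered here.

```java
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.io.StringWriter;

class LineXmlWriter {
    // Serializes one line of text and a few layout attributes as a <line> element.
    static String toXml(int number, String fontName, double fontSize,
                        double startX, double startY, String text) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("line");
            writer.writeAttribute("number", Integer.toString(number));
            writer.writeAttribute("fontname", fontName);
            writer.writeAttribute("fontsize", Double.toString(fontSize));
            writer.writeAttribute("startx", Double.toString(startX));
            writer.writeAttribute("starty", Double.toString(startY));
            // writeCharacters escapes markup characters such as '<' and '&'.
            writer.writeCharacters(text);
            writer.writeEndElement();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write <line> element", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml(1, "Times-Roman", 12, 72, 700, "a line"));
    }
}
```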