Java Forum / General / January 2008
Is there a MS Office to PDF conversion library
Eeby - 14 Jan 2008 13:31 GMT My boss asked me to research this question. He wants me to write a script / program that will convert a directory of Word, Excel, and PowerPoint documents into PDFs. I did some Google searching and found this:
http://www.activepdf.com/
However it is Windows-only. We have Linux servers.
Does anyone know of a Java library that I could use for this? Or any library in any language? PHP? Perl?
Any help or advice would be greatly appreciated.
Thanks,
E
Thomas Kellerer - 14 Jan 2008 13:50 GMT Eeby, 14.01.2008 14:31:
> My boss asked me to research this question. He wants me to write a > script / program that will convert a directory of Word, Excel, and [quoted text clipped - 9 lines] > > Any help or advice would be greatly appreciated. OpenOffice can generate PDF, can read MS Office and has an integration with Java. Maybe that could be a way for you.
Thomas
AL - 15 Jan 2008 05:08 GMT > Eeby, 14.01.2008 14:31: >> My boss asked me to research this question. He wants me to write a [quoted text clipped - 15 lines] > > Thomas Thomas,
I'm way in over my head here, but is this what you are referring to? http://codesnippets.services.openoffice.org/Office/Office.ConvertDocuments.snip
The examples seem to indicate user input to select a document for conversion but it appears to me a file list from a directory could be used to feed the conversion thereby converting all the files in a given directory to PDF. Unfortunately a couple links referenced in the snippets were invalid and one method was deprecated - I think it was newfile.toURL()
Anyway, my eyes are burning and I still have a long way to go to make sense of it all, but this seemed like the direction you were pointing the OP. (?)
AL
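A note on the deprecated newfile.toURL() mentioned above: in Java 6, File.toURL() is deprecated in favour of file.toURI().toURL(), which escapes characters such as spaces. Below is a minimal sketch of turning a directory listing into file URLs that could then be fed to a converter. The class name, method names and extension list are illustrative assumptions, not taken from the OpenOffice snippet, and the sketch stops short of calling the OpenOffice API itself.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: collects Office documents in a directory as file URLs.
public class OfficeFileLister {

    // Extensions treated as Office documents (an assumption for this sketch).
    private static final List<String> EXTENSIONS =
            Arrays.asList(".doc", ".xls", ".ppt");

    // Returns true if the file name ends with one of the extensions above.
    static boolean isOfficeFile(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        for (String ext : EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    // Builds a URL string via toURI().toURL() instead of the deprecated File.toURL().
    static String toFileUrl(File file) {
        try {
            return file.toURI().toURL().toString();
        } catch (MalformedURLException e) {
            throw new IllegalStateException("Cannot build URL for " + file, e);
        }
    }

    // Lists the Office documents directly inside dir as URL strings.
    static List<String> listOfficeUrls(File dir) {
        List<String> urls = new ArrayList<String>();
        File[] children = dir.listFiles();
        if (children == null) {
            return urls; // not a directory, or it could not be read
        }
        Arrays.sort(children); // predictable order
        for (File child : children) {
            if (child.isFile() && isOfficeFile(child.getName())) {
                urls.add(toFileUrl(child));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String url : listOfficeUrls(dir)) {
            System.out.println(url);
        }
    }
}
```

Run it with a directory path as the only argument; each printed URL is the kind of string a loader expecting file URLs would take.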
Andrew Thompson - 14 Jan 2008 14:23 GMT > My boss asked me to research this question. Did you dare to ask the point of this exercise? (Or are you just taking the money?*)
>..He wants me to write a > script / program that will convert a directory of Word, Excel, and > PowerPoint documents into PDFs.
That seems relatively pointless and stupid.
- About all that PDFs are good for is page layout.
- Few 'something else' -> PDF converters will do any intelligent thing with the page layout in the conversion process.
- If you can get a program that parses and reads the documents, you might as well just dump them direct to printer, without the file clutter of ever creating the PDF.
Which brings me back to..
What is the point of this exercise? (* And no - you ain't payin' me enough for me to 'settle for the money - no questions asked'.)
-- Andrew T. PhySci.org
AL - 14 Jan 2008 18:39 GMT >> My boss asked me to research this question.
> Did you dare to ask the point of this exercise? > (Or are you just taking the money?*) I'm curious why you would consider this any of your business? Maybe the OP's boss is one of those guys who is always thinking and wondering about stuff like, "gee, I wonder if there's a way to..., hey OP, how 'bout checking something out for me..." Once upon a time I had a boss like that and the diversity of assignments was incredibly satisfying, and educational. So, I guess *your* response would be, "go to hell, you don't pay me enough to do that crap without a 30 page RFI..." Oh, what a stellar employee you must be.
>> ..He wants me to write a >> script / program that will convert a directory of Word, Excel, and >> PowerPoint documents into PDFs.
> That seems relatively pointless and stupid. As does your response...
> - About all that PDFs are good for is page layout. What about sharing documents with others without having to consider which version of MS Office they may be running or whether they even have Office running or whether their version of Open Office can read the newest Word document? What if the "boss" is planning to publish these documents on a website - wouldn't PDF be a preferred format for downloading?
http://www.adobe.com/products/acrobat/adobepdf.html
> - Few 'something else' -> PDF converers will do > any intelligent thing with the page layout in > the conversion process. The OP didn't indicate that "any intelligent thing" was required - just conversion.
> - If you can get a program that parses and reads > the documents, you might as well just dump them > direct to printer, without the file clutter of > ever creating the PDF. The OP didn't indicate printing to be the primary objective.
> Which brings me back to.. > > What is the point of this exercise? > (* And no - you ain't payin' me enough for me > to 'settle for the money - no questions asked'.) Which leads me to wonder, what was the point of your response???
It may be that the OP asked a legitimate question you didn't have a clue how to answer (intelligently) so you chose to slap them around. Once upon a time I had a boss like that too - we had a name for him, bet I can guess yours...
AL
Martin Gregorie - 14 Jan 2008 22:17 GMT > What if the "boss" is planning to publish these > documents on a website - wouldn't PDF be a preferred format for > downloading? No, not unless there's a requirement to make the document somewhat unmodifiable: even a PDF can be cracked into and changed if you're determined enough.
HTML is better. It's smaller and faster to load, even if it's MS Office generated HTML. Save the same document as a PDF and as HTML. Compare the file sizes with each other and with the original MS Office document. HTML < MS Office doc < PDF.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Joshua Cranmer - 14 Jan 2008 23:15 GMT >> What if the "boss" is planning to publish these documents on a website >> - wouldn't PDF be a preferred format for downloading? [quoted text clipped - 7 lines] > file sizes with each other and with the original MS Office document. > HTML < MS Office doc < PDF. PDF is an extremely rigid, final-proof-centric format. HTML is extremely loose and, even taking into account CSS through all current WDs (and thus exiting the world of even niche-browser support), resistant to certain concepts like pagination and final format designs.
 Signature Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
Andrew Thompson - 14 Jan 2008 23:30 GMT > PDF is an extremely rigid, final-proof-centric format. HTML is extremely > loose and, even taking into account CSS through all current WDs ...
WD? That's a new one on me!
War Department? Word Disparity? Will Dated? ..What?
-- Andrew T. PhySci.org
Eeby - 14 Jan 2008 23:37 GMT Thanks for the replies. That's very helpful. The reason I'm asked to research PDF conversion: the organization I work for posts documents on its website in MS Office formats. Management would like to post PDFs instead.
E
AL - 14 Jan 2008 23:41 GMT > Thanks for the replies. That's very helpful. The reason I'm asked to > research PDF conversion: the organization I work for posts documents > on its website in MS Office formats. Management would like to post > PDFs instead. > > E FWIW, I agree with management.
AL
Andrew Thompson - 14 Jan 2008 23:55 GMT > Thanks for the replies. That's very helpful. The reason I'm asked to > research PDF conversion: the organization I work for posts documents > on its website in MS Office formats. Management would like to post > PDFs instead. That does not explain *why*.
Why would management prefer to put PDFs (which are higher bandwidth than the equivalent MS Doc.) on the site?
-- Andrew T. PhySci.org
AL - 15 Jan 2008 00:04 GMT >> Thanks for the replies. That's very helpful. The reason I'm asked to >> research PDF conversion: the organization I work for posts documents >> on its website in MS Office formats. Management would like to post >> PDFs instead.
> That does not explain *why*. Just for grins & giggles consider that outside the OP's realm of responsibility & authority and explain *how*.
AL
Steve Sobol - 15 Jan 2008 00:31 GMT > Why would management prefer to put PDFs > (which are higher bandwidth than the > equivalent MS Doc.) on the site? So people without Microsoft Office can read them. Yes, MS has free Office document viewers, but plenty of people already have Acrobat Reader or another PDF viewer installed. Plus, if you don't run Windows you may be SOL if you need to view the document (maybe, maybe not on a Mac, definitely on other platforms). PDF is pretty ubiquitous and viewers are available for every common computing platform.
 Signature Steve Sobol, Victorville, CA PGP:0xE3AE35ED www.SteveSobol.com Geek-for-hire. Details: http://www.linkedin.com/in/stevesobol
Andrew Thompson - 15 Jan 2008 01:06 GMT > > Why would management prefer to put PDFs > > (which are higher bandwidth than the > > equivalent MS Doc.) on the site? > > So people without Microsoft Office can read them. ... Wow! Are you the OP's manager?
Small world!
-- Andrew T. PhySci.org
Steve Sobol - 15 Jan 2008 05:15 GMT >> > Why would management prefer to put PDFs >> > (which are higher bandwidth than the [quoted text clipped - 3 lines] > > Wow! Are you the OP's manager? Of course I'm not, I'm just presenting a possible (probable?) answer.
 Signature Steve Sobol, Victorville, CA PGP:0xE3AE35ED www.SteveSobol.com Geek-for-hire. Details: http://www.linkedin.com/in/stevesobol
Andrew Thompson - 15 Jan 2008 05:33 GMT > >> > Why would management prefer to put PDFs > >> > (which are higher bandwidth than the [quoted text clipped - 5 lines] > > Of course I'm not, I'm just presenting a possible (probable?) answer. Fair enough, but I'd prefer not to speculate. I'm waiting to hear the OP's (OK the manager's) *actual* reason(s).
-- Andrew T. PhySci.org
Martin Gregorie - 15 Jan 2008 15:55 GMT >>> Why would management prefer to put PDFs >>> (which are higher bandwidth than the >>> equivalent MS Doc.) on the site? >> So people without Microsoft Office can read them. ... That's a pretty good reason given that M$ don't supply a viewer for OSen other than Winders and haven't seen fit to support an Open Source version.
IMO that lack trumps the bandwidth criticism of PDF.
I'm never happy to see an MSOffice document released on the web when a PDF, web page or even a JPG scanned image could be used almost as easily.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Arne Vajhøj - 15 Jan 2008 02:05 GMT >> Thanks for the replies. That's very helpful. The reason I'm asked to >> research PDF conversion: the organization I work for posts documents [quoted text clipped - 6 lines] > (which are higher bandwidth than the > equivalent MS Doc.) on the site? There are 3 good reasons to put PDF's instead of DOC's up:
1) readonly (not fully true, but it does not open up in a program capable of modifying it)
2) in general works better on non-Windows platforms
3) does not contain "extra information" (*)
Arne
*) There was a little incident in Denmark a couple of years ago where the prime minister sent a speech to the press in DOC format. And the press looked at the document and could see that the DOC originally came from a man working in an industrial association. The IT department decided that all future speeches sent to the press would be in PDF format.
Lew - 15 Jan 2008 02:24 GMT > 3) does not contain "extra information" (*) > [quoted text clipped - 6 lines] > IT department decided that all future speeches send to the press > would be in PDF format. Curious, that the Microsoft format would actually provide more transparency and greater knowledge of other's attempts to obfuscate than another format.
I suspect OpenOffice docs would have that advantage over PDF as well.
Perhaps the Danes should demand that their leaders publish only in formats that provide such "extra information". Shoot, I'd love it if we could identify those for whom our politicians are mouthpieces where I live, too.
 Signature Lew
Joshua Cranmer - 15 Jan 2008 02:35 GMT >> PDF is an extremely rigid, final-proof-centric format. HTML is extremely >> loose and, even taking into account CSS through all current WDs > ... > > WD? That's a new one on me! My fault for assuming that people were well-acquainted with the specification process of the W3C. `WD' stands for `Working Draft' (i.e., this is only a rough draft and the final outcome may look nothing like this.) Other levels are CR (Candidate Recommendation, probably stable), PR (Proposed Recommendation, a level only requiring two open, independent implementations to proceed), and REC (Recommendation, the real deal).
 Signature Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
Andrew Thompson - 15 Jan 2008 02:50 GMT > >> PDF is an extremely rigid, final-proof-centric format. HTML is extremely > >> loose and, even taking into account CSS through all current WDs [quoted text clipped - 4 lines] > My fault for assuming that people were well-acquainted with the > specification process of the W3C. Uggh.. I could become buried in W3C abbreviations. Their site content sometimes reads like a list of abbreviations with an occasional word thrown in (purely for stylistic effect).
>...`WD' stands for `Working Draft' (i.e., > this is only a rough draft and the final outcome may look nothing like > this.) Other levels are CR (Candidate Recommendation, probably stable), > PR (Proposed Recommendation, a level only requiring two open, > independent implementations to proceed), and REC (Recommendation, the > real deal). Hm... How about we consider *standards* to be the 'real deal' and demote recommendations* to something slightly less?
* I find it somewhat irritating that to get the 'major players' onboard with W3C, they had to (AFAIR) decide they would only ever make 'recommendations'.
-- Andrew T. PhySci.org
Joshua Cranmer - 15 Jan 2008 02:56 GMT > Hm... How about we consider *standards* to > be the 'real deal' and demote recommendations* > to something slightly less? A W3C Recommendation = standard for all practical measures. If you really want to mince words, the basis of every major protocol (HTTP, FTP, NNTP, SMTP, POP, IMAP, TLS, TCP/IP, UDP, etc.) comes from the RFCs... "Requests for Comments". If the full documentation for HTTP is technically nothing more than a request for people to comment on, then a Recommendation is closer to an actual standard.
Then again, MS's attempts to get OOXML passed as an ISO standard are showing just how well the largest standards organization is doing with their standards. I would rather read JLS 3 over ES 3 (the current version of Javascript) any day.
P.S. Sorry for the burst of acronyms, but I really don't want to write out all of these names...
 Signature Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
RedGrittyBrick - 15 Jan 2008 10:56 GMT > If you > really want to mince words, the basis of every major protocol (HTTP, > FTP, NNTP, SMTP, POP, IMAP, TLS, TCP/IP, UDP, etc.) comes from the > RFCs... "Requests for Comments". If the full documentation for HTTP is > technically nothing more than a request for people to comment on, than a > Recommendation is closer to an actual standard. Not exactly true, RFCs can pass through a standards-track process that assigns a "status" to them:
"A specification that reaches the status of Standard is assigned a number in the STD series while retaining its RFC number." - IETF [1]
IETF STD-1 says that RFC 2616 (HTTP) currently has status "Draft Standard Protocol"
Whether an IETF "draft standard" like HTTP is closer to a "standard" than a W3C "recommendation" like HTML is something I don't wish to comment on :-)
[1] http://tools.ietf.org/html/rfc2026#section-4.1.3
Joshua Cranmer - 15 Jan 2008 22:34 GMT > Not exactly true, RFCs can pass through a standards-track process that > assigns a "status" to them: [quoted text clipped - 4 lines] > IETF STD-1 says that RFC 2616 (HTTP) currently has status "Draft > Standard Protocol" I know about the draft standard process--I actually have all of the drafts of RFC 3977 (the update of NNTP) since it relates to another project. I was mostly continuing the joke on the actual meanings of the names.
> Whether an IETF "draft standard" like HTTP is closer to a "standard" > than a W3C "recommendation" like HTML is something I don't wish to > comment on :-) Actual IETF drafts are not too close: an implementation of draft 15 of RFC 3977 would have some problems conforming to the actual RFC. I'm not sure about the numbered RFCs labeled "Draft" though.
P.S. I don't think it's a coincidence that RFC 2822 updates RFC 822 and RFC 3977 updates RFC 977...
 Signature Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
Arne Vajhøj - 16 Jan 2008 03:25 GMT >> If you really want to mince words, the basis of every major protocol >> (HTTP, FTP, NNTP, SMTP, POP, IMAP, TLS, TCP/IP, UDP, etc.) comes from [quoted text clipped - 14 lines] > than a W3C "recommendation" like HTML is something I don't wish to > comment on :-) The STD process is a later addon.
Arne
Arne Vajhøj - 15 Jan 2008 02:57 GMT >> ...`WD' stands for `Working Draft' (i.e., >> this is only a rough draft and the final outcome may look nothing like [quoted text clipped - 11 lines] > (AFAIR) decide they would only ever make > 'recommendations'. Considering that the internet is built on requests for comments, recommendations are not that bad!
:-) Arne
Lew - 15 Jan 2008 03:10 GMT > Considering that the internet is build on requests for > comments, then recommendations is not that bad ! It's not the only context where a "recommendation" carries the force of a mandate. When my boss at work "recommends" that I take care of something, I could be unemployed if I decide I don't need to worry about that little thing, for example.
Actually, the word "recommendation" is quite apt. Take TCP/IP for example. There were, and most likely still are all kinds of protocols that one could use instead. One doesn't have to use TCP/IP - but it is recommended.
 Signature Lew
Joshua Cranmer - 14 Jan 2008 22:20 GMT >>> My boss asked me to research this question. > >> Did you dare to ask the point of this exercise? >> (Or are you just taking the money?*) > > I'm curious why you would consider this any of your business? There is an implicit requirement on Usenet--we, the responders, have full rights to criticize the methodology of any poster.
It also happens that, fairly often, the root problem can be more easily solved by a different methodology than the OP wants to use. This comes up quite frequently in the case of reflection in Java: most of the time, the best answer is to use something else.
> Maybe the OP's boss is one of those guys who is always thinking and > wondering about stuff like, "gee, I wonder if there's a way to..., hey > OP, how 'bout checking something out for me..." Once upon a time I had > a boss like that and the diversity of assignments was incredibly > satisfying, and educational. Is this the case with the OP right now?
For future reference, we only know what you tell us about the problem, and must therefore assume the rest. The proper response for "I need XXX to be done in YYY way" is going to be different than "Is YYY a suitable way to do XXX?"
>> - About all that PDFs are good for is page layout. > > What about sharing documents with others without having to consider > which version of MS Office they may be running or whether they even have > Office running or whether their version of Open Office can read the > newest Word document? I would recommend RTFs, but OOo tends to quickly munge these documents. In general, a Word 95 document should be supported by anyone who cares. Hell, MS even has the reference for one of its early Word file formats!
> Which leads me to wonder, what was the point of your response??? To point out that there might be other means to solve the unstated core problem than the way the OP has asked for.
I read once in a guideline for asking questions that the second of these two questions is preferred:
"Hi, I think I have a hairline crack on my motherboard; how would I check?"
"Hi, I am having a problem with my computer. I am getting random memory errors, [etc.]. What may be causing these problems, and how would I check?"
The question the OP asked was in the style of the former, that is, assuming the answer and asking it. I suspect that Andrew was attempting to glean the sort of information provided in the latter style.
 Signature Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
AL - 14 Jan 2008 23:36 GMT >>>> My boss asked me to research this question.
>>> Did you dare to ask the point of this exercise? >>> (Or are you just taking the money?*)
>> I'm curious why you would consider this any of your business?
> There is an implicit requirement on Usenet--we, the responders, have > full rights to criticize the methodology of any poster. I recognize that right and freely exercise it myself.
>> Maybe the OP's boss is one of those guys who is always thinking and >> wondering about stuff like, "gee, I wonder if there's a way to..., hey >> OP, how 'bout checking something out for me..." Once upon a time I >> had a boss like that and the diversity of assignments was incredibly >> satisfying, and educational.
> Is this the case with the OP right now? The OP's exact circumstances are not known, so the sarcasm about "just taking the money" is pointless.
> For future reference, we only know what you tell us about the problem, > and must therefore assume the rest. The proper response for "I need XXX > to be done in YYY way" is going to be different than "Is YYY a suitable > way to do XXX?" No argument there. However, in the event the boss has already determined this to be the suitable way, (shall we also interrogate the boss to determine his/her qualifications to make that determination?), the OP's assignment is to find out how to get it done - some assignments are like that.
>>> - About all that PDFs are good for is page layout.
>> What about sharing documents with others without having to consider >> which version of MS Office they may be running or whether they even >> have Office running or whether their version of Open Office can read >> the newest Word document?
> I would recommend RTFs, but OOo tends to quickly munge these documents. You just identified an incompatibility that PDF's avoid.
> In general, a Word 95 document should be supported by anyone who cares. So, your advice is put it out there in that format and damn those who "don't care" ? I can see the OP going back to the boss saying "just put it out there in Word, Excel, Powerpoint format and f*** 'em if they can't take a joke."
>> Which leads me to wonder, what was the point of your response???
> To point out that there might be other means to solve the unstated core > problem than the way the OP has asked for. Why can't it be accepted that *maybe* the alternatives have been weighed and this is what the client needs?
> I read once in a guideline for asking questions that the second of these > two questions is preferred: [quoted text clipped - 3 lines] > "Hi, I am having a problem with my computer. I am getting random memory > errors, [etc.]. What may be causing these problems, and how would I check?" Or maybe, "Hi, I've diagnosed a problem with my computer and determined I need a new motherboard, can you advise me the best way to replace it?"
AL
Arne Vajhøj - 15 Jan 2008 01:59 GMT >> Did you dare to ask the point of this exercise? >> (Or are you just taking the money?*) > > I'm curious why you would consider this any of your business? If the OP wants help with no questions asked then he should hire a consultant for 100 USD/h (or whatever).
If the OP wants free help he will have to accept that people will ask questions - maybe to better understand the problem, maybe because they have a similar problem, maybe because they are just curious.
> Maybe the OP's boss is one of those guys who is always thinking and > wondering about stuff like, "gee, I wonder if there's a way to..., hey [quoted text clipped - 3 lines] > to hell, you don't pay me enough to do that crap without a 30 page > RFI..." Oh, what a stellar employee you must be. The lack of applicability of your analogy to the situation here says a bit about you as an employee.
Arne
Lew - 15 Jan 2008 02:30 GMT >>> Did you dare to ask the point of this exercise? >>> (Or are you just taking the money?*) [quoted text clipped - 8 lines] > because they have a similar problem, maybe because they are > just curious. One thing about free advice - no matter how bad it is, it's worth what you paid for it.
If one doesn't like Andrew's or anyone else's answers here, they're welcome to demand a refund of what they paid for them.
 Signature Lew
Arne Vajhøj - 15 Jan 2008 02:08 GMT > My boss asked me to research this question. He wants me to write a > script / program that will convert a directory of Word, Excel, and [quoted text clipped - 7 lines] > Does anyone know of a Java library that I could use for this? Or any > library in any language? PHP? Perl? I would go for whatever Microsoft and Adobe have to do this.
Sure you can find a Perl script somewhere that can convert 95% of the docs to readable but not very good looking PDF. And it will break with the next Word version. And the author is no longer maintaining it.
Arne
Roedy Green - 15 Jan 2008 10:55 GMT On Mon, 14 Jan 2008 05:31:05 -0800 (PST), Eeby <elektrophyte@yahoo.com> wrote, quoted or indirectly quoted someone who said :
>Does anyone know of a Java library that I could use for this? Or any >library in any language? PHP? Perl? see http://mindprod.com/jgloss/pdf.html
There are lots of links and also a link to Marco Schmidt's list of pdf links. You should find something in there.
 Signature Roedy Green, Canadian Mind Products The Java Glossary, http://mindprod.com
Lew - 15 Jan 2008 13:39 GMT > On Mon, 14 Jan 2008 05:31:05 -0800 (PST), Eeby > <elektrophyte@yahoo.com> wrote, quoted or indirectly quoted someone [quoted text clipped - 4 lines] > > see http://mindprod.com/jgloss/pdf.html iText is an excellent PDF-generation package for Java. It isn't for the "read MS Office" side of things, though. <http://www.lowagie.com/iText/>
 Signature Lew
Gordon Beaton - 15 Jan 2008 13:48 GMT > My boss asked me to research this question. He wants me to write a > script / program that will convert a directory of Word, Excel, and > PowerPoint documents into PDFs. If you are happy using OpenOffice to do the conversion (OO isn't 100% compatible), then maybe this will help:
http://www.togaware.com/linux/survivor/Convert_MS_Word.html
/gordon
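If a command-line OpenOffice or LibreOffice is installed on the server, one way to script the batch job is to call it from Java, once per file. The sketch below assumes a `soffice` binary on the PATH that accepts `--headless --convert-to pdf --outdir`. Recent LibreOffice releases do; an OpenOffice install from early 2008 may not, and the togaware page above describes its own approach. The class and method names are hypothetical, and the extension list is only an example.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical batch converter that shells out to soffice for each document.
public class SofficeBatchConvert {

    // Builds the command line for converting one file to PDF.
    // Assumes "soffice" is on the PATH and supports these options.
    static List<String> buildCommand(File input, File outDir) {
        List<String> cmd = new ArrayList<String>();
        cmd.add("soffice");
        cmd.add("--headless");
        cmd.add("--convert-to");
        cmd.add("pdf");
        cmd.add("--outdir");
        cmd.add(outDir.getPath());
        cmd.add(input.getPath());
        return cmd;
    }

    // Runs the conversion and returns the process exit code.
    static int convert(File input, File outDir)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(input, outDir));
        pb.inheritIO(); // show soffice output on our own console
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: SofficeBatchConvert <inputDir> <outputDir>");
            return;
        }
        File inDir = new File(args[0]);
        File outDir = new File(args[1]);
        File[] files = inDir.listFiles();
        if (files == null) {
            System.err.println("Not a readable directory: " + inDir);
            return;
        }
        for (File f : files) {
            String name = f.getName().toLowerCase(Locale.ROOT);
            boolean office = name.endsWith(".doc")
                    || name.endsWith(".xls")
                    || name.endsWith(".ppt");
            if (f.isFile() && office) {
                int exit = convert(f, outDir);
                System.out.println(f.getName() + " -> exit code " + exit);
            }
        }
    }
}
```

Converting one file per process is slow but keeps a failure in one document from stopping the rest of the directory. As noted above, the output will not always match what MS Office itself would produce.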