Java Forum / General / August 2006


extract text from a PDF file with JAVA

Sergio - 02 Aug 2006 15:19 GMT
Hi to all the newsgroup, this is my first post.
I'm trying to extract text from PDF files with Java.
I'm looking for some example code, a tutorial, a guide or similar.

For the moment I'm using the PDFBox library, but I've noticed a lot of errors
in its PDF parsing.
So I tried the "Pjx" library and found some good example code on
this site:
http://www.jguru.com/faq/view.jsp?EID=1074237
...but I can't find a way to call the "PdfParser.getContents()" method.

I would appreciate any advice.
Thanks in advance.

Sergio.
Oliver Wong - 02 Aug 2006 17:31 GMT
> So I tried the "Pjx" library and found some good example code on
> this site:
> http://www.jguru.com/faq/view.jsp?EID=1074237
> ...but I can't find a way to call the "PdfParser.getContents()" method.

   How can you "not find a way" to call a specific method? What did you
type and what error message was produced?

   - Oliver
Lars Enderin - 02 Aug 2006 17:47 GMT
Oliver Wong wrote:

>> So I tried the "Pjx" library and found some good example code on
>> this site:
[quoted text clipped - 3 lines]
>    How can you "not find a way" to call a specific method? What did you
> type and what error message was produced?

The method is declared private. It's not supposed to be called from
outside the class.
Sergio - 02 Aug 2006 18:14 GMT
Lars Enderin wrote:

> The method is declared private. It's not supposed to be called from
> outside the class.

First of all, thanks for the answers.
I made that method public before calling it.
My calling code is this (very simple):

File f = new File("sample.pdf");
// PdfParser here is the example class from the jGuru page,
// not com.etymon.pj.PdfParser (see the stack trace below)
PdfParser p = new PdfParser();
Document doc = p.parse(f);
String text = p.getContents();

These are the errors displayed on the console:

Exception in thread "main" java.lang.ClassCastException: java.lang.String
    at com.etymon.pj.PdfParser.parse(PdfParser.java:427)
    at com.etymon.pj.PdfParser.getNextXref(PdfParser.java:67)
    at com.etymon.pj.PdfParser.getXref(PdfParser.java:57)
    at com.etymon.pj.PdfParser.getObjects(PdfParser.java:12)
    at com.etymon.pj.Pdf.readFromFile(Pdf.java:1227)
    at com.etymon.pj.Pdf.<init>(Pdf.java:32)
    at PdfParser.getContents(PdfParser.java:82)
    at PdfParser.parse(PdfParser.java:47)
    at PdfParser.parse(PdfParser.java:29)
    at Prova.main(Prova.java:31)

Thanks in advance for your interest.

Sergio.
Oliver Wong - 02 Aug 2006 18:34 GMT
> Lars Enderin wrote:
>
[quoted text clipped - 27 lines]
>
> Thanks in advance for your interest.

   Please show the parse method of the file com.etymon.pj.PdfParser. Be
sure to include line 427.

   - Oliver
Sergio - 02 Aug 2006 19:38 GMT
>     Please show the parse method of the file com.etymon.pj.PdfParser. Be
> sure to include line 427.
>
>     - Oliver

As requested, here is the parse method of com.etymon.pj.PdfParser.
It's quite long... line 427 is the return statement at the end of the
method.
Thanks again.

    public static PjObject parse(Pdf pdf, RandomAccessFile raf, long[][] xref,
                                 byte[] data, int start)
        throws IOException, PjException {
        PdfParserState state = new PdfParserState();
        state._data = data;
        state._pos = start;
        state._stream = -1;
        Stack stack = new Stack();
        boolean endFlag = false;
        while ( ( ! endFlag ) && (getToken(state)) ) {
            if (state._stream != -1) {
                stack.push(state._streamToken);
                state._stream = -1;
            }
            else if (state._token.equals("startxref")) {
                endFlag = true;
            }
            else if (state._token.equals("endobj")) {
                endFlag = true;
            }
            else if (state._token.equals("%%EOF")) {
                endFlag = true;
            }
            else if (state._token.equals("endstream")) {
                byte[] stream = (byte[])(stack.pop());
                PjStreamDictionary pjsd = new PjStreamDictionary(
                    ((PjDictionary)(stack.pop())).getHashtable());
                PjStream pjs = new PjStream(pjsd, stream);
                stack.push(pjs);
            }
            else if (state._token.equals("stream")) {
                // get length of stream
                PjObject obj = ((PjObject)(
                    (((PjDictionary)(stack.peek())).
                    getHashtable().
                           get(new PjName("Length")))));
                if (obj instanceof PjReference) {
                    obj = getObject(pdf, raf, xref,
                            ((PjReference)(obj)).getObjNumber().getInt());
                }
                state._stream =
                    ((PjNumber)(obj)).getInt();

                // the following if() clause added to
                // handle the case of "Length" being
                // incorrect (larger than the actual
                // stream length)
                if ( state._stream >
                    (state._data.length - state._pos)
                    ) {
                    state._stream =
                        state._data.length -
                        state._pos - 17;
                }

                if (state._pos < state._data.length) {
                    if ((char)(state._data[state._pos]) == '\r') {
                        state._pos++;
                    }
                    if ( (state._pos < state._data.length) &&
                        ((char)(state._data[state._pos]) ==
                         '\n') ) {
                        state._pos++;
                    }
                }
            }
            else if (state._token.equals("null")) {
                stack.push(new PjNull());
            }
            else if (state._token.equals("true")) {
                stack.push(new PjBoolean(true));
            }
            else if (state._token.equals("false")) {
                stack.push(new PjBoolean(false));
            }
            else if (state._token.equals("R")) {
                // we ignore the generation number
                // because all objects get reset to
                // generation 0 when we collapse the
                // incremental updates
                stack.pop();  // the generation number
                PjNumber obj = (PjNumber)(stack.pop());
                stack.push(new PjReference(obj, PjNumber.ZERO));
            }
            else if ( (state._token.charAt(0) == '<') &&
                 (state._token.startsWith("<<") == false) ) {
                stack.push(new PjString(PjString.decodePdf(state._token)));
            }
            else if (
                (Character.isDigit(state._token.charAt(0)))
                || (state._token.charAt(0) == '-')
                || (state._token.charAt(0) == '.') ) {
                stack.push(new PjNumber(new Float(state._token).floatValue()));
            }
            else if (state._token.charAt(0) == '(') {
                stack.push(new PjString(PjString.decodePdf(state._token)));
            }
            else if (state._token.charAt(0) == '/') {
                stack.push(new PjName(state._token.substring(1)));
            }
            else if (state._token.equals(">>")) {
                boolean done = false;
                Object obj;
                Hashtable h = new Hashtable();
                while ( ! done ) {
                    obj = stack.pop();
                    if ( (obj instanceof String) &&
                        (((String)obj).equals("<<")) ) {
                        done = true;
                    } else {
                        h.put((PjName)(stack.pop()),
                             (PjObject)obj);
                    }
                }
                // figure out what kind of dictionary we have
                PjDictionary dictionary = new PjDictionary(h);
                if (PjPage.isLike(dictionary)) {
                    stack.push(new PjPage(h));
                }
                else if (PjPages.isLike(dictionary)) {
                    stack.push(new PjPages(h));
                }
                else if (PjFontType1.isLike(dictionary)) {
                    stack.push(new PjFontType1(h));
                }
                else if (PjFontDescriptor.isLike(dictionary)) {
                    stack.push(new PjFontDescriptor(h));
                }
                else if (PjResources.isLike(dictionary)) {
                    stack.push(new PjResources(h));
                }
                else if (PjCatalog.isLike(dictionary)) {
                    stack.push(new PjCatalog(h));
                }
                else if (PjInfo.isLike(dictionary)) {
                    stack.push(new PjInfo(h));
                }
                else if (PjEncoding.isLike(dictionary)) {
                    stack.push(new PjEncoding(h));
                }
                else {
                    stack.push(dictionary);
                }
            }
            else if (state._token.equals("]")) {
                boolean done = false;
                Object obj;
                Vector v = new Vector();
                while ( ! done ) {
                    obj = stack.pop();
                    if ( (obj instanceof String) &&
                        (((String)obj).equals("[")) ) {
                        done = true;
                    } else {
                        v.insertElementAt((PjObject)obj, 0);
                    }
                }
                // figure out what kind of array we have
                PjArray array = new PjArray(v);
                if (PjRectangle.isLike(array)) {
                    stack.push(new PjRectangle(v));
                }
                else if (PjProcSet.isLike(array)) {
                    stack.push(new PjProcSet(v));
                }
                else {
                    stack.push(array);
                }
            }
            else if (state._token.startsWith("%")) {
                // do nothing
            }
            else {
                stack.push(state._token);
            }
        }
    /*line 427*/    return (PjObject)(stack.pop());
    }
Sergio - 02 Aug 2006 19:51 GMT
I've uploaded the pjx library to this site
http://rapidshare.de/files/27945483/pjx-1.4.0.jar.html
I think it could be useful.
Thanks for all your help.
Sergio.
Oliver Wong - 02 Aug 2006 20:01 GMT
[OP has a ClassCastException on line 427; the actual class is String]

>>     Please show the parse method of the file com.etymon.pj.PdfParser. Be
>> sure to include line 427.
[quoted text clipped - 9 lines]
> public static PjObject parse(Pdf pdf, RandomAccessFile raf, long[][]
> xref, byte[] data, int start)
[...]
> Stack stack = new Stack();
[...]
> stack.push(state._streamToken);
[...]
> byte[] stream = (byte[])(stack.pop());
> PjStreamDictionary pjsd = new PjStreamDictionary(
> ((PjDictionary)(stack.pop())).getHashtable());
> PjStream pjs = new PjStream(pjsd, stream);
> stack.push(pjs);
[...]
> /*line 427*/ return (PjObject)(stack.pop());

   This code is extremely messy in that it pushes objects of all sorts of
different types onto the stack. I wouldn't be surprised if this were
generated code instead of hand written.

   If this is your code, you've got a bug and you need to fix it. If it's
someone else's code, then you should write up an SSCCE demonstrating the bug
and submit it to them. See http://mindprod.com/jgloss/sscce.html

   - Oliver
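
To illustrate the point about the mixed-type stack: in the posted parse method, any token that no branch recognises is pushed as a plain String, and the return on line 427 casts whatever is on top to PjObject. If a String is left on top, that cast throws ClassCastException: java.lang.String, which matches the trace. The sketch below reproduces only that shape. MixedStackDemo, Node and NumberNode are hypothetical stand-ins, not part of pjx, and the checked variant is just one possible way to get a clearer error.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for the parser's stack handling; not the pjx API.
class MixedStackDemo {
    interface Node {}

    static final class NumberNode implements Node {
        final float value;
        NumberNode(float value) { this.value = value; }
    }

    // Recognised tokens (here: anything starting with a digit) become Node
    // objects; everything else is pushed as a raw String, as in the
    // final else branch of the posted method.
    static Object parseTop(String... tokens) {
        Deque<Object> stack = new ArrayDeque<>();
        for (String token : tokens) {
            if (!token.isEmpty() && Character.isDigit(token.charAt(0))) {
                stack.push(new NumberNode(Float.parseFloat(token)));
            } else {
                stack.push(token);
            }
        }
        return stack.pop();
    }

    // Blind cast, like the return statement on line 427.
    static Node uncheckedResult(String... tokens) {
        return (Node) parseTop(tokens);
    }

    // Same result, but a leftover String is reported with the offending token.
    static Node checkedResult(String... tokens) {
        Object top = parseTop(tokens);
        if (!(top instanceof Node)) {
            throw new IllegalStateException("Unexpected token left on stack: " + top);
        }
        return (Node) top;
    }

    public static void main(String[] args) {
        System.out.println(uncheckedResult("12") instanceof NumberNode);
        try {
            uncheckedResult("12", "xref");
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed: " + e.getMessage());
        }
        try {
            checkedResult("12", "xref");
        } catch (IllegalStateException e) {
            System.out.println("checked version: " + e.getMessage());
        }
    }
}
```

The checked version does not fix anything by itself, but it names the token that was left over, which is usually the first thing you want to know about the input file.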
Sergio - 02 Aug 2006 20:48 GMT
Oliver Wong wrote:

>     This code is extremely messy in that it pushes objects of all sorts of
> different types onto the stack. I wouldn't be surprised if this were
[quoted text clipped - 3 lines]
> someone else's code, then you should write up an SSCCE demonstrating the bug
> and submit it to them. See http://mindprod.com/jgloss/sscce.html

The parse method comes from the pjx library... the only code I've written
is the calling code, and I think the problem is in that code.
Thanks for your help.
Sergio.
Chris Uppal - 03 Aug 2006 08:51 GMT
> i've made that method public before calling it.

And you are surprised to find that it doesn't work?

Presumably the author made that method private for a reason -- for instance it
may depend on certain kinds of initialisation being done first.  Why not
explore the library for the /correct/ way to use it for what you want?  If you
find there isn't a way, then you could drop a line to the author suggesting an
enhancement -- which would probably be more welcome if you can supply /working/
code too.

   -- chris
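
A minimal sketch of the kind of dependency described above, assuming a helper that only works after the public entry point has run. InitOrderDemo and TextExtractor are hypothetical and only imitate the situation; they are not the jGuru class or the pjx API. The helper is left package-visible here to mimic the "made it public" edit.

```java
// Hypothetical example of a helper that relies on state set up elsewhere.
class InitOrderDemo {
    static final class TextExtractor {
        private String loaded; // set by parse(), read by getContents()

        // Intended entry point: initialises state, then delegates.
        String parse(String rawDocument) {
            loaded = rawDocument.trim();
            return getContents();
        }

        // The author would normally keep this private; it is only safe
        // to call once parse() has populated 'loaded'.
        String getContents() {
            if (loaded == null) {
                throw new IllegalStateException("getContents() called before parse()");
            }
            return loaded;
        }
    }

    public static void main(String[] args) {
        // Going through the public entry point works.
        System.out.println(new TextExtractor().parse("  hello  "));

        // Calling the helper directly on a fresh object does not.
        try {
            new TextExtractor().getContents();
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

Widening the access modifier changes who may call the method, not what the method needs in order to work.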

