> Apologies in advance if my question is silly or trivial. I'm trying to
> write a servlet that reads data from another source in byte[] form and,
[quoted text clipped - 5 lines]
> certain string patterns in the data, and manipulate that information to
> generate a new data stream.
> Not a trivial question, exactly, but a bit on the
> vague side. It's difficult for me to tell what you're
> having difficulty with.
Sorry, my question *was* very vaguely worded.
> I'm going to assume that you're stuck getting started.
>
[quoted text clipped - 9 lines]
> byte of output from the object responsible for doing the
> search-and-replace manipulation.
My problem is that the conversion of the input stream into
character/String data doesn't give me anything meaningful - not enough
to parse and manipulate, at any rate. I suppose what I'm wondering is
whether there's any reference material that describes how an encoded
input stream of data (be it for Excel or PDF) can be "translated" into
a String representation in order to do basic String manipulations, and
then re-encoded and passed on to the next application.
Matt Humphrey - 29 Nov 2006 12:42 GMT
<snip>
>> To parse a portion of a byte[] as a String, you can
>> just use the constructors in the String class.
[quoted text clipped - 11 lines]
> a String representation in order to do basic String manipulations, and
> then re-encoded and passed on to the next application.
Source data like Excel and GIFs don't have any natural string equivalent and
cannot be "parsed" in the sense of parsing strings. PDF is largely text but
may have some segments in binary--I don't know offhand how the binary parts
work. To "parse" true binary files you have to know the file structure.
You can go to http://www.wotsit.org/ to get information on file formats.
Matt Humphrey matth@ivizNOSPAM.com http://www.iviz.com/
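To illustrate what "knowing the file structure" means in practice, here is a minimal Java sketch that reads the fixed-layout start of a GIF file: a 6-byte ASCII signature ("GIF87a" or "GIF89a") followed by the width and height as 16-bit little-endian integers. The class name GifHeader and its fields are made up for this example and do not come from any library; other formats need their own field-by-field readers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration: parses only the first 10 bytes of a GIF.
final class GifHeader {
    final String signature; // "GIF87a" or "GIF89a"
    final int width;        // logical screen width in pixels
    final int height;       // logical screen height in pixels

    private GifHeader(String signature, int width, int height) {
        this.signature = signature;
        this.width = width;
        this.height = height;
    }

    static GifHeader parse(byte[] data) {
        if (data.length < 10) {
            throw new IllegalArgumentException("too short for a GIF header");
        }
        // The signature is plain ASCII, so this one field really is a string.
        String sig = new String(data, 0, 6, StandardCharsets.US_ASCII);
        if (!sig.equals("GIF87a") && !sig.equals("GIF89a")) {
            throw new IllegalArgumentException("not a GIF signature: " + sig);
        }
        // The next four bytes are two unsigned 16-bit little-endian integers.
        ByteBuffer buf = ByteBuffer.wrap(data, 6, 4).order(ByteOrder.LITTLE_ENDIAN);
        int width = buf.getShort() & 0xFFFF;
        int height = buf.getShort() & 0xFFFF;
        return new GifHeader(sig, width, height);
    }
}
```

Only the signature is text here; the dimensions are raw binary numbers, which is why converting the whole file to a String gives nothing useful.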
Mark Jeffcoat - 29 Nov 2006 16:03 GMT
> My problem is that the conversion of the input stream into
> character/String data doesn't give me anything meaningful - not enough
[quoted text clipped - 3 lines]
> a String representation in order to do basic String manipulations, and
> then re-encoded and passed on to the next application.
Yeah, okay. I gave you a strategy that will work if you've
got some Strings in an encoding you already understand surrounded
by other miscellaneous bytes that you can ignore; if that's
not the case (which it surely can be, if the binary format
is trying to be clever with how it stores text), you have
a harder problem.
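As a rough sketch of the strategy just described, the following Java decodes one slice of a byte[] with a known encoding and splices a replacement back in, leaving the surrounding bytes untouched. The class SliceDecoder and its method names are invented for this example, and the offset and length are assumed to be known already. Note that if the replacement changes the total length, any length fields or offsets stored elsewhere in the file would no longer match, which this sketch does not address.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; offsets are assumed to be known.
final class SliceDecoder {

    // Decode only the bytes in [offset, offset + length) as text.
    static String decode(byte[] data, int offset, int length, Charset charset) {
        return new String(data, offset, length, charset);
    }

    // Return a copy of data with the slice replaced by the encoded replacement.
    static byte[] replaceSlice(byte[] data, int offset, int length,
                               String replacement, Charset charset) {
        byte[] encoded = replacement.getBytes(charset);
        byte[] out = new byte[data.length - length + encoded.length];
        // Bytes before the slice are copied unchanged.
        System.arraycopy(data, 0, out, 0, offset);
        // The re-encoded text takes the place of the old slice.
        System.arraycopy(encoded, 0, out, offset, encoded.length);
        // Bytes after the slice are copied unchanged.
        System.arraycopy(data, offset + length, out, offset + encoded.length,
                         data.length - offset - length);
        return out;
    }
}
```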
The first thing I'd do is run the Unix program "strings"
(which you can surely find for Windows, if you have to) on
some of the files you're interested in, and see if you're
in the happy case. (It sounds like you've already done
something like that, but a quick second opinion won't hurt.)
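If "strings" isn't handy, a rough Java approximation of the same check is easy to write: scan the bytes and collect runs of printable ASCII of at least some minimum length (the real tool defaults to 4). This is only a sketch; the class name PrintableRuns is made up, and it looks for single-byte ASCII only, so text stored as UTF-16 or inside compressed sections will not show up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration: a simplified take on the Unix "strings" idea.
final class PrintableRuns {

    static List<String> find(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            // 0x20..0x7E is the printable ASCII range; bytes >= 0x80 are negative in Java.
            if (b >= 0x20 && b <= 0x7E) {
                current.append((char) b);
            } else {
                // A non-printable byte ends the current run.
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Keep a run that extends to the end of the data.
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }
}
```

If the patterns you want to replace appear in the output, you are probably in the happy case; if the output is mostly noise, the text is likely compressed or otherwise encoded.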
If not, you'll have to handle each format you want to
parse separately. I really like the POI library for handling
Excel documents in Java.
http://jakarta.apache.org/poi/
There is surely something similar for PDF, but I've
never had the need of it; your Google will be as
good as mine.

Mark Jeffcoat
Austin, TX
topcat.nyc@googlemail.com - 30 Nov 2006 08:55 GMT
> Yeah, okay. I gave you a strategy that will work if you've
> got some Strings in an encoding you already understand surrounded
> by other miscellaneous bytes that you can ignore; if that's
> not the case (which it surely can be, if the binary format
> is trying to be clever with how it stores text), you have
> a harder problem.
I figured out what the problem with the PDF data was. The binary stream
that I read in gives me PDF data in compressed form, which I discovered
after running a few tests on it. I downloaded a free tool, pdftk, to
help me uncompress the source data stream, perform my text
manipulations, and then recompress the modified data before passing
it on.
Thanks for your help, guys! I really appreciate it.
- tc
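For anyone who lands on this thread later: the uncompress, edit, recompress cycle described above can be sketched in plain Java with java.util.zip, assuming the compressed data is a zlib/deflate stream (the scheme behind PDF's common FlateDecode filter). That assumption is not confirmed anywhere in the thread, where pdftk did the actual work. The class FlateRoundTrip and its methods are invented for this example, and it handles a single stream's bytes only; a real PDF would also need its stream lengths and cross-reference offsets updated, which is what a tool or library takes care of.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration: works on one zlib stream, not a whole PDF.
final class FlateRoundTrip {

    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read means the stream is unusable.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a zlib stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
        } finally {
            deflater.end();
        }
        return out.toByteArray();
    }

    // ISO-8859-1 maps every byte to exactly one char, so non-text bytes survive the round trip.
    static byte[] rewrite(byte[] compressed, String from, String to) {
        String text = new String(inflate(compressed), StandardCharsets.ISO_8859_1);
        return deflate(text.replace(from, to).getBytes(StandardCharsets.ISO_8859_1));
    }
}
```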