I am trying to parse a zipped XML file (an OpenDocument spreadsheet). The
XML consists of one long line with no line breaks.
The SAX parser passes characters() a character array of only 2048
characters. When an element's text spans the end of that array, the parser
makes a second call to characters(), so the character data ends up split
into two pieces.
What can I do to fix this?
Here's the pertinent portion of my code:
ZipFile zf;
DefaultHandler handler = new ParseHandler();
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
try {
    zf = new ZipFile(DATA_FILE_NAME);
    SAXParser saxParser = factory.newSAXParser();
    saxParser.parse(zf.getInputStream(zf.getEntry(CONTENT_FILE_NAME)),
                    handler);
} catch ...
TIA
>I am trying to parse a zipped XML file (open document spreadsheet). It is
> composed of one long line of code.
[quoted text clipped - 18 lines]
>
> TIA
In your implementation of characters(), you need to use a StringBuffer:
StringBuffer buf = new StringBuffer();

public void characters(char[] ch, int start, int length)
{
    buf.append(ch, start, length);
}
Depending on the structure of the XML you're parsing, you may need to keep a
stack of StringBuffers or pull other tricks so that characters() picks up
the right StringBuffer to append to.
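To illustrate the "stack of StringBuffers" idea, here is a minimal sketch. It assumes you want the text of every element, including nested ones; the class name StackedTextHandler and the "qName=text" result format are made up for this example and are not part of SAX.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps one StringBuffer per open element.
class StackedTextHandler extends DefaultHandler {
    // Top of the stack is the buffer of the innermost open element.
    private final Deque<StringBuffer> stack = new ArrayDeque<>();
    // One "qName=text" entry per closed element (format chosen for the demo).
    final List<String> results = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        stack.push(new StringBuffer());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element; always append.
        if (!stack.isEmpty()) {
            stack.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The element is complete, so its buffer now holds all of its text.
        results.add(qName + "=" + stack.pop());
    }
}
```

Each element only sees its own direct text, because characters() always appends to the buffer on top of the stack. If you only care about one kind of element, a single buffer that is reset in startElement is simpler.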
Duane Evenson - 27 May 2006 12:50 GMT
This isn't the problem, or at least not the solution. It would result in one
string buffer containing all the spreadsheet cells concatenated together.
I want to process each cell separately.
Here is a code fragment from my program and the output:
public void characters(char buf[], int offset, int len)
        throws SAXException {
    String str = new String(buf, offset, len);
    System.out.println("buf.length: " + buf.length + " offset: " + offset
            + " len: " + len + " str: " + str);
}
# characters() should be called once for each spreadsheet cell
buf.length: 2048 offset: 525 len: 10 str: 24/12/1999
buf.length: 2048 offset: 635 len: 10 str: Overwaitea
buf.length: 2048 offset: 726 len: 9 str: Groceries
buf.length: 2048 offset: 835 len: 4 str: 4.99
buf.length: 2048 offset: 920 len: 3 str: CAD
buf.length: 2048 offset: 1004 len: 8 str: BoM - MC
buf.length: 2048 offset: 1093 len: 1 str: x
buf.length: 2048 offset: 1175 len: 9 str: Groceries
buf.length: 2048 offset: 1265 len: 1 str: x
buf.length: 2048 offset: 1401 len: 1 str: x
buf.length: 2048 offset: 1570 len: 10 str: 30/12/1999
buf.length: 2048 offset: 1680 len: 7 str: Gas Bar
buf.length: 2048 offset: 1768 len: 3 str: Gas
buf.length: 2048 offset: 1872 len: 5 str: 10.51
buf.length: 2048 offset: 1958 len: 3 str: CAD
buf.length: 2048 offset: 2042 len: 6 str: BoM -
# Note how the string is split across calls to characters
# and how it happens at the end of the character array.
buf.length: 2048 offset: 0 len: 2 str: MC
buf.length: 2048 offset: 83 len: 1 str: x
buf.length: 2048 offset: 165 len: 3 str: Gas
...
I need to find some way to overcome this segmentation of the input data.
William Brogden - 27 May 2006 16:34 GMT
The problem is that characters() may be called more than once while
parsing a single element. You should create a StringBuffer when you get
startElement for the element you want to capture, append the data from
every characters() call to it, and convert it to a String ONLY when you
get the endElement call.
The reason is that a SAX parser is allowed to deliver character data in
several chunks, and this one calls characters() whenever it reaches the
end of a bufferload, so any element whose text is split over more than
one bufferload will get multiple calls.
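Roughly, that could look like the sketch below. The element name "cell", the class CellTextHandler and its parse() helper are assumptions for illustration only; in a real OpenDocument content.xml you would test for whichever element actually holds the cell text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the complete text of every <cell> element.
class CellTextHandler extends DefaultHandler {
    final List<String> cells = new ArrayList<>();
    private StringBuffer current;   // non-null only while inside a <cell>

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("cell".equals(qName)) {
            current = new StringBuffer();   // start a fresh buffer per cell
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append only; the parser may deliver one cell in several calls.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("cell".equals(qName) && current != null) {
            cells.add(current.toString());  // the cell text is complete here
            current = null;
        }
    }

    // Convenience helper for trying the handler on an XML string.
    static List<String> parse(String xml) {
        try {
            CellTextHandler h = new CellTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), h);
            return h.cells;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

With this, a value such as "BoM - MC" that arrives as "BoM - " followed by "MC" is still stored as one string, and each cell is processed separately in endElement rather than being concatenated with its neighbours.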
