
Java Forum / General / March 2007


Removing elements from large XML documents

Jakub Moskal - 28 Mar 2007 20:17 GMT
Hi,

I need to remove certain elements from the XML document tree based on
given parameters, e.g. I have a document with a structure as follows:

<country>
 <city>
   <street name="streetName" />
 </city>
</country>

and I want to remove all <country> nodes for which the street name is
"someName" (I know the example is lame, but it exposes my problem).

Initially I used DOM, and whenever I found a <street> element with a
name attribute that I don't want, I removed that country using:
root.removeChild(node.getParent().getParent().getParent()).

It worked just fine with small files, but problems occurred when I
started dealing with docs that are 10-60MB in size. DOM loads the
entire document tree into the memory and this solution doesn't scale
at all - on most computers I get memory issues. I don't want to go
into giving JVM more memory, because I don't feel that this is the
direction in which I should go about it - it's not a universal
solution.
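
For reference, the DOM approach described above might look roughly like the sketch below. It assumes the standard org.w3c.dom API (where the parent accessor is getParentNode()), and the class name DomCountryRemover and the method removeCountries are hypothetical. With the structure shown earlier, the <country> is two levels above the <street>. The whole tree is built in memory first, which is why this does not scale to large files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomCountryRemover {

    /** Parses the whole document into memory, then removes every matching <country>. */
    public static Document removeCountries(String xml, String unwanted) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // The NodeList is live, so collect the countries first and remove them afterwards.
            NodeList streets = doc.getElementsByTagName("street");
            List<Node> doomed = new ArrayList<>();
            for (int i = 0; i < streets.getLength(); i++) {
                Element street = (Element) streets.item(i);
                if (unwanted.equals(street.getAttribute("name"))) {
                    // street -> city -> country (assumes the nesting from the example)
                    Node country = street.getParentNode().getParentNode();
                    if (!doomed.contains(country)) {
                        doomed.add(country);
                    }
                }
            }
            for (Node country : doomed) {
                country.getParentNode().removeChild(country);
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not process XML", e);
        }
    }
}
```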

SAX parses the document in a serial fashion, and I can't find a way to
remove the great-grand-node of the current element with it. Processing
with XSLT works similarly to DOM, and memory issues occur.

Is there anything else out there that would help me solve this issue?
Would chopping the file into smaller pieces be a good solution?

Any help greatly appreciated,
Jakub.

Tom Hawtin - 28 Mar 2007 20:55 GMT
> SAX parses the document in a serial fashion, and I can't find a way to
> remove the great-grand-node of the current element with it. Processing
> with XSLT works similarly to DOM, and memory issues occur.

(Strictly, whether XSLT uses a DOM is implementation-dependent. There was
some talk of making Xalan work in a streaming mode several years ago,
but XSLT isn't seen as being as sexy as it once was.)

My suggestion is that when you hit a <country> element, you switch to a
temporary stream (StringWriter, say). When you find a <street> element
you don't want, you switch the output to a null stream. At the end of
the </country> element (or before) write the temporary stream to the
real output stream, and switch back.
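
A minimal sketch of this buffering idea follows, under a few assumptions: it uses StAX (javax.xml.stream) instead of raw SAX, it holds the current <country> as a list of events instead of a StringWriter, and "switching to a null stream" is done by simply discarding the buffer. The class name CountryFilter and the method filter are hypothetical, and the sketch assumes the <country> elements sit under some root element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

public class CountryFilter {

    /** Copies events from reader to writer, dropping each <country> that contains an unwanted street. */
    public static void filter(XMLEventReader reader, XMLEventWriter writer, String unwanted)
            throws XMLStreamException {
        List<XMLEvent> buffer = new ArrayList<>(); // events of the <country> being read
        int depth = 0;        // nesting depth inside the current <country>; 0 means outside
        boolean drop = false; // true once an unwanted street was seen in this <country>

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();

            if (depth == 0) {
                if (isStart(event, "country")) {
                    depth = 1;          // start buffering instead of writing
                    drop = false;
                    buffer.add(event);
                } else {
                    writer.add(event);  // outside any <country>: pass straight through
                }
                continue;
            }

            if (event.isStartElement()) {
                depth++;
                if (isStart(event, "street")) {
                    Attribute name = event.asStartElement().getAttributeByName(new QName("name"));
                    if (name != null && unwanted.equals(name.getValue())) {
                        drop = true;    // the "null stream": forget what was buffered
                        buffer.clear();
                    }
                }
            } else if (event.isEndElement()) {
                depth--;
            }

            if (!drop) {
                buffer.add(event);
            }
            if (depth == 0) {           // reached the matching </country>
                for (XMLEvent buffered : buffer) {
                    writer.add(buffered);
                }
                buffer.clear();
            }
        }
        writer.flush();
    }

    /** Convenience wrapper for small inputs; real use would pass file streams to the factories. */
    public static String filter(String xml, String unwanted) {
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            filter(reader, writer, unwanted);
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not process XML", e);
        }
    }

    private static boolean isStart(XMLEvent event, String localName) {
        return event.isStartElement()
                && event.asStartElement().getName().getLocalPart().equals(localName);
    }
}
```

Only one <country> is held in memory at a time, so memory use depends on the largest <country> rather than on the size of the file. For large documents the reader and writer would be created from a FileInputStream and FileOutputStream instead of strings.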

(I suggest not using RandomAccessFile to jump backwards, as it is
excessively slow.)

Tom Hawtin

