Java Forum / General / May 2007
How to extract info from a huge XML file?
sxshu02@gmail.com - 11 May 2007 18:03 GMT
Sorry, everyone, I'm new to Java. I want to first search within the XML file to find the right position, and then extract the corresponding XML item. How can I do that? I may be short on memory, and the performance should also be good. I read that SAX might handle this well, but its efficiency may not be good.
Thanks for helping!
Raghav - 11 May 2007 19:23 GMT
On May 11, 10:03 pm, "sxsh...@gmail.com" <sxsh...@gmail.com> wrote:
> Sorry ,everyone ,i'm new to Java.
> If i wannt first search with the xml file to get the right
[quoted text clipped - 6 lines]
>
> Thanks for helping !
You cannot store the context with SAX. You need to use a DOM parser and build the DOM in memory. To get to the right node, you might want to find the XPath of that node, which will be a String. There are third-party libraries to get the node at an XPath.
You can also write your own traversal class using NodeIterator or TreeWalker. HTH.
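A minimal sketch of the DOM-plus-XPath route described in this post. It uses the JDK's own javax.xml.xpath package rather than a third-party library, which is an assumption on my part; the class name XPathLookup, the method textAt and the sample document are hypothetical. Note that the whole document is loaded into memory before the lookup, which is the cost discussed later in the thread.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XPathLookup {

    /** Builds the full DOM tree, then returns the text of the first node at the XPath, or null. */
    static String textAt(String xml, String xpathExpr) {
        try {
            // The entire document is parsed into memory here.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(xpathExpr, doc, XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<foo><person><name>John Doe</name></person></foo>";
        System.out.println(textAt(xml, "/foo/person/name"));
    }
}
```

For a genuinely huge file this is the approach most likely to run out of memory, so treat it as a baseline for comparison with the streaming approaches.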
Lew - 12 May 2007 00:08 GMT
> On May 11, 10:03 pm, "sxsh...@gmail.com" <sxsh...@gmail.com> wrote:
>> Sorry ,everyone ,i'm new to Java.
[quoted text clipped - 5 lines]
>> I checked that the SAX might do it good , but the efficiency may be
>> not good.
SAX is incredibly efficient.
> You cannot store the context with SAX. You need to use a DOM parser
> and build the DOM in memory.
Not true. I've built many a SAX parser that kept track of context and didn't need to keep everything in memory all at once.
-- Lew
Seashor - 12 May 2007 15:21 GMT
I have considered the DOM, but I've been told that it "eats" the memory. Do you think it works?
Raghav wrote:
> On May 11, 10:03 pm, "sxsh...@gmail.com" <sxsh...@gmail.com> wrote:
> > Sorry ,everyone ,i'm new to Java.
[quoted text clipped - 17 lines]
> TreeWalker.
> HTH.
Lew - 12 May 2007 19:01 GMT
> I have considered the DOM, but I've been told that it "eats" the
> memory .
> You think it works?
Please do not top-post (placement of reply above material quoted).
DOM does indeed "eat" memory, more so the larger the document. SAX is fast, efficient and can be coded to be very parsimonious of memory.
And it does not have to lose context, despite misinformation provided earlier.
You should use SAX or StAX.
-- Lew
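To illustrate the claim that SAX does not have to lose context, here is a minimal sketch that keeps only a stack of the currently open element names and collects the text of every element found at a given path. The class name SaxPathExtractor, the path syntax and the sample data are hypothetical, and the text handling assumes the wanted elements contain plain text only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxPathExtractor extends DefaultHandler {
    private final String wantedPath;                          // e.g. "/foo/person/name"
    private final Deque<String> open = new ArrayDeque<>();    // names of the open elements, root first
    private final StringBuilder text = new StringBuilder();   // text seen since the last start tag
    private final List<String> matches = new ArrayList<>();

    SaxPathExtractor(String wantedPath) {
        this.wantedPath = wantedPath;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.addLast(qName);   // the context is just this stack, not the whole tree
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String currentPath = "/" + String.join("/", open);
        if (currentPath.equals(wantedPath)) {
            matches.add(text.toString().trim());
        }
        open.removeLast();
        text.setLength(0);
    }

    /** Streams through the XML once and returns the text of every element at wantedPath. */
    static List<String> extract(String xml, String wantedPath) {
        SaxPathExtractor handler = new SaxPathExtractor(wantedPath);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.matches;
    }
}
```

Memory use here grows with the nesting depth and the number of matches kept, not with the size of the file. For a real file you would pass an InputSource built from a file stream instead of a String.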
Seashor - 14 May 2007 09:12 GMT
> > I have considered the DOM, but I've been told that it "eats" the
> > memory .
[quoted text clipped - 10 lines]
>
> -- Lew
Sorry, top-posting is a habit on Chinese forums. I'm also new here.
Raghav - 14 May 2007 07:13 GMT
Hi Seashor,
DOM is memory-intensive because it builds the whole tree in memory. You can use a SAX parser today, but if you intend to retrieve multiple nodes in the future, your program becomes clumsy and maintenance becomes an issue.
On the other hand, if you have a DOM, you can pass a collection of XPaths and retrieve the corresponding nodes. Looking at performance, it's better to use SAX, but if you anticipate changes in requirements, it's safer to use DOM.
> I have considered the DOM, but I've been told that it "eats" the
> memory .
[quoted text clipped - 22 lines]
> > TreeWalker.
> > HTH.
Seashor - 14 May 2007 09:14 GMT
> Hi Seashor,
> DOM is memory intensive coz it builds the whole tree in memory.
[quoted text clipped - 33 lines]
> > > TreeWalker.
> > > HTH.
Thanks for the advice. I'm thinking of doing it via SAX. Although it's a stream-processing method, I'll build some kind of file index; I hope it works well.
Lew - 14 May 2007 21:27 GMT
> Thanks for advising. I'm thinking of doing it via SAX.
> Although it's a stream process method, I'll make some file index ,
> hope it can do well
File index? If you mean a numeric offset of character positions into the file, that could make your solution much more complex.
Just chain together polymorphic implementations of tag Handlers that are invoked on each tag entry. Have each one hold a reference to its enclosing-tag handler so you can pop it back into "currentHandler" on the tag exit.
XML is a strange bedfellow with file offsets. It's far, far better to stay within XML semantics when doing XML processing.
Just to hint at the SAX way, which nowadays is a bit old-fashioned in favor of StAX and things like the XMLStreamReader, you could use a ContentHandler for each tag:
<foo>
  <person>
    <name>John Doe</name>
  </person>
</foo>
You would declare an abstract handler class (AbstractFooHandler below) that implements ContentHandler by extending DefaultHandler, and has child classes for each tag, "foo", "person", "name", etc.
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public abstract class AbstractFooHandler extends DefaultHandler {
    // State shared by every handler in the chain.
    public static final class Context {
        XMLReader parser;
    }

    private Context context;
    public final Context getContext() { return context; }
    public final void setContext( Context ctx ) { this.context = ctx; }

    // Handler of the enclosing tag, restored when this tag closes.
    private AbstractFooHandler encloser;
    protected final AbstractFooHandler getEncloser() { return encloser; }
    protected final void setEncloser( AbstractFooHandler fh ) { this.encloser = fh; }
}
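import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class FooParser {
    // createXMLReader() and parse() throw checked exceptions, so main declares them.
    public static void main( String [] args ) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();

        InputSource is = createInputSource( args ); // however you do it

        // Share the parser so that handlers can swap themselves in and out.
        AbstractFooHandler.Context ctx = new AbstractFooHandler.Context();
        ctx.parser = parser;

        AbstractFooHandler fh = new FooHandler();
        fh.setContext( ctx );

        parser.setContentHandler( fh );
        parser.parse( is );
    }
}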
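import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

public class FooHandler extends AbstractFooHandler {
    @Override
    public void startElement( String uri, String localName, String qName, Attributes attributes )
            throws SAXException {
        if ( localName.equals( "foo" )) {
            // Root tag: this handler is already in charge, nothing to do.
        } else if ( localName.equals( "person" )) {
            // Hand control to the nested tag's handler until "person" closes.
            AbstractFooHandler afh = new PersonHandler();
            afh.setContext( getContext() );
            afh.setEncloser( this );
            getContext().parser.setContentHandler( afh );
        } else {
            throw new SAXException( "Illegal tag \""+ localName +"\"." );
        }
    }
}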
The endElement() callback of PersonHandler would then detect the closing "person" tag and replace the current handler with its own encloser. endElement() at every level will emit the events that you want to happen in response to the XML.
I hard-coded a few things in this example, which is a Bad Thing, but doing it properly would have been too long for a newsgroup post. I'd keep a Map of Handlers keyed by tags instead of hardcoding the tag and its handler. This is most definitely not an SSCCE.
This will let you keep track of where you are and process your file in one pass, keeping in memory only what each handler emits as necessary to keep in memory. No file offsets, either.
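For readers who want something that runs, here is a compact, self-contained sketch of the same handler-chaining idea. It is not the code from this post: the PersonHandler body, the shared names list, the class name HandlerChainSketch and the use of SAXParserFactory to obtain the XMLReader are all assumptions made to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class HandlerChainSketch {

    /** Base handler: shares the reader and remembers the enclosing tag's handler. */
    abstract static class TagHandler extends DefaultHandler {
        final XMLReader reader;
        final TagHandler encloser;   // null for the outermost handler
        final List<String> names;    // what the handlers emit (hypothetical output)

        TagHandler(XMLReader reader, TagHandler encloser, List<String> names) {
            this.reader = reader;
            this.encloser = encloser;
            this.names = names;
        }
    }

    /** Handles the root "foo" tag and delegates each "person" to its own handler. */
    static final class FooHandler extends TagHandler {
        FooHandler(XMLReader reader, List<String> names) {
            super(reader, null, names);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (localName.equals("person")) {
                reader.setContentHandler(new PersonHandler(reader, this, names));
            } else if (!localName.equals("foo")) {
                throw new SAXException("Illegal tag \"" + localName + "\".");
            }
        }
    }

    /** Collects the text of "name" and restores the encloser when "person" closes. */
    static final class PersonHandler extends TagHandler {
        private final StringBuilder text = new StringBuilder();

        PersonHandler(XMLReader reader, TagHandler encloser, List<String> names) {
            super(reader, encloser, names);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);   // start collecting text for the tag that just opened
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (localName.equals("name")) {
                names.add(text.toString().trim());
            } else if (localName.equals("person")) {
                reader.setContentHandler(encloser);   // pop back to the enclosing handler
            }
        }
    }

    /** Parses the XML in one pass and returns every person's name. */
    static List<String> parseNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);   // so that localName is populated
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new FooHandler(reader, names));
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }
}
```

Only one handler is active at any moment, and the only state retained is the list of names, so the memory footprint stays small regardless of how many "person" elements the file contains.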
-- Lew
Seashor - 15 May 2007 09:50 GMT
> > Thanks for advising. I'm thinking of doing it via SAX.
> > Although it's a stream process method, I'll make some file index ,
[quoted text clipped - 111 lines]
> --
> Lew
Thanks a lot! It helps me so much.
Philipp Taprogge - 11 May 2007 21:05 GMT
Hi!
> I checked that the SAX might do it good , but the efficiency may be
> not good.
An alternative approach could be StAX [JSR 173]. It is a stream-based XML API that is basically event-driven, allowing you to sort of "react" to things like encountering a begin or end tag. One implementation is Woodstox, which can be found at http://woodstox.codehaus.org/
HTH,
Phil
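A minimal sketch of the StAX approach, using the javax.xml.stream API that ships with current JDKs (with Woodstox as a possible drop-in implementation). The class name StaxNameReader and the choice of the "name" element are hypothetical. Unlike SAX callbacks, the code here pulls each event from the reader in an ordinary loop.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxNameReader {

    /** Streams through the XML and returns the text of every "name" element. */
    static List<String> readNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();   // pull the next parsing event
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "name".equals(reader.getLocalName())) {
                        // Reads the element's text and advances to its end tag.
                        names.add(reader.getElementText().trim());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }
}
```

Because the loop is in your own code, you can stop reading as soon as the item you are searching for has been found, which suits the "search, then extract" use case from the original question.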