
Java Forum / First Aid / September 2004


quick and easy way to parse XML

luca passani - 27 Sep 2004 17:21 GMT
I am building a servlet that needs to parse XHTML files (with DTD and everything),
in order to figure out the link to the pictures (<img src="getmeifyoucan.gif" />)

I thought I had already solved the problem elegantly when I realized that
the package to parse XML would automatically open a connection to
a website on the internet to retrieve the DTD!
Since this happens at every request to the servlet, this behaviour
is unacceptable for my application.

Apparently, there is no simple way to disable this behavior, since
the XML spec demands that the DTD is retrieved.
I tried to treat the XML as a string and remove the DTD reference,
but, unfortunately, the library will fail if an entity is encountered
(&nbsp; for example).

I am puzzled. If I treat the XML as a string, String methods and
regexps are hardly powerful enough to achieve the task.
On the other hand, XML parsing turns out to introduce
even more problems than I am trying to solve (as an aside, wasn't
XML supposed to be simple?)

Is there an easy way to achieve my goal?  XML parsing or regexps?

thanks

Luca
Paul Lutus - 27 Sep 2004 17:33 GMT
> I am building a servlet that needs to parse XHTML files (with DTD and
> everything), in order to figure out the link to the pictures (<img
[quoted text clipped - 14 lines]
> I am puzzled. If I treat the XML as a string, String methods and
> regexps are hardly powerful enough to achieve the task.

On the contrary. Those methods are more than powerful enough to handle the
described task. How do I know? That is how the class responsible for this
task does it.

> On the other hand, XML parsing turns up to introduce
> even more problems than I am trying to solve

Name them.

> (as an aside, wasn't
> XML supposed to be simple?)

No, that is a myth. XML is supposed to eliminate unnecessary duplication and
provide a way to standardize data structures. If the data structures are
complex, so is the XML representation.

> Is there an easy way to achieve my goal?  XML parsing or regexps?

What can I say? Yes? XML parsing and regular expressions seem to be part of
the same topic.

Signature

Paul Lutus
http://www.arachnoid.com

luca - 28 Sep 2004 10:24 GMT
 > On the contrary. Those methods are more than powerful enough to handle the
> described task. How do I know? That is how the class responsible for this
> task does it.

Which class? How do you handle:

<img
 src="pippo"
/>

>>On the other hand, XML parsing turns up to introduce
>>even more problems than I am trying to solve
>
> Name them.

I just did. The stupid parser tries to open an HTTP connection to retrieve the DTD.

>>(as an aside, wasn't
>>XML supposed to be simple?)
>
> No, that is a myth.

If this is a myth, it is one that the XML industry has contributed
to fuel. Have a look at the first line of:

http://www.w3.org/XML/

"Extensible Markup Language (XML) is a simple, very flexible text format
derived from SGML"

> XML is supposed to eliminate unnecessary duplication and
> provide a way to standardize data structures. If the data structures are
> complex, so is the XML representation.

The problem is that a lot of complexity is there even for mega-simple stuff.

>>Is there an easy way to achieve my goal?  XML parsing or regexps?
>
> What can I say? Yes? XML parsing and regular expressions seem to be part of
> the same topic.

So, how do you handle:

<img
 src="pippo"
/>

with regexps in Java?

Luca
Stefan Schulz - 28 Sep 2004 12:42 GMT
>>> On the other hand, XML parsing turns up to introduce
>>> even more problems than I am trying to solve
>>   Name them.
>
> I just did. The stupid parser try to open an HTTP connection to retrieve  
> the DTD

Use a non-validating Parser then. :)

Signature

Whom the gods wish to destroy they first call promising.

luca - 28 Sep 2004 14:46 GMT
> Use a non-validating Parser then. :)

Which one? even SAX goes for the DTD!!!

Also, be careful, because what I found out by discussing
with XML gurus is that even non-validating parsers are required
to go after the DTD if they see one according to XML specs!!!!!

Luca
Tor Iver Wilhelmsen - 28 Sep 2004 19:06 GMT
> Which one? even SAX goes for the DTD!!!

What does your DOCTYPE look like?

Try setting a custom EntityResolver that doesn't return null:

http://www.saxproject.org/apidoc/org/xml/sax/EntityResolver.html
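
A minimal sketch of the EntityResolver suggestion, assuming the JDK's bundled SAX parser and a document that uses no DTD-declared entities. The class name OfflineSaxImgScanner and the method imgSources are hypothetical names chosen for illustration. The resolver hands the parser an empty DTD instead of null, so the parser has nothing to fetch over HTTP:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects img/@src values without fetching the DTD.
final class OfflineSaxImgScanner {

    static List<String> imgSources(String xhtml) {
        List<String> sources = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Returning a non-null, empty source stands in for the external
                // DTD; returning null would let the parser open the URL itself.
                return new InputSource(new StringReader(""));
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // The default factory is not namespace-aware, so qName is "img".
                if ("img".equals(qName)) {
                    String src = attrs.getValue("src");
                    if (src != null) {
                        sources.add(src);
                    }
                }
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // SAXParser.parse registers the handler as the EntityResolver too.
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return sources;
    }
}
```

Because the substituted DTD is empty, a reference such as &nbsp; would still be reported as undeclared. For documents that use such entities, the resolver would probably need to return a locally stored copy of the DTD instead of an empty one.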
Stefan Schulz - 28 Sep 2004 12:55 GMT
> So, how do you handle:
>
> <img
>   src="pippo"
> />

Off the top of my head:

"<\p{Space}*img\p{Space}+src=\"\p{Graph}+\"\p{Space}*>" should match pretty
much any img tag that has no alt, height, etc. attributes. To add them,
look at the alternation operator (it is the | ).

Signature

Whom the gods wish to destroy they first call promising.

luca - 28 Sep 2004 14:49 GMT
>  From the top of my head:
>
> "<\p{Space}*img\p{Space}+src=\"\p{Graph}+\"\p{Space}*>" should match pretty
> any img tag that has no alts, height etc attributes. How to add them...  
> look at the
> alternative Operator (It is the | )

but this is not good enough for me (this is why I went for XML parsing
in the first place). All I know about my mark-up is that it's well-formed,
but I don't know anything about the order or the availability
of other attributes:

<img
 src="pippo"
/>

<img alt="pippo"
  height="25"
 src="pippo"
/>

<img src="pippo" alt="pippo"
  height="25" />

<img height="35" src="pippo" />

These are all valid. BTW the XML guys claimed confidently that RegExps
are, generally speaking, not powerful enough to parse XML!

Luca
Stefan Schulz - 28 Sep 2004 15:14 GMT
>>  From the top of my head:
>>  "<\p{Space}*img\p{Space}+src=\"\p{Graph}+\"\p{Space}*>" should match  
[quoted text clipped - 8 lines]
> but I don't know anything about the order or the availability
> of other attributes:

Well, in that case do what I said: within the tag, make an alternative of
all the possible attributes (refer to the DTD for the list of allowed
attributes).

> this are all good. BTW the XML guys claimed confidently that RegExps
> are, generally speaking, not powerful enough to parse XML!

Generally speaking, this is true. In this particular case, however, you can
do it, since the only thing XML can do that regular expressions cannot is
build trees.

img tags, however, are necessarily leaves on the document tree.
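
A rough sketch of that idea in Java, under stated assumptions. Instead of listing every allowed attribute as an alternative, this variant skips over any attributes with [^>]*, which is a swapped-in simplification and not the exact pattern described above. The class name ImgSrcRegex and the method imgSources are hypothetical. It assumes well-formed markup in which no attribute value contains a literal '>' and no img-like text appears inside comments or CDATA sections:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls src values out of img tags with a regex.
final class ImgSrcRegex {

    // <img, then any other attributes (lazily), then whitespace + src="..."
    // or src='...', then the rest of the tag up to the closing '>'.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')[^>]*>");

    static List<String> imgSources(String markup) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(markup);
        while (m.find()) {
            // Group 1 holds a double-quoted value, group 2 a single-quoted one.
            sources.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return sources;
    }
}
```

Since [^>] also matches line breaks, the attribute order and the line layout of the tag do not matter here, which covers the four variants quoted earlier in the thread.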

Signature

Whom the gods wish to destroy they first call promising.

Daniel Sjöblom - 28 Sep 2004 16:00 GMT
> BTW the XML guys claimed confidently that RegExps
> are, generally speaking, not powerful enough to parse XML!

They aren't. Parsing XML requires a stack (or more precisely, the parser
needs to remember all the previous states that led to the current
state.) Regular languages can be parsed without remembering state.

However, some of the available regular expression packages contain
constructs that are quite a bit more powerful than real regular expressions.
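
To illustrate why a stack is needed, here is a small sketch that checks whether tags are properly nested. The regex only tokenizes individual tags; the nesting itself is tracked with an explicit stack. The class name TagBalanceChecker is hypothetical, and the tokenizer is deliberately naive (it assumes no '>' inside attribute values and ignores comments, CDATA, and processing instructions):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: verifies that open and close tags nest correctly.
final class TagBalanceChecker {

    // Group 1: "/" for a closing tag, group 2: tag name,
    // group 3: "/" for a self-closing tag such as <img ... />.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][\\w:.-]*)[^>]*?(/?)>");

    static boolean isBalanced(String markup) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(markup);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            String name = m.group(2);
            if (selfClosing) {
                continue; // a leaf element: nothing to remember
            }
            if (closing) {
                // The closing tag must match the most recently opened one.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else {
                open.push(name);
            }
        }
        return open.isEmpty();
    }
}
```

The stack is exactly the "memory of previous states" mentioned above: a single regular expression in the strict sense cannot track arbitrarily deep nesting, whereas leaf elements such as img never touch the stack at all.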

Signature

Daniel Sjöblom
Remove _NOSPAM to reply by mail

Paul Lutus - 28 Sep 2004 16:51 GMT
>   > On the contrary. Those methods are more than powerful enough to handle
>   > the
[quoted text clipped - 6 lines]
>   src="pippo"
> />

To what does "you" refer? Existing classes, or your own classes? The answer
in both cases is "easily", but that is beside the point.

>>>On the other hand, XML parsing turns up to introduce
>>>even more problems than I am trying to solve
[quoted text clipped - 3 lines]
> I just did. The stupid parser try to open an HTTP connection to retrieve
> the DTD

How long will this take? I already told you -- write your own parsing class.

>>>(as an aside, wasn't
>>>XML supposed to be simple?)
[quoted text clipped - 8 lines]
> "Extensible Markup Language (XML) is a simple, very flexible text format
> derived from SGML"

And simple languages can be used to convey complex ideas. If that were not
true, the language would be abandoned.

>> XML is supposed to eliminate unnecessary duplication and
>> provide a way to standardize data structures. If the data structures are
>> complex, so is the XML representation.
>
> the problem is that a lot of complexity is there also for mega-simple
> stuff.

No, not really. Simple tasks can be handled using simple XML. Complex tasks
require complex XML.

>>>Is there an easy way to achieve my goal?  XML parsing or regexps?
>>
[quoted text clipped - 8 lines]
>
> with regexps in Java?

Trivially:

String result = original.replaceAll("\\n+"," ");

Working example:

public class Test {

    public static void main(String[] args) {
        // Multi-line img tag, as in the question
        String a = "<img\n"
                + "src=\"pippo\"\n"
                + "/>";
        // Collapse each run of newlines into a single space
        String b = a.replaceAll("\\n+", " ");
        System.out.println(a + " -> " + b);
    }
}

Result:

<img
src="pippo"
/> -> <img src="pippo" />

Wow, that was really hard!

Signature

Paul Lutus
http://www.arachnoid.com

jmm-list-gn - 28 Sep 2004 19:12 GMT
> Which class? How do you handle:
>
> <img
>  src="pippo"
> />

  Use the getAttribute() method for Element (org.w3c.dom).
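
A minimal sketch of the getAttribute() approach, assuming the JDK's bundled Xerces-based parser. The class name DomImgSources and the method imgSources are hypothetical. The load-external-dtd feature is Xerces-specific and may not be recognized by other parser implementations, in which case setFeature throws an exception:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: reads img/@src values through the DOM.
final class DomImgSources {

    // Xerces-specific feature (an assumption about the parser in use):
    // when false, a non-validating parser does not load the external DTD.
    private static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static List<String> imgSources(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(LOAD_EXTERNAL_DTD, false);

            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Attribute order and line breaks inside the tag are irrelevant here.
            NodeList imgs = doc.getElementsByTagName("img");
            List<String> sources = new ArrayList<>();
            for (int i = 0; i < imgs.getLength(); i++) {
                Element img = (Element) imgs.item(i);
                if (img.hasAttribute("src")) {
                    sources.add(img.getAttribute("src"));
                }
            }
            return sources;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```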

Signature

jmm dash list (at) sohnen-moe (dot) com
(Remove .AXSPAMGN for email)


