Java Forum / First Aid / March 2008

Parsing HTML

m_gallivan12@hotmail.com - 05 Mar 2008 00:42 GMT
I've Googled this for a while and have come up relatively empty-
handed.  I've just begun to learn JAVA and I've decided to embark on
my first useful project - an HTML parser to retrieve names of players
on a website.

That being said, I have no idea where to begin.  There seem to be no
tutorials on the subject (for JAVA anyway) and the JAVA docs were not
of much help.

I was looking for something like javaParse.retrieveURL(URL) or some
kind of function, but is this too simplistic an approach?
Carl - 05 Mar 2008 01:04 GMT
> I've Googled this for awhile and have come up relatively empty-
> handed.  I've just begun to learn JAVA and I've decided to embark on
[quoted text clipped - 7 lines]
> I was looking for something like javaParse.retrieveURL(URL) or some
> kind of function, but is this too simplistic an approach?

http://java.sun.com/docs/books/tutorial/networking/urls/readingURL.html

Depending on your parsing requirements, you may also want to have a
look at:
http://jerichohtml.sourceforge.net/doc/index.html
The links at the bottom point to other HTML parsing libs; for
a simple starter project though, you may be able to simply parse the
HTML using regular expressions or string searching...

Hope that helps.
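
To make the regular-expression suggestion concrete, here is a minimal sketch. It assumes (purely for illustration) that each player name sits in a cell like <td class="player">Name</td>; the class name PlayerNameExtractor and that markup are made up, so the pattern would need adjusting to the real page. Regex matching like this is fragile on messy HTML, which is where a library such as Jericho comes in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlayerNameExtractor {
    // Hypothetical markup: assumes names appear as <td class="player">Name</td>.
    private static final Pattern PLAYER_CELL = Pattern.compile(
            "<td\\s+class=\"player\"\\s*>([^<]+)</td>", Pattern.CASE_INSENSITIVE);

    // Returns the text of every matching cell, trimmed, in document order.
    static List<String> extractNames(String html) {
        List<String> names = new ArrayList<String>();
        Matcher m = PLAYER_CELL.matcher(html);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        String sample = "<table><tr><td class=\"player\">Alice Example</td></tr>"
                + "<tr><td class=\"player\"> Bob Sample </td></tr></table>";
        for (String name : extractNames(sample)) {
            System.out.println(name);
        }
    }
}
```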
Mark Space - 05 Mar 2008 01:07 GMT
> I've Googled this for awhile and have come up relatively empty-
> handed.  I've just begun to learn JAVA and I've decided to embark on
[quoted text clipped - 7 lines]
> I was looking for something like javaParse.retrieveURL(URL) or some
> kind of function, but is this too simplistic an approach?

Well, I think that's a bit off, but not much.  Try here for URLs at least.  I'm
having trouble seeing any good, recent tutorials on parsing XML.

<http://java.sun.com/docs/books/tutorial/networking/urls/index.html>
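
For reference, the core of what that tutorial covers can be sketched in a few lines using only java.net and java.io. The class name PageFetcher is made up for this example, and the sketch assumes the page is UTF-8 encoded, which a real site may not be.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    // Reads the whole resource at the given address into one String.
    // Assumption: the content is UTF-8 text.
    static String fetch(String address) throws IOException {
        URL url = URI.create(address).toURL();
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageFetcher <url>");
            return;
        }
        System.out.print(fetch(args[0]));
    }
}
```

The returned String is the raw HTML, which can then be handed to whatever parsing approach you settle on.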
m_gallivan12@hotmail.com - 05 Mar 2008 02:54 GMT
On Mar 4, 5:42 pm, m_galliva...@hotmail.com wrote:
> I've Googled this for awhile and have come up relatively empty-
> handed.  I've just begun to learn JAVA and I've decided to embark on
[quoted text clipped - 7 lines]
> I was looking for something like javaParse.retrieveURL(URL) or some
> kind of function, but is this too simplistic an approach?

Thanks for your replies!

Is there a similar way to parse JAVA/PHP/etc. and/or fill in forms?

Sorry for so many questions, are there any good sites for this sort of
thing (doesn't have to be JAVA specific) where I can just browse
through docs?
Mark Space - 05 Mar 2008 03:36 GMT
> Is there a similar way to parse JAVA/PHP/etc. and/or fill in forms?

For filling forms, perhaps a web architecture like Struts or JSF would
be what you are looking for?  Rather than parse the HTML, let the
container and various libraries do it for you.

http://www.javapassion.com/j2ee/

<http://www.javapassion.com/j2ee/#Struts_Basics>

<http://www.javapassion.com/j2ee/#Building_Hello_World_JSF_application>

> Sorry for so many questions, are there any good sites for this sort of
> thing (doesn't have to be JAVA specific) where I can just browse
> through docs?

See the full J2EE course in that first link above.  It's got pointers to
Sun's docs and tutorials with each lesson as appropriate.  It's a very
effective way to learn.

Now, is that what you want?  Or are you looking for something totally
different?
m_gallivan12@hotmail.com - 05 Mar 2008 04:30 GMT
> m_galliva...@hotmail.com wrote:
> > Is there a similar way to parse JAVA/PHP/etc. and/or fill in forms?
[quoted text clipped - 19 lines]
> Now, is that what you want?  Or are you looking for something totally
> different?

Honestly - I don't know.  That subject looks a bit too advanced for
me.  I think I've found an approach using Python which might allow me
to transfer some knowledge over to JAVA later on.  Thanks for your
time.
Mark Space - 05 Mar 2008 18:01 GMT
>> Now, is that what you want?  Or are you looking for something totally
>> different?
[quoted text clipped - 3 lines]
> to transfer some knowledge over to JAVA later on.  Thanks for your
> time.

Well, the basics of J2EE are pretty simple, but I think perhaps this
means you really are looking for something else.

Then I think the easiest way to query XML in Java is XPath.  It's lightweight
on memory and doesn't require a lot of interface classes.  You can get
an XPath object as easily as

XPath xpath = XPathFactory.newInstance().newXPath();

Then to parse

String result = xpath.evaluate("search/field", source );

For some variable name "source" that already exists.  I got this info
out of Learning Java btw, a book by O'Reilly.  I had to stop being lazy
and get up and get it off the bookshelf.  Sometimes a good book is
better than Google, and this is definitely one of those books.  I highly
recommend Learning Java.
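
Put together as something that compiles, the two lines above might look like the sketch below. The class name XPathSketch and the sample document are made up for illustration; "source" is assumed here to be an org.xml.sax.InputSource wrapping a string. Note that XPath works on well-formed XML, so ordinary real-world HTML will often be rejected unless it is cleaned up first.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathSketch {
    // Evaluates "search/field" against the document and returns the text
    // of the first matching node (empty string if nothing matches).
    static String firstField(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        InputSource source = new InputSource(new StringReader(xml));
        return xpath.evaluate("search/field", source);
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Made-up sample document for illustration.
        String xml = "<search><field>Alice Example</field></search>";
        System.out.println(firstField(xml));
    }
}
```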
Roedy Green - 06 Mar 2008 02:38 GMT
>Is there a similar way to parse JAVA/PHP/etc. and/or fill in forms?

1. You can parse the HTML to create the form with a parser, e.g.
JavaCC.  See http://mindprod.com/jgloss/parser.html
http://mindprod.com/jgloss/javacc.html
There will be no data in it.

2. You can parse the HTTP GET/POST message the browser sends to your
servlet with Servlet methods.  To learn about the messages see
http://mindprod.com/jgloss/http.html
http://mindprod.com/jgloss/cgi.html
http://mindprod.com/jgloss/htmlcheat.html#FORMS
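
To give a feel for what those GET/POST messages carry, here is a small sketch of how form fields are typically encoded (application/x-www-form-urlencoded) before a browser sends them. It uses only java.net.URLEncoder; the class name FormEncoder and the field names are made up for illustration, and this is not taken from the pages linked above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormEncoder {
    // Builds a "key=value&key=value" string, URL-encoding each part as UTF-8.
    static String encode(Map<String, String> fields) throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(field.getKey(), "UTF-8"));
            body.append('=');
            body.append(URLEncoder.encode(field.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Hypothetical form fields; LinkedHashMap keeps them in insertion order.
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("name", "Alice Example");
        fields.put("team", "A&B");
        System.out.println(encode(fields));
    }
}
```

For a GET request this string goes after the "?" in the URL; for a POST it becomes the request body.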
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Roedy Green - 06 Mar 2008 02:33 GMT
>I've Googled this for awhile and have come up relatively empty-
>handed.  I've just begun to learn JAVA and I've decided to embark on
[quoted text clipped - 4 lines]
>tutorials on the subject (for JAVA anyway) and the JAVA docs were not
>of much help.

I have written three special purpose HTML parsers. One is designed to
colorise it.  See http://mindprod.com/jgloss/jdisplay.html

Another is designed to compact it.  See
http://mindprod.com/products1.html#COMPACTOR

Both use finite state automatons.  See
http://mindprod.com/jgloss/finitestate.html

The third is a very simple case loop.  It strips HTML tags.
See http://mindprod.com/products1.html#ENTITIES
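
For anyone curious what a "case loop" tag stripper can look like, here is a minimal two-state sketch. It is not the code behind the link above, just an illustration of the idea; the class name TagStripper is made up, and it deliberately ignores comments, scripts, entities, and ">" characters inside quoted attribute values.

```java
class TagStripper {
    private enum State { TEXT, TAG }

    // Copies everything outside <...> to the output and drops the rest.
    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        State state = State.TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '<') {
                        state = State.TAG;   // entering a tag, stop copying
                    } else {
                        out.append(c);
                    }
                    break;
                case TAG:
                    if (c == '>') {
                        state = State.TEXT;  // tag closed, resume copying
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p>"));
    }
}
```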

I get pleasure writing these things, perhaps similar to the pleasure
others get from crossword puzzles.

Normally you would use some sort of parser.  See
http://mindprod.com/jgloss/parser.html

There are many grammars all done that you can use out of the box or that
you can modify.

The biggest catch is dealing with imperfect HTML.  You might want to
run it through a Validator first.

See http://mindprod.com/jgloss/htmlvalidator.html

--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

