Java Forum / General / April 2006

Suggestions on Parsing HTML

burgermeister01@gmail.com - 10 Apr 2006 03:15 GMT
Hi,

I'm working on a school project, and I was hoping to get some
suggestions from the group. As part of the project I need a program that
can go to dictionary.com, look up a word the user specifies, and
return the definition. So far I've figured out how to pull data from a
URL, and I can get a page's HTML code no problem. The next step,
displaying the text, is what is giving me a problem. How can I rip out
just the HTML that I want and leave all the rest behind? So far, my
best idea is just to use some very clever and meticulous text parsing,
but that seems tedious and unreliable (what if dictionary.com makes a
change to their HTML code?). Is there an easier way that I don't know
of? Keep in mind that I have to be able to display this text on both a
command line and a GUI, so if Java has some kind of built-in HTML
reader, that would only half work.
Roedy Green - 10 Apr 2006 03:53 GMT
> The next step,
> displaying the text, is what is giving me a problem. How can I rip out
> just the HTML that I want and leave all the rest behind?

The tools you have are indexOf, substring and regex.
You also might look at http://mindprod.com/products1.html#ENTITIES
to get rid of tags and entities.
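
For anyone following along, here is a minimal sketch of the indexOf/substring/regex approach. It is not the library linked above. The marker strings and the sample HTML are made up for illustration; a real page would need markers chosen by looking at its source, and they will break if the site changes its layout.

```java
import java.util.regex.Pattern;

class DefinitionExtractor {
    // Matches anything that looks like a tag, e.g. <b> or </td>.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /**
     * Returns the text between the first startMarker and the next
     * endMarker, or null if either marker is missing.
     */
    static String between(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        start += startMarker.length();
        int end = html.indexOf(endMarker, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    /**
     * Crude cleanup: deletes tags, decodes a handful of common entities,
     * and collapses runs of whitespace. Not a full HTML parser.
     */
    static String stripTags(String fragment) {
        String text = TAG.matcher(fragment).replaceAll("");
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&"); // decode &amp; last to avoid double decoding
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical page content; the comment markers are assumptions.
        String html = "<html><body><div id=\"nav\">Home</div>"
                + "<!-- begin definition --><b>dogma</b>: a <i>principle</i>"
                + " laid down as true &amp; binding<!-- end definition -->"
                + "</body></html>";
        String fragment = between(html, "<!-- begin definition -->",
                "<!-- end definition -->");
        System.out.println(stripTags(fragment));
    }
}
```

The cleaned string is plain text, so the same result can be printed on the command line or put into a GUI text component.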

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

burgermeister01@gmail.com - 10 Apr 2006 21:47 GMT
Thanks, that library looks as though it's really going to make my life
easier. Also, just in case by some fluke someone is looking to do the
same thing as me, it seems as though Merriam-Webster's website is
easier to work with. I also expect String.split to be useful in
addition to indexOf, etc.
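
A small sketch of how String.split might fit in, assuming (hypothetically) that the extracted fragment separates its senses with <br> tags. Note that String.split takes a regular expression, not a literal string.

```java
import java.util.ArrayList;
import java.util.List;

class SenseSplitter {
    /**
     * Splits a fragment on <br> tags (any case, optional slash) and
     * returns the trimmed, non-empty pieces in order.
     */
    static List<String> splitOnBreaks(String fragment) {
        List<String> senses = new ArrayList<>();
        for (String piece : fragment.split("(?i)<br\\s*/?>")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                senses.add(trimmed);
            }
        }
        return senses;
    }

    public static void main(String[] args) {
        // Made-up fragment for illustration only.
        String fragment = "1. a belief<br>2. a doctrine<BR />";
        for (String sense : splitOnBreaks(fragment)) {
            System.out.println(sense);
        }
    }
}
```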
Oliver Wong - 10 Apr 2006 22:39 GMT
> Thanks, that library looks as though it's really going to make my life
> easier. Also, just in case by some fluke someone is looking to do the
> same thing as me, it seems as though Merriam-Webster's website is
> easier to work with. I also expect String.split to be useful in
> addition to indexOf, etc.

   If you're open to alternative dictionaries, look for one with an open
API. I know Gnome has a widget that allows you to place a dictionary in the
toolbar. You might want to find out which API they're using and use it as
well. You might be able to avoid dealing with HTML altogether if you use an
API (you'd be dealing with XML instead), and the service provider is less
likely to change the HTML formatting if they've published the API openly.
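
As a rough illustration of the XML route, here is a sketch that uses the DOM parser shipped with the JDK. The element names (entry, word, definition) are invented for the example; a real dictionary API would document its own response format, and fetching the response over HTTP is left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmlDefinitionReader {
    /**
     * Returns the trimmed text of every <definition> element in the
     * given XML. The element name is an assumption for this sketch.
     */
    static List<String> definitions(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("definition");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical response body.
        String xml = "<entry><word>dogma</word>"
                + "<definition>a belief held to be true</definition>"
                + "<definition>a body of doctrine</definition></entry>";
        for (String definition : definitions(xml)) {
            System.out.println(definition);
        }
    }
}
```

Because the code asks for elements by name rather than by position in the text, cosmetic changes to the response are less likely to break it than with hand-rolled string searching.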

   Another thing you might try is using the Google websearch API. In
"normal" Google, if you prefix a search query with "define:", you'll get the
definition of the word, instead of pages which contain the word as keywords.
E.g. "define:dogma" gives you definitions for the word "dogma". Maybe this
facility is also accessible via Google's search API. The API developer kit
contains sample programs in Java.

http://www.google.com/apis/

   - Oliver
Jon Martin Solaas - 10 Apr 2006 07:55 GMT
> Hi,
>
[quoted text clipped - 11 lines]
> command line and a GUI so if Java has some kind of built-in HTML
> reader, that would only half work.

There is a simple HTML parser in Swing (of all places ...). More
advanced ones surely exist, but it's easy to use and is part of the runtime
library.

http://java.sun.com/products/jfc/tsc/articles/bookmarks/index.html

If the web pages change you're still stuck. Maybe the site has some
interface for third parties?
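
A minimal sketch of the Swing parser mentioned above, using ParserDelegator with an HTMLEditorKit.ParserCallback that collects the text runs. It only gathers text; picking out just the definition would still need extra logic (for example, tracking which tags are open in handleStartTag), which is not shown. The sample HTML is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTextExtractor {
    /**
     * Feeds the HTML to Swing's built-in parser and joins every text
     * run it reports with a single space.
     */
    static String visibleText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(data);
            }
        };
        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><p><b>dogma</b>: a principle"
                + " laid down as true</p></body></html>";
        System.out.println(visibleText(html));
    }
}
```

No GUI component is created here, so the extracted string can be used from a command-line program as well as a Swing one.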

