Java Forum / General / July 2004

Java HTML Parser

Anony! - 22 Jul 2004 08:48 GMT
Hi

2 questions:

1. I'm looking for a Java HTML parser. I realize that the Java Swing HTML
parser is one option I could use, but I would like some other
opinions/alternatives.

2. I am hoping to parse a batch of HTML Web pages. I believe it should be
relatively easy to do a single HTML page, but any tips for multiple HTML
pages? how will the parser know to go to the next HTML page? I have like
thousands of HTML pages to parse.

Any help appreciated.

Regards
AaA
Luca Paganelli - 22 Jul 2004 09:38 GMT
> 2. I am hoping to parse a batch of HTML Web pages. I believe it should be
> relatively easy to do a single HTML page, but any tips for multiple HTML
> pages? how will the parser know to go to the next HTML page? I have like
> thousands of HTML pages to parse.

I don't think the parser would go to the 'next HTML page' automatically.
However, you can look for any linked pages in the parsed document and
then start parsing those new pages.
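
A rough sketch of the "look for linked pages" step, using the Swing HTML parser mentioned in the original question. This is not code from the thread; the class name LinkExtractor and the method extractLinks are made up for illustration. It only collects the href values of <a> tags and does not resolve relative links or fetch anything.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the targets of all <a href="..."> tags in a page.
class LinkExtractor {

    /** Returns the href values found in the given HTML, in document order. */
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();

        // The Swing parser reports each start tag to this callback.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }
}
```

Each returned link could then be queued and parsed in turn, which is the loop Luca describes.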

Luca Paganelli
Anony! - 22 Jul 2004 09:45 GMT
> "Luca Paganelli" wrote:
> > 2. I am hoping to parse a
[quoted text clipped - 19 lines]
>
> Luca Paganelli

There are no links in those HTML pages. And yes I want something that will
automatically parse the next HTML file in a given directory.

AaA
Markus Schaber - 22 Jul 2004 11:48 GMT
Hi, Anony,

> There are no links in those HTML pages. And yes I want something that
> will automatically parse the next HTML file in a given directory.

Then you use the java.io API to iterate over the filesystem and parse
the files one after another; that should be less than 20 lines of code.

Gruss,
Markus

Signature

markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Anony! - 22 Jul 2004 12:35 GMT
"Markus Schaber" <individual-news@schabi.de> wrote in message
> Hi, Anony,
>
> On Thu, 22 Jul 2004 08:45:12 GMT
> "Anony!" <someone@something.com> wrote:
>
> > There are no links in those HTML pages. And yes I want something that
> > will automatically parse the next HTML file in a given directory.
>
> Then you use the java.io API to iterate over the filesystem and parse
> the files one after another, should be less than 20 lines of code.
>
> Gruss,
> Markus
You mean store the files in a tree structure and iterate through it?

AaA
Markus Schaber - 22 Jul 2004 13:36 GMT
Hi, Anony,

>> > There are no links in those HTML pages. And yes I want something
>> > that will automatically parse the next HTML file in a given
[quoted text clipped - 4 lines]
>
> You mean store the files in a tree structure? and iterate through it?

Why a tree structure?

You create a File object for the directory (isDirectory() should then
return true) and use listFiles(filter) to get an array of the files in
that directory. You can then pass each of them to your HTML parser.
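
As an illustration of the steps above (a sketch, not Markus's own code): the class name HtmlDirectoryLister and the ".html"/".htm" extension check are assumptions, and the call into the actual HTML parser is left as a comment because the thread does not settle on one.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;

// Hypothetical helper: lists the HTML files directly inside one directory.
class HtmlDirectoryLister {

    static File[] listHtmlFiles(File dir) {
        if (!dir.isDirectory()) {
            throw new IllegalArgumentException("Not a directory: " + dir);
        }

        // Accept only regular files whose name ends in .html or .htm.
        FilenameFilter htmlOnly = (parent, name) -> {
            String lower = name.toLowerCase();
            return (lower.endsWith(".html") || lower.endsWith(".htm"))
                    && new File(parent, name).isFile();
        };

        File[] files = dir.listFiles(htmlOnly);
        if (files == null) {
            return new File[0]; // listFiles returns null on an I/O error
        }
        Arrays.sort(files); // listFiles makes no promise about order
        return files;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File file : listHtmlFiles(dir)) {
            // Replace this line with a call to whichever HTML parser you choose.
            System.out.println("Would parse: " + file.getPath());
        }
    }
}
```

This only looks at one directory; subdirectories are not visited.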

Markus

Signature

markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Rogan Dawes - 23 Jul 2004 07:21 GMT
> Hi
>
[quoted text clipped - 3 lines]
> parser is one option I could use, but I would like some other
> opinions/alternatives.

Have a look at htmlparser on sourceforge.net
(http://htmlparser.sourceforge.net), which is probably more robust than
the standard Sun parser.

> 2. I am hoping to parse a batch of HTML Web pages. I believe it should be
> relatively easy to do a single HTML page, but any tips for multiple HTML
> pages? how will the parser know to go to the next HTML page? I have like
> thousands of HTML pages to parse.

Either you have a list of the pages/URLs that you provide to the parser,
or you parse additional URLs from the pages as you read them. Since you
said in another response in this thread that the pages will not have
links to other pages, you must then have a list yourself.

Clearly, your computer cannot simply "guess" which pages to parse. If
the pages are stored locally, simply iterate over the directory(ies) in
which they are stored, parsing them one by one. If the pages are stored
on a server, perhaps there is an index page that you can parse to get a
list of pages.
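
For the locally stored case with more than one directory, a minimal sketch of walking a whole directory tree. It assumes a Java version that has java.nio.file.Files.walk (Java 8 or later, so newer than this thread); the class name HtmlTreeWalker and the extension check are illustrative choices.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: finds HTML files under a root directory, recursively.
class HtmlTreeWalker {

    static List<Path> findHtmlFiles(Path root) {
        // Files.walk visits the root and everything below it; close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase();
                        return name.endsWith(".html") || name.endsWith(".htm");
                    })
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each Path in the result can be handed to the parser one by one, as described above.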

Rogan
Signature

Rogan Dawes

*ALL* messages to discard@dawes.za.net will be dropped, and added
to my blacklist. Please respond to "nntp AT dawes DOT za DOT net"

Anony! - 23 Jul 2004 09:52 GMT
> > Hi
> >
[quoted text clipped - 23 lines]
> on a server, perhaps there is an index page that you can parse to get a
> list of pages.

Let me describe what I am trying to parse in greater detail.

I have a Web page that has a list of hyperlinks. Each of these hyperlinks
points to another page with a list of hyperlinks. Each of those links points
to a unique page I want to parse. Anyone lost yet? I don't know how to
download all of these pages in an automated fashion for parsing.

I think I can handle storing these files in a file system recognised by Java
and then parsing each of the files in that file system. It's the automated
download of all these Web pages that leaves me clueless.
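
One possible shape for the two-level structure described above, offered as a sketch only. All names here (TwoLevelCollector, collectLeafPages, resolve, download) are hypothetical. The link-listing step is passed in as a function so that any HTML parser can supply it, and it is assumed to return absolute URLs; resolve shows one way to make a relative href absolute first.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper for: index page -> list pages -> pages to parse.
class TwoLevelCollector {

    /**
     * Follows links exactly two levels deep from the index page and returns
     * the final pages, without duplicates, in the order first seen.
     * linksOf must return the absolute URLs linked from a given page.
     */
    static Set<String> collectLeafPages(String indexUrl,
                                        Function<String, List<String>> linksOf) {
        Set<String> leaves = new LinkedHashSet<>();
        for (String listPage : linksOf.apply(indexUrl)) {
            leaves.addAll(linksOf.apply(listPage));
        }
        return leaves;
    }

    /** Turns a possibly relative href into an absolute URL. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    /** Saves the content of one URL to a local file, replacing any old copy. */
    static void download(String url, Path target) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

The idea is to call collectLeafPages once, download each returned URL to a local directory, and then parse the saved files from disk. A real run would also want error handling and a pause between requests.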

Any help appreciated.

AaA
Andrew Thompson - 23 Jul 2004 10:09 GMT
> I have a Webpage that has a list of hyperlinks. Each of these hyperlinks
> point to another page with a list of hyperlinks. Each of these links point
> to a unique page I want to parse.

'WebCrawler'
<http://groups.google.com/groups?q=group%3Acomp.lang.java.*+webcrawle+koran>

HTH

Signature

Andrew Thompson
http://www.PhySci.org/ Open-source software suite
http://www.PhySci.org/codes/ Web & IT Help
http://www.1point1C.org/ Science & Technology

William Brogden - 23 Jul 2004 13:53 GMT
>> > Hi
>> >
[quoted text clipped - 45 lines]
>
> AaA

You might find the code to JTidy to be useful
http://sourceforge.net/projects/jtidy

"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty  
printer. Like its non-Java cousin, JTidy can be used as a tool for  
cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM  
parser for real-world HTML."

Bill
J. Chris Tilton - 23 Jul 2004 20:38 GMT
JTidy? It is an HTML parser.

> Hi
>
[quoted text clipped - 18 lines]
Roedy Green - 23 Jul 2004 21:16 GMT
>1. I'm looking for a Java HTML parser
see http://mindprod.com/jgloss/parser.html

Signature

Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.


