> 2. I am hoping to parse a batch of HTML Web pages. I believe it should be
> relatively easy to do a single HTML page, but any tips for multiple HTML
> pages? how will the parser know to go to the next HTML page? I have like
> thousands of HTML pages to parse.
I don't think the parser would go to the 'next HTML page' automatically.
However, you can look for any linked pages in the parsed document and
then start parsing those new pages.
Luca Paganelli
Anony! - 22 Jul 2004 09:45 GMT
>>"Luca Paganelli wrote
> > 2. I am hoping to parse a
[quoted text clipped - 19 lines]
>
> Luca Paganelli
There are no links in those HTML pages. And yes I want something that will
automatically parse the next HTML file in a given directory.
AaA
Markus Schaber - 22 Jul 2004 11:48 GMT
Hi, Anony,
> There are no links in those HTML pages. And yes I want something that
> will automatically parse the next HTML file in a given directory.
Then use the java.io API to iterate over the filesystem and parse the
files one after another. It should take fewer than 20 lines of code.
Gruss,
Markus

Signature
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com
Anony! - 22 Jul 2004 12:35 GMT
"Markus Schaber" <individual-news@schabi.de> wrote in message
> Hi, Anony,
> On Thu, 22 Jul 2004 08:45:12 GMT
> "Anony!" <someone@something.com> wrote:
> > There are no links in those HTML pages. And yes I want something that
> > will automatically parse the next HTML file in a given directory.
> Then you use the java.io API to iterate over the filesystem and parse
> the files one after another, should be less then 20 lines of code.
> Gruss,
> Markus
You mean store the files in a tree structure? and iterate through it?
AaA
Markus Schaber - 22 Jul 2004 13:36 GMT
Hi, Anony,
>> > There are no links in those HTML pages. And yes I want something
>> > that will automatically parse the next HTML file in a given
[quoted text clipped - 4 lines]
>
> You mean store the files in a tree structure? and iterate through it?
Why a tree structure?
You create a File object for the directory (isDirectory() should then
return true) and use listFiles(filter) to get an array of all matching
files in that directory. Then you can pass each of them to your HTML parser.
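The approach described above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name HtmlBatch, the parse() placeholder and the choice of the .html/.htm extensions are assumptions made for the example. Only File.isDirectory() and File.listFiles(FilenameFilter) come from the java.io API mentioned above.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;

class HtmlBatch {

    /** Returns the .html/.htm files directly inside dir, sorted by path. */
    static File[] listHtmlFiles(File dir) {
        if (!dir.isDirectory()) {
            throw new IllegalArgumentException(dir + " is not a directory");
        }
        File[] files = dir.listFiles(new FilenameFilter() {
            public boolean accept(File parent, String name) {
                String lower = name.toLowerCase();
                boolean html = lower.endsWith(".html") || lower.endsWith(".htm");
                // Skip sub-directories that merely happen to end in .html
                return html && new File(parent, name).isFile();
            }
        });
        if (files == null) {
            // listFiles() returns null when the directory cannot be read
            return new File[0];
        }
        Arrays.sort(files); // listFiles() makes no ordering guarantee
        return files;
    }

    /** Hypothetical placeholder: hand the file to the HTML parser of your choice. */
    static void parse(File htmlFile) {
        System.out.println("Parsing " + htmlFile.getPath());
    }

    public static void main(String[] args) {
        // Directory to scan; defaults to the current directory
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : listHtmlFiles(dir)) {
            parse(f);
        }
    }
}
```

Note that listFiles() does not descend into sub-directories. If the pages are spread over a directory tree, the same method could be called recursively on each sub-directory.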
Markus

> Hi
>
[quoted text clipped - 3 lines]
> parser is one option I could use, but I would like some other
> opinions/alternatives.
Have a look at htmlparser on sourceforge.net
(http://htmlparser.sourceforge.net), which is probably more robust than
the standard Sun parser.
> 2. I am hoping to parse a batch of HTML Web pages. I believe it should be
> relatively easy to do a single HTML page, but any tips for multiple HTML
> pages? how will the parser know to go to the next HTML page? I have like
> thousands of HTML pages to parse.
Either you have a list of the pages/URLs that you provide to the parser,
or you extract additional URLs from the pages as you read them. Since you
said in another response in this thread that the pages do not have
links to other pages, you must supply the list yourself.
Clearly, your computer cannot simply "guess" which pages to parse. If
the pages are stored locally, simply iterate over the directory(ies) in
which they are stored, parsing them one by one. If the pages are stored
on a server, perhaps there is an index page that you can parse to get a
list of pages.
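As a rough illustration of getting a list of pages out of an index page, here is a sketch that collects the href values of all <a> tags. It uses the parser bundled with the JDK (javax.swing.text.html), presumably the "standard Sun parser" referred to above; the class name LinkExtractor and the sample markup are made up for the example, and a more tolerant third-party parser may be a better fit for messy real-world pages.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkExtractor {

    /** Returns the raw href values of all <a> tags, in document order. */
    static List<String> extractLinks(Reader html) throws IOException {
        final List<String> links = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) { // anchors without href are skipped
                        links.add(href.toString());
                    }
                }
            }
        };
        // true = ignore any charset declared in the page itself
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Made-up index page, for illustration only
        String index = "<html><body>"
                + "<a href=\"page1.html\">One</a> "
                + "<a href=\"page2.html\">Two</a>"
                + "</body></html>";
        for (String link : extractLinks(new StringReader(index))) {
            System.out.println(link);
        }
    }
}
```

The returned values are whatever the page author wrote, so relative links still have to be resolved against the address of the index page before they can be fetched.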
Rogan

Signature
Rogan Dawes
*ALL* messages to discard@dawes.za.net will be dropped, and added
to my blacklist. Please respond to "nntp AT dawes DOT za DOT net"
Anony! - 23 Jul 2004 09:52 GMT
> > Hi
> >
[quoted text clipped - 23 lines]
> on a server, perhaps there is an index page that you can parse to get a
> list of pages.
Let me describe what I am trying to parse in greater detail.
I have a Webpage that has a list of hyperlinks. Each of these hyperlinks
points to another page with a list of hyperlinks. Each of those links points
to a unique page I want to parse. Anyone lost yet? I don't know how to
download all of these pages in an automated fashion for parsing.
I think I can handle storing these files in a file system that Java can read
and then parsing each of these files in the file system. It's the automated
download of all these webpages that leaves me clueless.
Any help appreciated.
AaA
Andrew Thompson - 23 Jul 2004 10:09 GMT
> I have a Webpage that has a list of hyperlinks. Each of these hyperlinks
> point to another page with a list of hyperlinks. Each of these links point
> to a unique page I want to parse.
'WebCrawler'
<http://groups.google.com/groups?q=group%3Acomp.lang.java.*+webcrawle+koran>
HTH
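
For the two-level layout described in the question (an index page whose links lead to list pages, whose links in turn lead to the pages of interest), a very small crawler might look like the sketch below. It is only an illustration under several assumptions: the class name TwoLevelFetcher, the numbered output file names and the UTF-8 encoding are invented for the example, the JDK's bundled Swing HTML parser stands in for whichever parser is finally chosen, and there is no error recovery, no delay between requests and no handling of malformed href values.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TwoLevelFetcher {

    /** Fetches a page and returns the absolute targets of its <a href> links. */
    static List<URI> linksOf(final URI page) throws IOException {
        final List<URI> links = new ArrayList<URI>();
        // Assumption: pages are UTF-8; adjust to the real encoding
        Reader in = new InputStreamReader(page.toURL().openStream(), "UTF-8");
        try {
            new ParserDelegator().parse(in, new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (tag == HTML.Tag.A && href != null) {
                        // Resolve relative links against the page's own address
                        links.add(page.resolve(href.toString()));
                    }
                }
            }, true);
        } finally {
            in.close();
        }
        return links;
    }

    /** Index page -> list pages -> leaf pages; returns the leaf URIs without duplicates. */
    static Set<URI> collectLeafPages(URI index) throws IOException {
        Set<URI> leaves = new LinkedHashSet<URI>();
        for (URI listPage : linksOf(index)) {
            leaves.addAll(linksOf(listPage));
        }
        return leaves;
    }

    /** Saves each page into targetDir as page0.html, page1.html, ... */
    static void download(Set<URI> pages, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        int n = 0;
        for (URI page : pages) {
            InputStream in = page.toURL().openStream();
            try {
                Files.copy(in, targetDir.resolve("page" + (n++) + ".html"),
                        StandardCopyOption.REPLACE_EXISTING);
            } finally {
                in.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: address of the top-level index page; args[1]: local target directory
        Set<URI> leaves = collectLeafPages(URI.create(args[0]));
        download(leaves, Paths.get(args[1]));
        System.out.println("Saved " + leaves.size() + " pages");
    }
}
```

Once the pages are saved locally, they can be parsed one after another by iterating over the target directory. When running something like this against a live server, it is considerate to pause between requests and to respect the site's robots.txt.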

Signature
Andrew Thompson
http://www.PhySci.org/ Open-source software suite
http://www.PhySci.org/codes/ Web & IT Help
http://www.1point1C.org/ Science & Technology
William Brogden - 23 Jul 2004 13:53 GMT
>> > Hi
>> >
[quoted text clipped - 45 lines]
>
> AaA
You might find the code to JTidy to be useful
http://sourceforge.net/projects/jtidy
"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty
printer. Like its non-Java cousin, JTidy can be used as a tool for
cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM
parser for real-world HTML."
Bill
JTidy? It is an HTML parser.
> Hi
>
[quoted text clipped - 18 lines]
>1. I'm looking for a Java HTML parser
see http://mindprod.com/jgloss/parser.html

Signature
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.