ParserCallback - Html Parser in Java

k4 - 01 Apr 2007 23:05 GMT
Hi!

I have a problem. My class for parsing HTML documents works pretty well, but
if the document contains "<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en" lang="en">" the parser reports an error. Why?

import javax.swing.text.html.*;
import javax.swing.text.*;

import java.net.*;

public class HTMLParse extends HTMLEditorKit.ParserCallback {
    int begin = 0; // number of <p> start tags seen so far
    int end = 0;   // number of </strong> end tags seen so far

    // public void handleError(String errorMsg, int pos) {
    //     System.out.println("An error occurred: " + errorMsg);
    //     // System.exit(2); // if we wanted to exit on a tag problem, but we don't
    // }

    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrSet, int pos) {
        if (tag == HTML.Tag.P) {
            begin++;
        }
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.STRONG) {
            end++;
        }
    }

    public void handleText(char[] data, int pos) {
        // Print text seen after the third <p> start tag and before the sixth.
        if (begin == 3 || begin == 4 || begin == 5) {
            System.out.println(data);
        }
    }
}
Tom Hawtin - 02 Apr 2007 00:30 GMT
> I have a problem. My class for parsing HTML documents works pretty well, but
> if the document contains "<html xmlns="http://www.w3.org/1999/xhtml"
> xml:lang="en" lang="en">" the parser reports an error. Why?
>
> import javax.swing.text.html.*;

The Swing HTML parser is ancient (and basic). I believe modern XHTML
(HTML written in XML syntax) will give it problems.
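
To see exactly what the Swing parser objects to, one option is to feed it the markup through ParserDelegator and record whatever it passes to handleError. This is a minimal sketch, assuming the callback is driven this way (the original post does not show how the parser is invoked). The class name ParseErrorDemo and the helper collectErrors are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ParseErrorDemo {

    /** Parses the markup and returns every message the parser passed to handleError. */
    static List<String> collectErrors(String html) throws IOException {
        final List<String> errors = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleError(String errorMsg, int pos) {
                // Record the complaint together with its position in the input.
                errors.add(pos + ": " + errorMsg);
            }
        };
        // The third argument (true) tells the parser to ignore charset directives.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return errors;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample input modelled on the tag quoted in the question.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
                + "<body><p>Hello<br /></p></body></html>";
        for (String error : collectErrors(xhtml)) {
            System.out.println(error);
        }
    }
}
```

As far as I know, handleError is only a notification and the parser carries on after calling it, so logging the messages (rather than exiting) is often enough to keep handleText and the tag callbacks working.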

You could use something like JTidy (google it) to reformat the document
as old-school HTML. Alternatively, a short SAX handler could remove the
XMLisms (like using <x/> instead of <x></x>).

Tom Hawtin
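
The SAX idea above could look roughly like the sketch below. It is only an illustration under some assumptions: the input must be well-formed XML, the class name XhtmlFlattener and the method flatten are invented here, the list of empty ("void") elements is a partial guess, and named entities such as &nbsp; are not handled because the DTD is deliberately not loaded.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XhtmlFlattener extends DefaultHandler {
    // Elements that old-style HTML writes without an end tag (partial list, an assumption).
    private static final Set<String> VOID = new HashSet<>(Arrays.asList(
            "br", "hr", "img", "input", "meta", "link", "area", "base", "col", "param"));

    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getQName(i);
            // Drop XML-only attributes such as xmlns and xml:lang.
            if (name.startsWith("xmlns") || name.startsWith("xml:")) {
                continue;
            }
            out.append(' ').append(name).append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // <x/> arrives as a start/end pair, so it is written as <x></x>,
        // except for void elements, which get no end tag at all.
        if (!VOID.contains(qName.toLowerCase())) {
            out.append("</").append(qName).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length)));
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Supply an empty DTD instead of fetching the real one over the network.
        return new InputSource(new StringReader(""));
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Re-serialises well-formed XHTML as plain HTML-style markup. */
    static String flatten(String xhtml) throws Exception {
        XhtmlFlattener handler = new XhtmlFlattener();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(flatten(
                "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
                + "<body><p>Hi<br/></p></body></html>"));
    }
}
```

The flattened string can then be handed to the Swing parser through a StringReader. For documents that are not well-formed XML this approach fails outright, which is where a tool like JTidy is probably the better choice.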

