On 20 Mar 2006 18:32:27 -0800, "Dave Mandelin"
<mandelin@cs.berkeley.edu> wrote, quoted or indirectly quoted someone
who said :
>Can you give some examples of how it fails on poorly written HTML? It
>may not be that hard to bulletproof the tag-stripping code you wrote.
I wrote a tag stripper, but it presumes valid HTML. I suppose that on
hitting a < inside a tag, you could presume the > was missing and insert
one just before the first space after the preceding <.
You could look for standard tags.
The other common error is a < or > lying around by itself or next to
an =.
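A rough sketch of the two heuristics above, in Java. This is an illustration under assumptions, not the stripper I actually wrote: the class name LenientTagStripper and its helper methods are made up for the example, and a < only counts as a tag start when a letter, /, ! or ? follows it. It will still mangle some inputs, for example a < inside a quoted attribute value.

```java
public class LenientTagStripper {

    /** Removes tags from html, guessing where a missing '>' should have been. */
    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<' && i + 1 < html.length() && startsTag(html.charAt(i + 1))) {
                i = endOfTag(html, i);   // skip the tag, keep nothing from it
            } else {
                out.append(c);           // ordinary text, or a stray '<' / '>'
                i++;
            }
        }
        return out.toString();
    }

    /** Assumption: a real tag starts with a letter, '/', '!' or '?' right after '<'. */
    static boolean startsTag(char c) {
        return Character.isLetter(c) || c == '/' || c == '!' || c == '?';
    }

    /** Returns the index just past the tag that starts at position start. */
    static int endOfTag(String html, int start) {
        int firstSpace = -1;
        for (int j = start + 1; j < html.length(); j++) {
            char c = html.charAt(j);
            if (c == '>') {
                return j + 1;            // well-formed tag
            }
            if (firstSpace < 0 && Character.isWhitespace(c)) {
                firstSpace = j;
            }
            if (c == '<' && j + 1 < html.length() && startsTag(html.charAt(j + 1))) {
                // A new tag opens before this one closed: presume the '>' was
                // missing just before the first space after the opening '<'.
                return firstSpace >= 0 ? firstSpace : j;
            }
        }
        // Ran off the end without finding '>': apply the same guess.
        return firstSpace >= 0 ? firstSpace : html.length();
    }

    public static void main(String[] args) {
        System.out.println(strip("<b hello <i>world</i>"));
        System.out.println(strip("if a < b then x = > y"));
    }
}
```

Note that when the guess is wrong, attribute text of the broken tag leaks into the output as junk.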
From a practical point of view, it might be easiest to run your code
through a verifier, fix the errors, and then do your strip. See
http://mindprod.com/jgloss/htmlvalidator.html
Anything else is going to lose some data or insert some junk.

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
google@lrlart.com - 21 Mar 2006 07:28 GMT
One failure I've run into is with the use of JavaScript, for example:
<script>
function CNN_getCookies() {
var hash = new Array;
if ( document.cookie ) {
var cookies = document.cookie.split( '; ' );
for ( var i = 0; i < cookies.length; i++ ) {
......
Note: Notice the "less than" symbol in the javascript above.
</script>
This is some slightly modified source from CNN's site. The point is
that a "<tag>" pattern can be recognized, but it's difficult to
tell it apart from a greater-than or less-than sign in some enclosed
JavaScript code.
But even if I were to write some code that could handle this case
effectively I'd probably be dealing with loads of other special cases
within poorly written html source.
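For the script case specifically, one possible approach is to drop everything between an opening <script and its closing </script> before looking for any other tags. The Java sketch below assumes that; the class name ScriptAwareStripper and the helper indexOfIgnoreCase are hypothetical, and it does not address the other special cases (missing >, unmatched quotes, and so on).

```java
public class ScriptAwareStripper {

    /** Strips tags, discarding the whole body of any script element. */
    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = html.length();
        while (i < n) {
            if (html.regionMatches(true, i, "<script", 0, 7)) {
                // Skip to the closing tag so a '<' in the script body
                // is never mistaken for the start of a tag.
                int close = indexOfIgnoreCase(html, "</script", i);
                if (close < 0) {
                    break;               // unterminated script: drop the rest
                }
                int gt = html.indexOf('>', close);
                i = gt < 0 ? n : gt + 1;
            } else if (html.charAt(i) == '<') {
                // Naive handling for every other tag: skip to the next '>'.
                int gt = html.indexOf('>', i);
                i = gt < 0 ? n : gt + 1;
            } else {
                out.append(html.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    /** Case-insensitive indexOf, so <SCRIPT> and <script> are treated alike. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String page = "<p>Before</p><script>for (var i = 0; i < cookies.length; i++) { }"
                + "</script><p>After</p>";
        System.out.println(strip(page));
    }
}
```

The prefix check is deliberately crude: it would also treat a tag such as <scripted> as a script element, and it would be fooled by a script whose body contains the text </script inside a string.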
Chris Uppal - 21 Mar 2006 11:18 GMT
> But even if I were to write some code that could handle this case
> effectively I'd probably be dealing with loads of other special cases
> within poorly written html source.
Take it from me: parsing HTML is not trivial. And that's even without
considering all the invalid HTML out there (I don't mean stuff like incorrectly
nested structures, but unmatched ""s, tags with no >, etc).
JTidy appears to do what you are looking for; it might help (I've never tried
it myself):
http://jtidy.sourceforge.net/
-- chris
Dave Mandelin - 21 Mar 2006 20:41 GMT
Ah, I see. Yeah, that looks pretty rough. JTidy looks like a really
nice program.