Java Forum / First Aid / February 2004

Java Regex

E.C. - 27 Feb 2004 07:49 GMT
Hi,

I'm trying to match some text in HTML pages, e.g. the content between td
tags (which could have a newline between them). Here is a test that I've
been trying:

StringBuffer page = new StringBuffer("<tr>\n <td>blah</td>\n </tr>"); // actually holds the whole HTML page
Pattern p = Pattern.compile(".*<td>.*</td>.*", Pattern.DOTALL);
Matcher m = p.matcher(page.toString());
boolean matched = m.matches();
System.out.println("Match finder: " + matched);
// group() throws IllegalStateException unless a match succeeded first
String header = matched ? m.group() : null;
if (header != null && header.length() > 0)
   System.out.println("Found the header info: " + header);
else
   System.out.println("Unable to find header info!");

I understand why this matches the entire text (because of the outer .*'s),
however I just want to match the text inside the td tags. I tried:
".*<td>\\(.*\\)</td>.*" but it didn't work.

Question: How can I retrieve just the contents of the tag, and not
everything else? Any advice appreciated.

Cheers,

Mike
Chris Smith - 27 Feb 2004 14:34 GMT
> I understand why this matches the entire text (because of the outer .*'s),
> however I just want to match the text inside the td tags. I tried:
> ".*<td>\\(.*\\)</td>.*" but it didn't work.

It doesn't help to post code that doesn't work when you already know why.
If you post the code that you expected to work but that doesn't, it's
much easier to help you.  How about posting the code where you define a
group with parens and try to retrieve that group's value?

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

E.C. - 27 Feb 2004 19:24 GMT
> > I understand why this matches the entire text (because of the outer .*'s),
> > however I just want to match the text inside the td tags. I tried:
[quoted text clipped - 4 lines]
> it's much more possible to help you.  How about posting the code where
> you define a group with parens and try to retrieve that group's value?

I expected ".*<td>\\(.*\\)</td>.*" to work, matching the text inside the
group, which is surrounded by td tags, and I called m.group() to obtain a
matching group. m.groupCount() returns 1 (the whole thing only is matched).
I also tried: ".*<td>(.*)</td>.*" which matches everything, with
m.groupCount() returning 1 again. Yet shouldn't it be matching 2 parts: The
whole and the group?

StringBuffer page = new StringBuffer("<tr>\n <td>blah</td>\n </tr>"); // actually, has whole HTML page
Pattern p = Pattern.compile(".*<td>(.*)</td>.*", Pattern.DOTALL);
Matcher m = p.matcher(page.toString());
System.out.println("Match finder: " + m.matches());
System.out.println("Match groups: " + m.groupCount());

Cheers,

Mike
Collin VanDyck - 27 Feb 2004 19:35 GMT
> I expected ".*<td>\\(.*\\)</td>.*" to work, matching the text inside the
> group, which is surrounded by td tags, and I called m.group() to obtain a
> matching group. m.groupCount() returns 1 (the whole thing only is matched).
> I also tried: ".*<td>(.*)</td>.*" which matches everything, with
> m.groupCount() returning 1 again. Yet shouldn't it be matching 2 parts: The
> whole and the group?

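m.groupCount() simply returns the number of capturing groups in your regular
expression pattern. It does not tell you whether anything matched, and it
does not count group 0 (the whole match), so a pattern with one pair of
parens always reports 1.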

If you are trying to find out what matched, use this paradigm:

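// DOTALL is needed so that '.' also matches the newlines in the page
Pattern p = Pattern.compile(".*<td>(.*)</td>.*", Pattern.DOTALL);
Matcher m = p.matcher(someinputstring);
if (m.matches()) {
   String insideMatch = m.group(1); // text captured by the first pair of parens
   String entireMatch = m.group();  // same as m.group(0): the whole match
}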

Your insideMatch would then be whatever was in between the TDs.
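
For anyone who wants to run this against the sample input from the original post, here is a minimal sketch. The class name GroupDemo and the helper innerText are made up for illustration and are not from the thread; the pattern is the one above, compiled with Pattern.DOTALL on the assumption that the page contains newlines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the thread.
class GroupDemo {
    private static final Pattern TD =
        Pattern.compile(".*<td>(.*)</td>.*", Pattern.DOTALL);

    // Returns the text captured by group 1, or null if the input does not match.
    static String innerText(String html) {
        Matcher m = TD.matcher(html);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "<tr>\n <td>blah</td>\n </tr>";
        Matcher m = TD.matcher(page);

        // groupCount() describes the pattern, not the match: one pair of parens.
        System.out.println("Groups in pattern: " + m.groupCount()); // 1

        if (m.matches()) {
            System.out.println("Group 1: " + m.group(1)); // blah
            System.out.println("Whole match equals input: " + m.group().equals(page));
        }
    }
}
```

Note that group() and group(0) are the same thing, so the "whole" match is always available in addition to the groups that groupCount() reports.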

Note though that quantifiers such as '*' are greedy by default, meaning they
match as much as they can.  So if you have a start <td>, then many other
start and end TDs and tables, and finally a </td>, a greedy (.*) can run all
the way to that last </td> and pull in markup well beyond what you intended.

If you want the group to stop at the first closing </td> it reaches, make the
quantifier reluctant by putting a '?' after it, like so:

.*<td>(.*?)</td>.*
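
One caveat when a page has several cells: with matches(), the leading .* in the pattern above is still greedy, so group 1 ends up holding the last cell rather than the first. A common alternative is to drop the outer .*'s and loop with find(). The sketch below assumes that approach; the class and method names (CellExtractor, cells, lastCellViaMatches) are hypothetical, and it only handles bare <td> tags without attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the thread.
class CellExtractor {
    // DOTALL lets '.' cross newlines; CASE_INSENSITIVE accepts <TD> as well as <td>.
    private static final Pattern CELL =
        Pattern.compile("<td>(.*?)</td>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // The pattern from the post, for comparison.
    private static final Pattern WHOLE =
        Pattern.compile(".*<td>(.*?)</td>.*", Pattern.DOTALL);

    // Collects the contents of every <td>...</td> pair, in document order.
    static List<String> cells(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {          // find() scans forward for the next match
            out.add(m.group(1));    // reluctant .*? stops at the nearest </td>
        }
        return out;
    }

    // With matches(), the greedy leading .* skips ahead to the last <td>.
    static String lastCellViaMatches(String html) {
        Matcher m = WHOLE.matcher(html);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "<tr>\n <td>one</td>\n <TD>two\nlines</TD>\n</tr>";
        System.out.println(cells(page));
        System.out.println(lastCellViaMatches("<td>a</td><td>b</td>"));
    }
}
```

For anything beyond simple, well-formed cells (attributes, nested tables), a real HTML parser is likely a safer choice than a regex.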

-CV
E.C. - 27 Feb 2004 19:51 GMT
> > I expected ".*<td>\\(.*\\)</td>.*" to work, matching the text inside the
> > group, which is surrounded by td tags, and I called m.group() to obtain a
[quoted text clipped - 18 lines]
>
> your insideMatch would then be whatever was in between the TDs.

Ah, I see what you mean. That works great, cheers :)

Mike


©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.