Below is a small test program I wrote to try and
do a simple parse of an XML expression, where I
can extract the tag(s) and the data on a single
line. Yes, I know about the other ways to parse
real XML, but I am trying to learn Java only. My
test case is very simple (see below). The problem
seems to be something tricky about the fact that
I am reading the input from the console.
I have tried the regexp in all of the following forms:
Pattern p1 = Pattern.compile("<(\\S+)>(\\S+)</\\1>");
Pattern p1 = Pattern.compile("<(\\S+)>(\\S+)</\\1>\n");
Pattern p1 = Pattern.compile("<(\\S+)>(\\S+)</\\1>\r\n");
In Windows cmd.exe, none of these match when I enter
<t1>foo</t1>
as standard input.
Any advice would be greatly appreciated.
Mitch
-----------------------------------------------------------------------------------------------
import java.io.*;
import java.net.*;
import java.util.regex.*;
public class test {
public static void main(String[] args) throws IOException {
PrintWriter out = null;
BufferedReader stdIn = null;
String server = "";
String userInput;
stdIn = new BufferedReader(new InputStreamReader(System.in));
// read arguments
if(args.length == 1) {
server = args[0];
} else {
System.out.println("no args");
}
// this one works, but is not really what I want
// Pattern p1 = Pattern.compile("<(\\S+)>(\\S+)<(\\S+)>");
// this one is the correct one that won't match unless the closing tag
// matches the opening tag, but I cannot get it to work with input from
// the console...
Pattern p1 = Pattern.compile("<(\\S+)>(\\S+)</\\1>\r\n");
Matcher m1 = p1.matcher("<t1>foo</t1>\r\n");
System.out.println("matched test string = " + m1.matches());
while ((userInput = stdIn.readLine()) != null) {
System.out.println("got user input: " + userInput + " length " +
userInput.length());
// Now see if the pattern matches
Matcher m = p1.matcher(userInput);
System.out.println("matched = " + m.matches());
System.out.println("numGroups found: " + m.groupCount() + "\n");
// If there were matches, print out the groups found
if (m.matches()) {
for (int j = 1; j <= m.groupCount(); j++) {
System.out.println("group " + m.group(j) + " found\n");
} // end for
} // end if
} // end while
stdIn.close();
} // end main
} // end class test
david.karr - 02 Jul 2007 21:45 GMT
On Jul 2, 11:31 am, "mitch...@yahoo.com" <mitch...@yahoo.com> wrote:
> Below is a small test program I wrote to try and
> do a simple parse of an XML expression, where I
> can extract the tag(s) and the data on a single
> line. Yes, I know about the other ways to parse
> real XML, but I am trying to learn Java only.
You're going to be following all sorts of gnarly twisty passages if
you try to avoid learning XML. The functionality for parsing XML
is easily available in standard Java libraries.
Feel free to explore regular expressions as an intellectual exercise,
but it's a waste of time if you're actually trying to produce real
code to parse XML.
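For reference, here is a minimal sketch of what the standard-library route could look like, using the JAXP DOM parser that ships with the JDK. The class name DomLineParser and the helper describe are made up for illustration, and it assumes each input line is a complete, well-formed element.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original program.
class DomLineParser {
    // Parses one line of XML and returns "tagName=text" for its root element.
    // Throws IllegalArgumentException if the line is not well-formed XML,
    // for example when the closing tag does not match the opening tag.
    static String describe(String xmlLine) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmlLine)))
                    .getDocumentElement();
            return root.getTagName() + "=" + root.getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("not well-formed XML: " + xmlLine, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<t1>foo</t1>"));
        System.out.println(describe("<i>test two</i>"));
    }
}
```

The parser itself rejects a mismatched closing tag, so there is no need for a back-reference, and whitespace inside the element text is not a special case.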
timjowers - 02 Jul 2007 21:46 GMT
On Jul 2, 2:31 pm, "mitch...@yahoo.com" <mitch...@yahoo.com> wrote:
> Below is a small test program I wrote to try and
> do a simple parse of an XML expression, where I
[quoted text clipped - 85 lines]
>
> } // end class test
It works.
Pattern p1 = Pattern.compile("<(\\S+)>(\\S+)</\\1>");
You may be putting whitespace in the text of the element. Try
revising the regexp to look for anything not the terminator. E.g. this
works as is:
<i>test</i>
Yet this does not.
<i>test two</i>
TimJOwers
kaldrenon - 02 Jul 2007 21:58 GMT
> E.g. this
> works as is:
[quoted text clipped - 4 lines]
>
> TimJOwers
Which could easily be fixed by replacing the (\\S+) in the middle with
(.*?) or (.+), I believe.
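A rough sketch of that kind of change, assuming the element text never contains a '<' character. It follows the "anything not the terminator" advice by using ([^<]+) for the middle group; (.+) should behave the same way here because matches() has to consume the whole line. The class name ElementPattern and the helper parse are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original program.
class ElementPattern {
    // Opening tag, then text containing no '<', then a closing tag
    // whose name must repeat capture group 1.
    static final Pattern ELEMENT = Pattern.compile("<(\\S+)>([^<]+)</\\1>");

    // Returns {tag, text} if the whole line is one element, otherwise null.
    static String[] parse(String line) {
        Matcher m = ELEMENT.matcher(line);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] samples = { "<t1>foo</t1>", "<i>test two</i>", "<a>x</b>" };
        for (String line : samples) {
            String[] parts = parse(line);
            System.out.println(line + " -> "
                    + (parts == null ? "no match" : parts[0] + " / " + parts[1]));
        }
    }
}
```

With this pattern the line <i>test two</i> matches, while <a>x</b> is still rejected because the closing tag has to repeat the opening one.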
Roedy Green - 02 Jul 2007 22:18 GMT
On Mon, 02 Jul 2007 11:31:08 -0700, "mitchmcc@yahoo.com"
<mitchmcc@yahoo.com> wrote, quoted or indirectly quoted someone who
said :
> Pattern p1 = Pattern.compile("<(\\S+)>(\\S+)</\\1>");
> Pattern p1 = Pattern.compile("<(\\S+)>(\\S+)</\\1>\n");
> Pattern p1 = Pattern.compile("<(\\S+)>(\\S+)</\\1>\r\n");
You have 4 things that have to work for your regex as a whole to work.
Chop your pattern down to just match <t1>; then, when you get that
working, add the next bit.
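One way to do that chopping, as a sketch: try successively longer prefixes of the pattern against the same input with Matcher.lookingAt(), which only requires a match at the start of the string rather than over the whole of it. The class name IncrementalRegex and the STEPS array are illustrative; the input is assumed to be the line without any terminator.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original program.
class IncrementalRegex {
    // The original pattern, built up one piece at a time.
    static final String[] STEPS = {
        "<(\\S+)>",
        "<(\\S+)>(\\S+)",
        "<(\\S+)>(\\S+)</\\1>",
        "<(\\S+)>(\\S+)</\\1>\r\n"
    };

    // True if the regex matches starting at the beginning of the input.
    static boolean prefixMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        String input = "<t1>foo</t1>"; // assumed: no trailing \r\n
        for (int i = 0; i < STEPS.length; i++) {
            System.out.println("step " + (i + 1) + ": " + prefixMatches(STEPS[i], input));
        }
    }
}
```

For this input the first three steps report true and only the last one, the step that adds \r\n, reports false, which narrows the problem down to the line ending.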
Instead of trying all possibilities of \n, have a look at your string
and see what is on the end. Use charAt to examine it.
See http://mindprod.com/jgloss/regex.html
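A small sketch of that check. It simulates the console with a StringReader so it runs without typing anything; the class name LineEndingCheck and both helpers are invented for the example. BufferedReader.readLine() is documented to return the line without its line-termination characters, so the string handed to the matcher should never end in \n or \r\n.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class, not part of the original program.
class LineEndingCheck {
    // Reads the first line of rawInput the same way the original loop reads System.in.
    static String firstLine(String rawInput) {
        try (BufferedReader in = new BufferedReader(new StringReader(rawInput))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the numeric code of every character, separated by spaces.
    static String charCodes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append((int) s.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = firstLine("<t1>foo</t1>\r\n");
        System.out.println("length " + line.length());
        System.out.println("codes  " + charCodes(line));
    }
}
```

If the last code printed is 62 ('>') rather than 13 or 10, the terminator is already gone, and a pattern ending in \n or \r\n cannot succeed with matches() on that string.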
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com