Hi, I made a little function to extract URLs from any content with a
regular expression, but it doesn't really work.
When I try to extract URLs from http://google.com I only get 4 results
in my array:
* http://images.google.nl/imghp?oe=ISO-8859-1&hl=nl&tab=wi
* http://
* .nl
* /imghp?oe=ISO-8859-1&hl=nl&tab=wi
Here is the code of my function:
public static void find_url(String content) {
    Pattern p = Pattern.compile(
        "(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");
    Matcher m = p.matcher(content);
    if (m.find())
    {
        for (int i=0; i<=m.groupCount(); i++) {
            myVar.urls[i] = m.group(i);
        }
    }
}
Andrew Thompson - 18 Nov 2007 05:46 GMT
> Hi, I made a ...
..little boo-boo in multi-posting this message
to comp.lang.java.help, after making a post to
comp.lang.java.programmer.
Please refrain from multi-posting, in future.
X-post to c.l.j.p./h., w/ f-u to c.l.j.h. only.
--
Andrew T.
SadRed - 18 Nov 2007 05:53 GMT
> Hi, I made a little function to extract urls from any content with a
> regular expression but it doesn't really work.
[quoted text clipped - 22 lines]
>
> }
Don't clutter the forum with your multi posts, please!
Your regex code is very wrong. Study this code and go to bed. I didn't
touch your weird regex string but I firmly believe it is also wrong
for your desired purpose which I don't know in its details.
----------------------------------------------
import java.net.*;
import java.util.regex.*;
import java.io.*;
import java.util.*;

public class Mnm{
  public static void main(String[] args) throws Exception{
    String contStr = "";
    String line = null;
    Locale.setDefault(Locale.US);
    // String urlStr = "http://google.com";
    String urlStr = "http://www.google.com/ig?hl=en";
    if (args.length > 0){
      urlStr = args[0];
    }
    URL url = new URL(urlStr);
    InputStream is = url.openStream();
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    while ((line = br.readLine()) != null){
      contStr += line;
    }
    findUrl(contStr);
  }

  public static void findUrl(String content) {
    int gc, counter, gcounter;
    gc = counter = gcounter = 0;
    Pattern p = Pattern.compile
      ("(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");
    Matcher m = p.matcher(content);
    gc = m.groupCount();
    for (int i = 0; i <= gc; ++i){
      System.out.println("GROUP" + i + " : ");
      while (m.find()){
        ++counter;
        ++gcounter;
        System.out.println(gcounter + ".> " + m.group(i));
      }
      m.reset(content); // for next group
      gcounter = 0;
    }
    if (counter == 0){
      System.out.println("--no match--");
    }
  }
}
----------------------------------------
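A shorter sketch of the same idea, for readers who only want the matched URLs: the original function calls find() once and then walks the capture groups of that single match, which is why the array holds one URL plus its pieces. Calling find() in a loop and taking group() each time collects every match instead. The class name UrlExtractor, the method extractUrls and the simplified pattern (which requires an http:// or https:// prefix) are assumptions made for illustration, not the original poster's regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractor {
    // Assumed, simplified pattern: scheme, host, optional path/query.
    // It is not a full URL grammar; it only illustrates the matching loop.
    private static final Pattern URL =
        Pattern.compile("https?://[\\w.-]+(?:/[\\w#&=?+%/.-]*)?");

    // Returns every whole match, one list entry per URL found.
    static List<String> extractUrls(String content) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(content);
        while (m.find()) {          // advance to the next match each time
            urls.add(m.group());    // group() is the whole match, not a sub-group
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
            "<a href=\"http://images.google.nl/imghp?oe=ISO-8859-1&hl=nl&tab=wi\">Images</a>"
            + " <link href=\"http://www.google.com/favicon.ico\"> document.cookie";
        for (String url : extractUrls(html)) {
            System.out.println(url);
        }
    }
}
```

Because the scheme is mandatory in this pattern, dotted JavaScript names such as document.cookie are not reported; the price is that relative links are skipped too.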
mnml - 18 Nov 2007 16:58 GMT
> > Hi, I made a little function to extract urls from any content with a
> > regular expression but it doesn't really work.
[quoted text clipped - 84 lines]
>
> ----------------------------------------
Thanks for your example. Yeah, the regexp is wrong: with your example it
was returning stuff like:
3.> http://www.google.com/favicon.ico
4.> http://www.google.com/favicon.ico
5.> WeTHhV4cOxM.js
6.> document.location.hostname
7.> domain.indexOf
8.> domain.substring
9.> document.cookie
Roedy Green - 19 Nov 2007 08:54 GMT
>Pattern p = Pattern.compile("(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-
>zA-Z_0-9\\-]+)+(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");
To find the problem, keep chopping the tail end off and redoing
the search. When the elements it missed come back, you know the
problem was in the bit you just chopped.
I usually compose these just a bit at a time, adding on just a phrase
before testing.
For other hints see http://mindprod.com/jgloss/regex.html

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
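The bit-at-a-time approach described in the post above could be sketched roughly as follows: split the regex into pieces, compile each growing prefix, and look at what it matches in a known sample. The class name RegexPrefixProbe, the method probe, the sample string and the way the original regex is cut into five pieces are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPrefixProbe {
    // For each growing prefix of the regex, records the first match found
    // in the sample, or null when that prefix matches nothing.
    // Every prefix must be a valid regex on its own, so cut between groups.
    static List<String> probe(String[] pieces, String sample) {
        List<String> firstMatches = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        for (String piece : pieces) {
            regex.append(piece);
            Matcher m = Pattern.compile(regex.toString()).matcher(sample);
            firstMatches.add(m.find() ? m.group() : null);
        }
        return firstMatches;
    }

    public static void main(String[] args) {
        // The regex from the original post, cut at its group boundaries.
        String[] pieces = {
            "(@)?",
            "(http://)?",
            "[a-zA-Z_0-9\\-]+",
            "(\\.\\w[a-zA-Z_0-9\\-]+)+",
            "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?"
        };
        String sample = "http://images.google.nl/imghp?hl=nl";
        List<String> results = probe(pieces, sample);
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < pieces.length; i++) {
            prefix.append(pieces[i]);
            System.out.println(prefix + "  ->  " + results.get(i));
        }
    }
}
```

The piece after which the output stops looking right is the one to inspect.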
mnml - 20 Nov 2007 03:04 GMT
On Nov 19, 8:54 am, Roedy Green <see_webs...@mindprod.com.invalid>
wrote:
> >Pattern p = Pattern.compile("(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-
> >zA-Z_0-9\\-]+)+(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");
[quoted text clipped - 10 lines]
> Roedy Green Canadian Mind Products
> The Java Glossary http://mindprod.com
ok, thank you for the link
Chris - 20 Nov 2007 04:39 GMT
> Hi, I made a little function to extract urls from any content with a
> regular expression but it doesn't really work.
If you're extracting URLs from HTML, it's a lot easier to try to
recognize the anchor tags. Write a regex to recognize:
<a ~ href="~" ~ >
where ~ means "up to". I've implemented this in a lexer and it works
reliably. (Regexes work a little differently in a lexer, so I don't have
a regex to post). Just adjust to handle mixed case.
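For what it's worth, a rough java.util.regex version of the anchor-tag idea above might look like the sketch below. It is not the lexer implementation mentioned in the post; the class name AnchorHrefExtractor, the method extractHrefs and the pattern itself are assumptions, and the pattern only handles double-quoted href values. Mixed case is covered with Pattern.CASE_INSENSITIVE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorHrefExtractor {
    // Roughly: <a ~ href="~" ~ >, where "~" becomes "anything except '>'"
    // outside the quotes and "anything except '\"'" inside them.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns the href value of every anchor tag, in document order.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = ANCHOR_HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1)); // group 1 is the text between the quotes
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"http://www.google.com/ig?hl=en\" class=\"x\">Home</A>"
            + " <img src=\"http://example.com/a.png\">"
            + " <a class=q href=\"/imghp?tab=wi\">Images</a>";
        for (String href : extractHrefs(html)) {
            System.out.println(href);
        }
    }
}
```

Relative links come back as written, so they would still need resolving against the page URL, for example with new URL(baseUrl, href).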
mnml - 21 Nov 2007 13:46 GMT
> > Hi, I made a little function to extract urls from any content with a
> > regular expression but it doesn't really work.
[quoted text clipped - 7 lines]
> reliably. (Regexes work a little differently in a lexer, so I don't have
> a regex to post). Just adjust to handle mixed case.
ok, thank you :)