Hi, I have written a regular expression to pick out some URLs
that interest me from an access log file.
I want to choose addresses that start with "http://" and end with
".html", ".htm", ".asp", ".php", ".aspx" or with a number.
The following pattern seems to accept only URLs ending with ".html" or
".htm".
Does anybody have an idea why it doesn't recognize URLs with other
endings?
The pattern I use is:
Pattern htmHtml = Pattern.compile("^(http://)\\S+((\\.htm) | (\\.html) | (\\.asp) | (\\.php)| (\\.aspx) | / | (\\d))$");
It doesn't recognise the following URLs:
http://www.galatasaray.org/Futbol/GS/anket/anket.asp
http://bimonline.insites.be/common/CookieCheck.asp?siteID=2382&TagId=1&Pad=tr&Lang=tr&Country=tr&b=1
http://www.aksiyon.com.tr/sonsayi210.php
It's possible that the problem is somewhere else in the code but I
wondered if you see something wrong in my pattern.
Regards,
Eren Aykin
Oliver Wong - 05 Jun 2006 15:29 GMT
> Hi, I have written a regular expression to pick out some URLs
> that interest me from an access log file.
[quoted text clipped - 14 lines]
> http://bimonline.insites.be/common/CookieCheck.asp?siteID=2382&TagId=1&Pad=tr&Lang=tr&Country=tr&b=1
> http://www.aksiyon.com.tr/sonsayi210.php
Are you sure it accepts those that end with ".html"?
Could it have something to do with all those whitespaces in the pattern?
- Oliver
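
To make the whitespace point concrete, here is a minimal sketch, assuming the pattern is applied with Matcher.matches() (the thread does not show how the matcher is used). The class and method names are made up for illustration. By default every space in a Java pattern is a literal space that must appear in the input; with the Pattern.COMMENTS flag, whitespace in the pattern is ignored.

```java
import java.util.regex.Pattern;

class WhitespaceInPatternDemo {
    // The pattern from the original post, spaces included.
    static final String SPACED =
        "^(http://)\\S+((\\.htm) | (\\.html) | (\\.asp) | (\\.php)| (\\.aspx) | / | (\\d))$";

    // Default mode: each space in the pattern must match a space in the input.
    static boolean matchesLiteral(String url) {
        return Pattern.compile(SPACED).matcher(url).matches();
    }

    // COMMENTS mode: whitespace inside the pattern is ignored.
    static boolean matchesIgnoringPatternSpaces(String url) {
        return Pattern.compile(SPACED, Pattern.COMMENTS).matcher(url).matches();
    }

    public static void main(String[] args) {
        String url = "http://www.aksiyon.com.tr/sonsayi210.php";
        System.out.println(matchesLiteral(url));               // false
        System.out.println(matchesIgnoringPatternSpaces(url)); // true
        // The first alternative is "\.htm" followed by a space, so only
        // an input that ends in ".htm " gets through in default mode.
        System.out.println(matchesLiteral("http://example.com/index.htm ")); // true
    }
}
```

Removing the spaces from the pattern is probably the simpler fix; the COMMENTS flag is shown only to demonstrate that the spaces are what changes the result.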
erenay - 06 Jun 2006 20:08 GMT
> Are you sure it accepts those that end with ".html"?
> Could it have something to do with all those whitespaces in the pattern?
You were right, Oliver, the previous pattern matched only ".htm" URLs.
I tried the pattern:
Pattern.compile("^(http://)\\S+[(\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|/|(\\d+)]$");
And it doesn't match any URLs.
How should I do it?
Oliver Wong - 06 Jun 2006 20:32 GMT
>> Are you sure it accepts those that end with ".html"?
>> Could it have something to do with all those whitespaces in the
[quoted text clipped - 6 lines]
> And it doesn't match any URLs.
> How should I do it?
I'm not familiar with Java's particular variant of regular expressions,
but maybe the new problem is your addition of the square brackets. Did you
try the expression nkalagarla gave you?
<quote>
Try this.
Pattern.compile("^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))$");
</quote>
- Oliver
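
A small sketch of why the square brackets matter, again assuming Matcher.matches() and using made-up class and method names. In Java, [...] is a character class: it matches exactly one character from the set, and the parentheses, bars and plus sign inside it are ordinary characters. So the bracketed version only checks the last character of the URL.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Square brackets: ONE character out of ( ) . | / + h t m l a s p x or a digit.
    static final Pattern BRACKETS = Pattern.compile(
        "^(http://)\\S+[(\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|/|(\\d+)]$");

    // Parentheses: a group of alternatives, each matched as a whole suffix.
    static final Pattern GROUP = Pattern.compile(
        "^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))$");

    static boolean bracketsAccept(String url) {
        return BRACKETS.matcher(url).matches();
    }

    static boolean groupAccepts(String url) {
        return GROUP.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Ends in 'p', which is in the character class, so it slips through.
        System.out.println(bracketsAccept("http://example.com/archive.zip")); // true
        System.out.println(groupAccepts("http://example.com/archive.zip"));   // false
        // Both versions accept a real ".php" URL with no trailing whitespace.
        System.out.println(bracketsAccept("http://www.aksiyon.com.tr/sonsayi210.php")); // true
        System.out.println(groupAccepts("http://www.aksiyon.com.tr/sonsayi210.php"));   // true
    }
}
```

Note that under these assumptions the bracketed pattern does accept clean ".php" input, so a pattern that matches nothing at all suggests the input strings themselves differ from what is expected.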
erenay - 06 Jun 2006 20:59 GMT
Okay, I found the problem. My string had a space at the end of it, so
using the following regex fixed my problem:
Pattern.compile("^(http://)\\S+(?:\\.htm|\\.html|\\.asp|\\.php|\\.aspx|/$|\\d+)
$");
Thanks for your interest.
Regards,
Eren Aykin
> I'm not familiar with Java's particular variant of regular expressions,
> but maybe the new problem is your addition of the square brackets. Did you
> try the expression nkalagarla gave you?
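
For the trailing-space problem, two possible ways to handle it are sketched below: trim the string before matching, or let the pattern tolerate trailing whitespace with \s* before the final $. This is only an illustration under the assumption that Matcher.matches() is used; the class and method names are invented, and the base pattern is the one nkalagarla suggested.

```java
import java.util.regex.Pattern;

class TrailingSpaceDemo {
    // Pattern suggested earlier in the thread.
    static final Pattern STRICT = Pattern.compile(
        "^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))$");

    // Same pattern, but optional whitespace is allowed before the end.
    static final Pattern TOLERANT = Pattern.compile(
        "^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))\\s*$");

    static boolean acceptsRaw(String field) {
        return STRICT.matcher(field).matches();
    }

    // Option 1: strip leading and trailing whitespace from the input first.
    static boolean acceptsTrimmed(String field) {
        return STRICT.matcher(field.trim()).matches();
    }

    // Option 2: keep the input as-is and use the tolerant pattern.
    static boolean acceptsTolerant(String field) {
        return TOLERANT.matcher(field).matches();
    }

    public static void main(String[] args) {
        String field = "http://www.aksiyon.com.tr/sonsayi210.php "; // note the trailing space
        System.out.println(acceptsRaw(field));      // false
        System.out.println(acceptsTrimmed(field));  // true
        System.out.println(acceptsTolerant(field)); // true
    }
}
```

Trimming keeps the pattern simpler and also leaves a clean URL to work with afterwards, which may be preferable when the value is stored or compared later.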
Jussi Piitulainen - 06 Jun 2006 21:16 GMT
>> I tried the pattern:
>> Pattern.compile("^(http://)\\S+[(\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|/|(\\d+)]$");
[quoted text clipped - 4 lines]
> expressions, but maybe the new problem is your addition of the
> square brackets.
Certainly. Another problem may be the anchors ^$. See the javadoc for the
Pattern.MULTILINE flag and maybe enable it:
Pattern.compile("...", Pattern.MULTILINE)
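
A rough sketch of what the flag changes, assuming the whole log is held in one string and scanned with Matcher.find() (an assumption; the thread does not say how the log is read). The sample log text, class name and method name are made up. Without MULTILINE, ^ matches only at the start of the input and $ only at its end (or just before a final line terminator); with it, they match at every line boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    static final String REGEX =
        "^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))$";

    // Collects every match of REGEX in text, compiled with the given flags.
    static List<String> findAll(String text, int flags) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX, flags).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String log = "http://example.com/a.html\n"
                   + "http://example.com/b.php\n"
                   + "http://example.com/c.gif\n";
        // Anchors apply to the whole input: nothing is found.
        System.out.println(findAll(log, 0));
        // prints []
        // Anchors apply per line: the two wanted URLs are found.
        System.out.println(findAll(log, Pattern.MULTILINE));
        // prints [http://example.com/a.html, http://example.com/b.php]
    }
}
```

If the log is instead read line by line and each line is matched separately, the flag should not be needed.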
> Did you try the expression nkalagarla gave you?
>
[quoted text clipped - 3 lines]
> Pattern.compile("^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))$");
> </quote>
Or "http://\\S+(\\.html|\\.htm|\\.asp|\\.php|\\.aspx|\\d)", possibly
with anchors, possibly in multiline mode.
\S+? might be more appropriate than \S+, especially if \d is replaced
with \d+.
Some of these things depend on how the matcher is used.
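
To illustrate the \S+ versus \S+? remark, here is a minimal sketch using the unanchored pattern above with \d replaced by \d+, searched with Matcher.find(). The class name, the helper method and the example.com URLs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    static final Pattern GREEDY = Pattern.compile(
        "http://\\S+(\\.html|\\.htm|\\.asp|\\.php|\\.aspx|\\d+)");
    static final Pattern RELUCTANT = Pattern.compile(
        "http://\\S+?(\\.html|\\.htm|\\.asp|\\.php|\\.aspx|\\d+)");

    // Returns what the ending group captured on the first find(), or null.
    static String ending(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://example.com/page12345";
        // Greedy \S+ swallows most of the digits and leaves one for \d+.
        System.out.println(ending(GREEDY, url));    // 5
        // Reluctant \S+? stops as early as it can, so \d+ gets them all.
        System.out.println(ending(RELUCTANT, url)); // 12345
    }
}
```

The flip side is that, without anchors, the reluctant version stops at the first possible ending: on "http://example.com/2006/index.html" find() captures "2006" rather than ".html". That is one way the result depends on how the matcher is used.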
nkalagarla@gmail.com - 05 Jun 2006 16:48 GMT
Try this.
Pattern.compile("^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))$");
> Hi, I have written a regular expression to pick out some URLs
> that interest me from an access log file.
[quoted text clipped - 20 lines]
> Regards,
> Eren Aykin