Java Forum / General / June 2006

Problem with regular expression for matching the url endings

erenay - 05 Jun 2006 15:08 GMT
Hi, I have written a regular expression in order to choose some URL
addresses that interest me from an access log file.
I want to choose addresses that start with "http://" and end with
".html", ".htm", ".asp", ".php", ".aspx" or with a number.
The following pattern seems to only accept URLs ending with ".html" or
".htm".
Does anybody have an idea why it doesn't recognize URLs with other
endings?

The pattern I use is:
Pattern htmHtml = Pattern.compile("^(http://)\\S+((\\.htm) | (\\.html)
| (\\.asp) | (\\.php)| (\\.aspx) | / | (\\d))$");

It doesn't recognise the following URLs:

http://www.galatasaray.org/Futbol/GS/anket/anket.asp
http://bimonline.insites.be/common/CookieCheck.asp?siteID=2382&TagId=1&Pad=tr&Lang=tr&Country=tr&b=1

http://www.aksiyon.com.tr/sonsayi210.php

It's possible that the problem is somewhere else in the code but I
wondered if you see something wrong in my pattern.

Regards,
Eren Aykin
Oliver Wong - 05 Jun 2006 15:29 GMT
> Hi, I have written a regular expression in order to choose some URL
> addresses that interest me from an access log file.
[quoted text clipped - 14 lines]
> http://bimonline.insites.be/common/CookieCheck.asp?siteID=2382&TagId=1&Pad=tr&Lang=tr&Country=tr&b=1

> http://www.aksiyon.com.tr/sonsayi210.php

   Are you sure it accepts those that end with ".html"?

   Could it have something to do with all those whitespaces in the pattern?

   - Oliver
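
The whitespace hunch can be checked directly. The following is a minimal sketch, assuming the two wrapped lines of the posted pattern were joined by a single space; the class and method names are made up for illustration. In Java a blank inside a pattern is matched literally unless the Pattern.COMMENTS flag is set, so an alternative such as "(\\.htm) " only matches ".htm" followed by a space.

```java
import java.util.regex.Pattern;

class WhitespaceInPattern {
    // The pattern as posted (assumed to be joined with one space):
    // every blank in it is a literal character that must appear in the input.
    static final String SPACED =
        "^(http://)\\S+((\\.htm) | (\\.html) | (\\.asp) | (\\.php)| (\\.aspx) | / | (\\d))$";

    // The same pattern with all blanks removed.
    static final String COMPACT = SPACED.replace(" ", "");

    // Compiles the regex with the given flags and searches the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String url = "http://www.aksiyon.com.tr/sonsayi210.php";
        System.out.println(found(SPACED, 0, url));                // blanks are literal
        System.out.println(found(COMPACT, 0, url));               // blanks removed
        System.out.println(found(SPACED, Pattern.COMMENTS, url)); // blanks ignored
    }
}
```

With the blanks in place, the only alternative that can end a URL is ".htm" followed by a space, which would explain why only some ".htm" lines were accepted.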
erenay - 06 Jun 2006 20:08 GMT
>     Are you sure it accepts those that end with ".html"?
>     Could it have something to do with all those whitespaces in the pattern?

You were right, Oliver: the previous pattern matched only ".htm" endings.

I tried the pattern:
Pattern.compile("^(http://)\\S+[(\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|/|(\\d+)]$");
And it doesn't match any URLs.
How should I do it?
Oliver Wong - 06 Jun 2006 20:32 GMT
>>     Are you sure it accepts those that end with ".html"?
>>     Could it have something to do with all those whitespaces in the
[quoted text clipped - 6 lines]
> And it doesn't match any URLs.
> How should I do it?

   I'm not familiar with Java's particular variant of regular expressions,
but maybe the new problem is your addition of the square brackets. Did you
try the expression nkalagarla gave you?

<quote>
Try this.

Pattern.compile("^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))$");
</quote>

   - Oliver
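
To illustrate why the square brackets matter: in Java, [...] is a character class that matches exactly one character from the set, so the parentheses, pipes and letters inside it lose their grouping and alternation meaning. The sketch below compares the bracketed pattern with a grouped one; the class name and the example.com URL are assumptions for illustration only.

```java
import java.util.regex.Pattern;

class BracketsVsGroup {
    // Square brackets: one single character out of ( ) . | / + the letters
    // h t m l a s p x, or a digit.
    static final Pattern CHAR_CLASS = Pattern.compile(
        "^(http://)\\S+[(\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|/|(\\d+)]$");

    // Parentheses: a real alternation between whole endings.
    static final Pattern GROUP = Pattern.compile(
        "^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|/|(\\d+))$");

    static boolean accepts(Pattern pattern, String url) {
        return pattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        String zip = "http://example.com/archive.zip"; // hypothetical URL
        System.out.println(accepts(CHAR_CLASS, zip)); // ends in 'p', which is in the class
        System.out.println(accepts(GROUP, zip));      // ".zip" is not a listed ending
    }
}
```

So the bracketed version is not too strict but too loose: it accepts any URL whose last character happens to be in the set. That it matched nothing in the thread points to a separate cause.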
erenay - 06 Jun 2006 20:59 GMT
Okay, I found the problem. My string had a space at the end of it so
using the following regex fixed my problem:
Pattern.compile("^(http://)\\S+(?:\\.htm|\\.html|\\.asp|\\.php|\\.aspx|/$|\\d+)
$");
Thanks for your interest.
Regards,
Eren Aykin

> I'm not familiar with Java's particular variant of regular expressions,
> but maybe the new problem is your addition of the square brackets. Did you
> try the expression nkalagarla gave you?
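
An alternative to adapting the pattern to the stray space is to strip it before matching. This is only a sketch under the assumption that each candidate URL arrives as its own string, possibly with surrounding whitespace; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class TrailingSpace {
    // Endings from the thread, without capturing groups.
    static final Pattern URL_ENDING = Pattern.compile(
        "^http://\\S+(?:\\.htm|\\.html|\\.asp|\\.php|\\.aspx|/|\\d)$");

    // trim() removes leading and trailing whitespace before the match,
    // so the $ anchor sees the real end of the URL.
    static boolean isInteresting(String raw) {
        return URL_ENDING.matcher(raw.trim()).matches();
    }

    public static void main(String[] args) {
        String raw = "http://www.aksiyon.com.tr/sonsayi210.php "; // note the trailing space
        System.out.println(URL_ENDING.matcher(raw).matches()); // untrimmed input
        System.out.println(isInteresting(raw));                // trimmed input
    }
}
```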
Jussi Piitulainen - 06 Jun 2006 21:16 GMT
>> I tried the pattern:
>> Pattern.compile("^(http://)\\S+[(\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|/|(\\d+)]$");
[quoted text clipped - 4 lines]
> expressions, but maybe the new problem is your addition of the
> square brackets.

Certainly. Another problem may be the anchors ^$. See the Javadoc for the
Pattern.MULTILINE flag and maybe enable it:

  Pattern.compile("...", Pattern.MULTILINE)
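
The flag matters when the whole log is searched as one string rather than line by line. A minimal sketch, assuming a log that holds one URL per line (the sample lines and the class name are invented): without MULTILINE, ^ and $ anchor only at the start and end of the entire input; with it, they anchor at every line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineAnchors {
    static final String REGEX =
        "^http://\\S+(?:\\.htm|\\.html|\\.asp|\\.php|\\.aspx|/|\\d)$";

    // Collects every match that find() reports for the given flags.
    static List<String> findAll(String text, int flags) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX, flags).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String log = "http://example.com/a.html\n"
                   + "http://example.com/b.gif\n"
                   + "http://example.com/c.php";
        System.out.println(findAll(log, 0));                 // anchors apply to the whole input
        System.out.println(findAll(log, Pattern.MULTILINE)); // anchors apply to each line
    }
}
```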

> Did you try the expression nkalagarla gave you?
>
[quoted text clipped - 3 lines]
> Pattern.compile("^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))$");
> </quote>

Or "http://\\S+(\\.html|\\.htm|\\.asp|\\.php|\\.aspx|\\d)", possibly
with anchors, possibly in multiline mode.

\S+? might be more appropriate than \S+, especially if \d is replaced
with \d+.

Some of these things depend on how the matcher is used.
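
The difference between \S+ and \S+? shows up once the ending is captured. In this sketch (class name and URL are assumptions), both patterns accept the same URL, but the greedy \S+ swallows as many digits as it can and leaves only the last one for \d+, while the reluctant \S+? lets \d+ take the whole number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    static final String ENDINGS = "(\\.html|\\.htm|\\.asp|\\.php|\\.aspx|\\d+)$";

    static final Pattern GREEDY = Pattern.compile("^http://\\S+" + ENDINGS);
    static final Pattern RELUCTANT = Pattern.compile("^http://\\S+?" + ENDINGS);

    // Returns the text captured by the ending group, or null if there is no match.
    static String ending(Pattern pattern, String url) {
        Matcher m = pattern.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://example.com/article/210"; // hypothetical URL
        System.out.println(ending(GREEDY, url));    // prints "0"
        System.out.println(ending(RELUCTANT, url)); // prints "210"
    }
}
```

If only a yes/no answer is needed, the two forms are interchangeable here; the choice matters when the captured ending is used afterwards.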
nkalagarla@gmail.com - 05 Jun 2006 16:48 GMT
Try this.

Pattern.compile("^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))$");

> Hi, I have written a regular expression in order to choose some URL
> addresses that interest me from an access log file.
[quoted text clipped - 20 lines]
> Regards,
> Eren Aykin

