Java Forum / General / November 2007

Ahhh.. URL wants to get encoded.  Does Java wanna?

François - 06 Nov 2007 05:04 GMT
Now:

From the URL class documentation
(http://java.sun.com/j2se/1.5.0/docs/api/java/net/URL.html):

"Note, the URI class does perform escaping of its component fields in
certain circumstances. The recommended way to manage the encoding and
decoding of URLs is to use URI, and to convert between these two
classes using toURI() and URI.toURL().

The URLEncoder and URLDecoder classes can also be used, but only for
HTML form encoding, which is not the same as the encoding scheme
defined in RFC2396."

I don't care about form encoding.  Just want to encode a string into a
readable URL (RFC2396: thanks Berners-Lee, you described the web.  The
implementation is rough around the edges, sometimes).

So, from what I understand of the mentioned Java 5.0 docs, I wrote
this Java line, nonsensical, but doc-abiding:

encodedURL = (new URL(urlString)).toURI().toURL();

If I want to get the encoded URL, URL-ready:

encodedURL.toString();

Ahhh...  the original string, not encoded at all.

Please help before I go back to PHP and cry sour tears:

echo (urlencode("http://www.javahhh.com/somephpfile.php?myunrequisitedlove=Java&myworkhorse=PHP"));

>>>> http://www.javahhh.com/somephpfile.php?myunrequisitedlove=Java&amp;mywife=PHP

Thanks, fellow coders
François - 06 Nov 2007 05:10 GMT
On Nov 6, 12:04 am, François <francois.x.h...@gmail.com> wrote:
> Now:
>
[quoted text clipped - 34 lines]
>
> Thanks, fellow coders

Ohhh:  even Google newsgroup parser understands it:

& = &amp;

I did it!  I encoded the ampersand (by hand, mind you).  (Sheer
brainpower.)
Sabine Dinis Blochberger - 06 Nov 2007 09:42 GMT
> > Now:
> >
[quoted text clipped - 10 lines]
> > HTML form encoding, which is not the same as the encoding scheme
> > defined in RFC2396."

As I understand it, the correct way is to take each name and value in
the query part (after the '?') and pass it through
java.net.URLEncoder.encode() with "UTF-8" as the encoding.

> Ohhh:  even Google newsgroup parser understands it:
>
> & = &amp;
>
> I did it!  I encoded the ampersand (by hand mind you).(Sheer
> brainpower)

Signature

Sabine Dinis Blochberger

Op3racional
www.op3racional.eu

Roedy Green - 06 Nov 2007 05:15 GMT
On Tue, 06 Nov 2007 05:04:05 -0000, François
<francois.x.hetu@gmail.com> wrote, quoted or indirectly quoted someone
who said :

>.  Just want to encode a string into a
>readable URL (RFC2396:

see http://mindprod.com/jgloss/urlencoded.html
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Wayne - 06 Nov 2007 06:24 GMT
> On Tue, 06 Nov 2007 05:04:05 -0000, François
> <francois.x.hetu@gmail.com> wrote, quoted or indirectly quoted someone
[quoted text clipped - 4 lines]
>
> see http://mindprod.com/jgloss/urlencoded.html

Roedy,

I just tried using URI, it doesn't seem to escape/encode
an ampersand in any part of the URI.  Also, what about the
new IRIs?  A Java program should be robust enough to
handle legal URLs/URIs/IRIs, converting the (up to)
nine parts of an IRI correctly.  My understanding of
your (excellent) urlencoded page and the API docs means this:

     URI uri = new URI("http", "//www.example.com/you & I 10%? wierd & wierder", null);
     System.out.println( uri.toURL() );

should produce:
   http://www.example.com/you%20&%20I%2010%25?%20wierd%20%26%20wierder
But it produces:
   http://www.example.com/you%20&%20I%2010%25?%20wierd%20&%20wierder

(The ampersand is not encoded.)  What did I do wrong?

-Wayne
Roedy Green - 06 Nov 2007 08:06 GMT
>  URI uri = new URI("http", "//www.example.com/you & I 10%? wierd & wierder", null);
>      System.out.println( uri.toURL() );

the way I read RFC 2396 is that reserved chars:  
      ; / ? : @ & = + $ ,
are not supposed to be escaped.  Perhaps Patricia could read the RFC
and tell us what it really means.
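To make that reading concrete, here is a minimal sketch of what the multi-argument java.net.URI constructors do with separate components: illegal characters such as the space and a bare '%' get quoted, while reserved characters such as '&' and '=' are left alone. The class name, the helper name and the example.com host are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class ReservedCharsDemo {
    // Builds a URI from separate components (hypothetical helper).
    // The multi-argument constructors quote illegal characters
    // (space, '%') but leave reserved ones ('&', '=') untouched.
    static String build(String path, String query) {
        try {
            return new URI("http", "www.example.com", path, query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://www.example.com/you%20&%20I%2010%25?a=b&c=d
        System.out.println(build("/you & I 10%", "a=b&c=d"));
    }
}
```

The '&' in the path and in the query both survive unchanged, which matches the output Wayne reported.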

I wish the people who write RFCs would provide examples to illustrate
the true meaning of the lawyerese.
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Owen Jacobson - 06 Nov 2007 21:32 GMT
On Nov 6, 12:06 am, Roedy Green <see_webs...@mindprod.com.invalid>
wrote:

> >  URI uri = new URI("http", "//www.example.com/you & I 10%? wierd & wierder", null);
> >      System.out.println( uri.toURL() );
[quoted text clipped - 3 lines]
> are not supposed to be escaped.  Perhaps Patricia could read the RFC
> and tell us what it really means.

The character & is used in URLs and URIs to separate parts of the
query, in which case it should be present as an actual & character.
It can also occur inside query parameter names or values, in which
case it should be present in an encoded form, as the string %26.  The
example URI Wayne gave uses (unintentionally) ampersands as query
separators, which is why the URI class isn't escaping them; if he
wants to use them as part of the path or part of query parameters or
values he'll have to encode them himself with
URLEncoder.encode(String, String) or similar.

Elsewhere within a URI, ampersands are not reserved and do not
require encoding, except in the scheme part, where they're simply
illegal.
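A small sketch of that distinction, assuming the names and values are available separately before the query string is assembled: each name and value goes through URLEncoder.encode(String, String), and only the separators are written as literal '&' and '='. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryBuilderDemo {
    // Form-encodes one name or value as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Joins alternating names and values into "n1=v1&n2=v2".
    // Only the separators stay as literal '&' and '='.
    static String query(String... namesAndValues) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            if (sb.length() > 0) sb.append('&');
            sb.append(enc(namesAndValues[i])).append('=').append(enc(namesAndValues[i + 1]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.com/search?title=Tom+%26+Jerry&lang=en
        System.out.println("http://www.example.com/search?"
                + query("title", "Tom & Jerry", "lang", "en"));
    }
}
```

The ampersand inside the value comes out as %26, while the one between the two parameters stays a plain '&'.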

> I wish the people who write RFCs would provide examples to illustrate
> the true meaning of the lawyerese.

The RFCs tend to be a codification of existing practice, rather than a
prescription.  In the case of the URI RFC it's a little vaguer, since
URIs (that are not also URLs) aren't in terribly widespread use and
came about as an attempt to normalize URLs, so the RFC could be seen
as prescriptive rather than informative.

On the whole, this post-hoc RFC process works well: it gives the
people creating prototypes time and freedom to play with ideas and
discard the bad ones without prematurely codifying them in a
standard.  It's not perfect, but then, what is?

-Owen
Wayne - 06 Nov 2007 08:06 GMT
>> On Tue, 06 Nov 2007 05:04:05 -0000, François
>> <francois.x.hetu@gmail.com> wrote, quoted or indirectly quoted someone
[quoted text clipped - 24 lines]
>
> -Wayne

I guess the answer is to encode the query part separately, if needed.
The following code seems to work:

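import java.io.UnsupportedEncodingException;
import java.net.*;

public String encodeURL ( String initialURL, boolean parseQuery )
    throws MalformedURLException, URISyntaxException, UnsupportedEncodingException
{
 // Parse the URL (without encoding):
 URL url = new URL( initialURL );
 String scheme = url.getProtocol();     // E.g., "http"
 String authority = url.getAuthority(); // E.g., "user@host:port" (no leading "//")
 String path = url.getPath();           // E.g., "/foo/bar.htm"
 String query = url.getQuery();         // E.g., "foo=bar" (no leading '?'); null if absent
 // Caution: this form-encodes the whole query, including its '=' and '&'
 // separators, and the URI constructor below always quotes '%', so the
 // escapes produced here are escaped a second time.
 if ( parseQuery && query != null )
    query = URLEncoder.encode( query, "UTF-8" );
 String fragment = url.getRef();        // I.e., the "anchor"

 // Assemble the encoded URL, using URI class to properly
 // encode each part:
 URI uri = new URI( scheme, authority, path, query, fragment );
 return uri.toString();
}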

-Wayne
Steven Simpson - 06 Nov 2007 09:33 GMT
> encodedURL = (new URL(urlString)).toURI().toURL();
>
[quoted text clipped - 10 lines]
> http://www.javahhh.com/somephpfile.php?myunrequisitedlove=Java&amp;mywife=PHP
>  

The conversion from "&" to "&amp;" is not relevant to URI encoding - it
is HTML-encoding (and XML, etc).  java.net.URI has no knowledge of this,
and should not have.  It does not know whether you're going to put the
result into an HTML file or something else.

If you're writing out any literal text as part of HTML, including a URI
with "&" in it, you independently need an encoding to map "&" to
"&amp;", "<" to "&lt;", etc.
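A minimal sketch of such an independent escaping step, applied to any text just before it is written into HTML or XML. The class and method names are hypothetical; a library escaper would do the same job.

```java
class XmlEscapeDemo {
    // Replaces the five characters that are special in XML/HTML text
    // and attribute values with their entity references.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.com/a.php?x=1&amp;y=2
        System.out.println(escapeXml("http://www.example.com/a.php?x=1&y=2"));
    }
}
```

This step knows nothing about URIs: it would treat any other string the same way, which is why it does not belong in java.net.URI.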

Signature

ss at comp dot lancs dot ac dot uk                                     |

François - 06 Nov 2007 13:37 GMT
Fellow Coders:

Thanks very much.  Read all the fine answers, which put me on the
right track.

The simple goal is to produce a sitemap.xml, with the url properly
encoded (as specified at http://sitemaps.org/protocol.php).

An example, taken from sitemaps.org:

http://www.example.com/ümlat.php&q=name  should become
http://www.example.com/%C3%BCmlat.php&amp;q=name

(If the above is encoded by the Google Groups parser: ümlat => %C3%BCmlat
and & => &amp;)

The code snippet posted by Wayne works OK, but the ümlaut stays an
ümlaut, since it is not part of the query.

And we can't URLEncode the whole string, since forward slashes and
other valid characters would be turned into %XX escapes as well.
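A quick sketch of what goes wrong, using the sitemaps.org example string with URLEncoder.encode(String, String). The class and method names are made up for the illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class WholeUrlEncodeDemo {
    // Form-encodes an entire URL string, which is NOT what a sitemap needs.
    static String encodeWhole(String url) {
        try {
            return URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // \u00fc is the u-umlaut, written as an escape to avoid source-encoding issues.
        // prints http%3A%2F%2Fwww.example.com%2F%C3%BCmlat.php%26q%3Dname
        System.out.println(encodeWhole("http://www.example.com/\u00fcmlat.php&q=name"));
    }
}
```

The umlaut is encoded as wanted, but so are ':', '/', '&' and '=', so the result is no longer a usable URL.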

The brute force way to do it (and the only way I found that works with
the sitemaps.org example) is to take the initial string and walk it
char by char: replace <, >, & and " with their escaped versions (&lt;,
&gt;, &amp;, &quot;) as Steven indicated, then test each remaining char
against the range \u0000 to \u007F (the Basic Latin block) and encode
any char outside that range with this class, taken straight from the
W3C website (http://www.w3.org/International/O-URL-code.html):

/**********************************************************/

public class URLUTF8Encoder
{

 final static String[] hex = {
   "%00", "%01", "%02", "%03", "%04", "%05", "%06", "%07",
   "%08", "%09", "%0a", "%0b", "%0c", "%0d", "%0e", "%0f",
   "%10", "%11", "%12", "%13", "%14", "%15", "%16", "%17",
   "%18", "%19", "%1a", "%1b", "%1c", "%1d", "%1e", "%1f",
   "%20", "%21", "%22", "%23", "%24", "%25", "%26", "%27",
   "%28", "%29", "%2a", "%2b", "%2c", "%2d", "%2e", "%2f",
   "%30", "%31", "%32", "%33", "%34", "%35", "%36", "%37",
   "%38", "%39", "%3a", "%3b", "%3c", "%3d", "%3e", "%3f",
   "%40", "%41", "%42", "%43", "%44", "%45", "%46", "%47",
   "%48", "%49", "%4a", "%4b", "%4c", "%4d", "%4e", "%4f",
   "%50", "%51", "%52", "%53", "%54", "%55", "%56", "%57",
   "%58", "%59", "%5a", "%5b", "%5c", "%5d", "%5e", "%5f",
   "%60", "%61", "%62", "%63", "%64", "%65", "%66", "%67",
   "%68", "%69", "%6a", "%6b", "%6c", "%6d", "%6e", "%6f",
   "%70", "%71", "%72", "%73", "%74", "%75", "%76", "%77",
   "%78", "%79", "%7a", "%7b", "%7c", "%7d", "%7e", "%7f",
   "%80", "%81", "%82", "%83", "%84", "%85", "%86", "%87",
   "%88", "%89", "%8a", "%8b", "%8c", "%8d", "%8e", "%8f",
   "%90", "%91", "%92", "%93", "%94", "%95", "%96", "%97",
   "%98", "%99", "%9a", "%9b", "%9c", "%9d", "%9e", "%9f",
   "%a0", "%a1", "%a2", "%a3", "%a4", "%a5", "%a6", "%a7",
   "%a8", "%a9", "%aa", "%ab", "%ac", "%ad", "%ae", "%af",
   "%b0", "%b1", "%b2", "%b3", "%b4", "%b5", "%b6", "%b7",
   "%b8", "%b9", "%ba", "%bb", "%bc", "%bd", "%be", "%bf",
   "%c0", "%c1", "%c2", "%c3", "%c4", "%c5", "%c6", "%c7",
   "%c8", "%c9", "%ca", "%cb", "%cc", "%cd", "%ce", "%cf",
   "%d0", "%d1", "%d2", "%d3", "%d4", "%d5", "%d6", "%d7",
   "%d8", "%d9", "%da", "%db", "%dc", "%dd", "%de", "%df",
   "%e0", "%e1", "%e2", "%e3", "%e4", "%e5", "%e6", "%e7",
   "%e8", "%e9", "%ea", "%eb", "%ec", "%ed", "%ee", "%ef",
   "%f0", "%f1", "%f2", "%f3", "%f4", "%f5", "%f6", "%f7",
   "%f8", "%f9", "%fa", "%fb", "%fc", "%fd", "%fe", "%ff"
 };

 /**
  * Encode a string to the "x-www-form-urlencoded" form, enhanced
  * with the UTF-8-in-URL proposal. This is what happens:
  *
  * <ul>
  * <li><p>The ASCII characters 'a' through 'z', 'A' through 'Z',
  *        and '0' through '9' remain the same.
  *
  * <li><p>The unreserved characters - _ . ! ~ * ' ( ) remain the
  *        same.
  *
  * <li><p>The space character ' ' is converted into a plus sign '+'.
  *
  * <li><p>All other ASCII characters are converted into the
  *        3-character string "%xy", where xy is
  *        the two-digit hexadecimal representation of the character
  *        code
  *
  * <li><p>All non-ASCII characters are encoded in two steps: first
  *        to a sequence of 2 or 3 bytes, using the UTF-8 algorithm;
  *        secondly each of these bytes is encoded as "%xx".
  * </ul>
  *
  * @param s The string to be encoded
  * @return The encoded string
  */
 public static String encode(String s)
 {
   StringBuffer sbuf = new StringBuffer();
   int len = s.length();
   for (int i = 0; i < len; i++) {
     int ch = s.charAt(i);
     if ('A' <= ch && ch <= 'Z') {        // 'A'..'Z'
       sbuf.append((char)ch);
     } else if ('a' <= ch && ch <= 'z') {    // 'a'..'z'
          sbuf.append((char)ch);
     } else if ('0' <= ch && ch <= '9') {    // '0'..'9'
          sbuf.append((char)ch);
     } else if (ch == ' ') {            // space
          sbuf.append('+');
     } else if (ch == '-' || ch == '_'        // unreserved
         || ch == '.' || ch == '!'
         || ch == '~' || ch == '*'
         || ch == '\'' || ch == '('
         || ch == ')') {
       sbuf.append((char)ch);
     } else if (ch <= 0x007f) {        // other ASCII
          sbuf.append(hex[ch]);
     } else if (ch <= 0x07FF) {        // non-ASCII <= 0x7FF
          sbuf.append(hex[0xc0 | (ch >> 6)]);
          sbuf.append(hex[0x80 | (ch & 0x3F)]);
     } else {                    // 0x7FF < ch <= 0xFFFF
          sbuf.append(hex[0xe0 | (ch >> 12)]);
          sbuf.append(hex[0x80 | ((ch >> 6) & 0x3F)]);
          sbuf.append(hex[0x80 | (ch & 0x3F)]);
     }
   }
   return sbuf.toString();
 }

}

/**********************************************************/

Are we having fun yet?  In this particular case (a very common case),
1 line of PHP equals over 100 lines of Java.  My KLOC just went
through the roof and my employer suggested I take a very long
vacation, in some remote location.

Thanks again all.
Daniel Pitts - 06 Nov 2007 15:59 GMT
> Fellow Coders:
>
[quoted text clipped - 136 lines]
>
> Thanks again all.

I would bet that you could find an Apache Commons API that does this for
you!

Also, if you were using JSP technology, you could use <c:out
value="${url}" escapeXml="true" />  So, to compare Java code with PHP
doesn't make sense.  PHP is /designed/ for HTML templates, Java is not.
 JSP is, so it has that same functionality that you expect from PHP.

Hope this helps,
Daniel.

Signature

Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>

Steven Simpson - 06 Nov 2007 16:08 GMT
> An exemple, taken from this sitemaps.org:
>
[quoted text clipped - 9 lines]
> of the W3C website (http://www.w3.org/International/O-URL-code.html):
>  

You should really do the %-encoding first, then the &;-encoding, for
symmetry with parsing at the other end, where it will be &;-decoded
first, then %-decoded.

> public class URLUTF8Encoder
>  
[quoted text clipped - 5 lines]
>    *
>  

I think this is too much at this stage.  This space->plus conversion,
and its corresponding '+'->'%XX', must have already been done in order
to form the URI; you can't do it to a complete URI, as the spaces that
became pluses when it was formed will then become %XX.

If you already have the URI, all you're doing now is making it ASCII
compatible.

> Are we having fun yet?  In this particular case (a very common case),
> 1 line of PHP equals over 100 lines of Java.

No, I've just noticed URI.toASCIIString()!
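A minimal sketch of how that could cover the sitemaps.org example, assuming the host and path are available as separate strings: build the URI with a multi-argument constructor, call toASCIIString() to %-encode the non-ASCII characters as UTF-8, and only then do the &;-encoding. The class and method names are hypothetical, and only '&' is entity-escaped here.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SitemapUrlDemo {
    // Step 1: %-encode via URI.toASCIIString().
    // Step 2: entity-escape '&' for the XML file.
    static String toSitemapLoc(String host, String path) {
        try {
            String ascii = new URI("http", host, path, null).toASCIIString();
            return ascii.replace("&", "&amp;");
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // \u00fc is the u-umlaut from the sitemaps.org example.
        // prints http://www.example.com/%C3%BCmlat.php&amp;q=name
        System.out.println(toSitemapLoc("www.example.com", "/\u00fcmlat.php&q=name"));
    }
}
```

A real sitemap writer would also need to escape the apostrophe, which is legal in a URI and so passes through unchanged.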

Signature

ss at comp dot lancs dot ac dot uk                                     |


