Java Forum / General / November 2007
Ahhh.. URL wants to get encoded. Does Java wanna?
François - 06 Nov 2007 05:04 GMT
Now:
From the URL class documentation (http://java.sun.com/j2se/1.5.0/docs/api/java/net/URL.html):
"Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use URI, and to convert between these two classes using toURI() and URI.toURL().
The URLEncoder and URLDecoder classes can also be used, but only for HTML form encoding, which is not the same as the encoding scheme defined in RFC2396."
I don't care about form encoding. I just want to encode a string into a readable URL (RFC2396: thanks Berners-Lee, you described the web. The implementation is sometimes rough around the edges).
So, from what I understand of the Java 5.0 docs mentioned above, I wrote this line of Java, nonsensical but doc-abiding:
encodedURL = (new URL(urlString)).toURI().toURL();
If I want to get the encoded URL, URL-ready:
encodedURL.toString();
Ahhh... the original string, not encoded at all.
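One possible reading of the result above, as a minimal sketch: the URL Javadoc describes toURI() as functioning like new URI(url.toString()), and the single-argument URI constructor only parses its input, so a string that is already legal comes back unchanged. The multi-argument URI constructors are the ones that quote illegal characters. The class name, method names, host, and paths below are hypothetical, chosen only for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; names and sample values are illustrative only.
class UriQuotingDemo {

    // Parse an already well-formed URI string. Nothing is escaped here:
    // the single-argument constructor only validates and splits the input.
    static String parseOnly(String wellFormed) {
        try {
            return new URI(wellFormed).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Build a URI from separate components. This five-argument constructor
    // (scheme, authority, path, query, fragment) quotes characters that are
    // illegal in a component, such as spaces.
    static String quoteComponents(String authority, String path, String query) {
        try {
            return new URI("http", authority, path, query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Unchanged: there was nothing illegal to quote.
        System.out.println(parseOnly("http://www.example.com/a.php?x=1&y=2"));
        // Spaces become %20; '&' and '=' are legal in a query and stay as-is.
        System.out.println(quoteComponents("www.example.com", "/a b.php", "x=1 2&y=3"));
    }
}
```

Note that neither call turns '&' into anything else: in a query it is a legal separator, so URI leaves it alone.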
Please help, before I go back to PHP and cry sour tears:
echo (urlEncode(http://www.javahhh.com/somephpfile.php?myunrequisitedlove=Java&myworkhorse=PHP));
>>>> http://www.javahhh.com/somephpfile.php?myunrequisitedlove=Java&amp;mywife=PHP
Thanks, fellow coders
François - 06 Nov 2007 05:10 GMT
On Nov 6, 12:04 am, François <francois.x.h...@gmail.com> wrote:
> Now:
> [quoted text clipped - 34 lines]
>
> Thanks, fellow coders
Ohhh: even the Google newsgroup parser understands it:
& = &amp;
I did it! I encoded the ampersand (by hand, mind you). (Sheer brainpower)
Sabine Dinis Blochberger - 06 Nov 2007 09:42 GMT
> > Now:
> > [quoted text clipped - 10 lines]
> > HTML form encoding, which is not the same as the encoding scheme
> > defined in RFC2396."
As I understand it, the correct way is to take the parameter part of your URL (after the '?') and encode it with java.net.URLEncoder.encode(), using UTF-8.
> Ohhh: even Google newsgroup parser understands it:
>
> & = &amp;
>
> I did it! I encoded the ampersand (by hand mind you).(Sheer
> brainpower)
 Signature Sabine Dinis Blochberger
Op3racional www.op3racional.eu
Roedy Green - 06 Nov 2007 05:15 GMT
On Tue, 06 Nov 2007 05:04:05 -0000, François <francois.x.hetu@gmail.com> wrote, quoted or indirectly quoted someone who said:
> Just want to encode a string into a
> readable URL (RFC2396:
See http://mindprod.com/jgloss/urlencoded.html
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Wayne - 06 Nov 2007 06:24 GMT
> On Tue, 06 Nov 2007 05:04:05 -0000, François
> <francois.x.hetu@gmail.com> wrote, quoted or indirectly quoted someone
[quoted text clipped - 4 lines]
>
> see http://mindprod.com/jgloss/urlencoded.html
Roedy,
I just tried using URI; it doesn't seem to escape/encode an ampersand in any part of the URI. Also, what about the new IRIs? A Java program should be robust enough to handle legal URLs/URIs/IRIs, converting the (up to) nine parts of an IRI correctly. My understanding of your (excellent) urlencoded page and the API docs is that this:
URI uri = new URI("http", "//www.example.com/you & I 10%? wierd & wierder", null);
System.out.println( uri.toURL() );
should produce:
http://www.example.com/you%20&%20I%2010%25?%20wierd%20%26%20wierder
But it produces:
http://www.example.com/you%20&%20I%2010%25?%20wierd%20&%20wierder
(The ampersand is not encoded.) What did I do wrong?
-Wayne
Roedy Green - 06 Nov 2007 08:06 GMT
> URI uri = new URI("http", "//www.example.com/you & I 10%? wierd & wierder", null);
> System.out.println( uri.toURL() );
The way I read RFC 2396, the reserved chars
; / ? : @ & = + $ ,
are not supposed to be escaped. Perhaps Patricia could read the RFC and tell us what it really means.
I wish the people who write RFCs would provide examples to illustrate the true meaning of the lawyerese.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Owen Jacobson - 06 Nov 2007 21:32 GMT
On Nov 6, 12:06 am, Roedy Green <see_webs...@mindprod.com.invalid> wrote:
> > URI uri = new URI("http", "//www.example.com/you & I 10%? wierd & wierder", null);
> > System.out.println( uri.toURL() );
[quoted text clipped - 3 lines]
> are not supposed to be escaped. Perhaps Patricia could read the RFC
> and tell us what it really means.
The character & is used in URLs and URIs to separate parts of the query, in which case it should be present as an actual & character. It can also occur inside query parameter names or values, in which case it should be present in encoded form, as the string %26. The example URI Wayne gave (unintentionally) uses ampersands as query separators, which is why the URI class isn't escaping them; if he wants to use them as part of the path or as part of query parameter names or values, he'll have to encode them himself with URLEncoder.encode(String, String) or similar.
Elsewhere within a URI, ampersands are not reserved and do not require encoding, except in the scheme part, where they're simply illegal.
> I wish the people who write RFCs would provide examples to illustrate
> the true meaning of the lawyerese.
The RFCs tend to be a codification of existing practice, rather than a prescription. In the case of the URI RFC it's a little vaguer, since URIs (that are not also URLs) aren't in terribly widespread use and came about as an attempt to normalize URLs, so the RFC could be seen as prescriptive rather than informative.
On the whole, this post-hoc RFC process works well: it gives the people creating prototypes time and freedom to play with ideas and discard the bad ones without prematurely codifying them in a standard. It's not perfect, but then, what is?
-Owen
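Owen's distinction between & as a separator and %26 inside a name or value can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, and URLEncoder produces HTML form encoding (a space becomes '+'), which may or may not be what a given server expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of any library mentioned in this thread.
class QueryBuilderDemo {

    // Form-encode a single parameter name or value as UTF-8.
    static String encodeComponent(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    // Join the encoded pairs with literal '&' and '=' separators.
    // Only the names and values are encoded, never the separators.
    static String buildQuery(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encodeComponent(e.getKey()))
              .append('=')
              .append(encodeComponent(e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("title", "you & I");
        params.put("pct", "10%");
        // The '&' inside the value becomes %26; the separator stays '&'.
        System.out.println(buildQuery(params));
    }
}
```

A query built this way is already escaped, so it should be appended to the rest of the URL as a string. Passing it to a multi-argument URI constructor would quote its '%' characters a second time.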
Wayne - 06 Nov 2007 08:06 GMT
>> On Tue, 06 Nov 2007 05:04:05 -0000, François
>> <francois.x.hetu@gmail.com> wrote, quoted or indirectly quoted someone
[quoted text clipped - 24 lines]
>
> -Wayne
I guess the answer is to encode the query part separately, if needed. The following code seems to work:
import java.io.UnsupportedEncodingException;
import java.net.*;

public String encodeURL ( String initialURL, boolean parseQuery )
      throws MalformedURLException, URISyntaxException, UnsupportedEncodingException
{
   // Parse the URL (without encoding):
   URL url = new URL( initialURL );
   String scheme = url.getProtocol();     // E.g., "http"
   String authority = url.getAuthority(); // E.g., "user@host:port"
   String path = url.getPath();           // E.g., "/foo/bar.htm"
   String query = url.getQuery();         // E.g., "foo=bar" (no leading '?'); null if absent
   // Caution: the URI constructor below always quotes '%', so escapes
   // produced here are quoted again (e.g. %3D ends up as %253D).
   if ( parseQuery && query != null )
      query = URLEncoder.encode( query, "UTF-8" );
   String fragment = url.getRef();        // I.e., the "anchor"

   // Assemble the encoded URL, using URI class to properly
   // encode each part:
   URI uri = new URI( scheme, authority, path, query, fragment );
   return uri.toString();
}
-Wayne
Steven Simpson - 06 Nov 2007 09:33 GMT
> encodedURL = (new URL(urlString)).toURI().toURL();
>
[quoted text clipped - 10 lines]
> http://www.javahhh.com/somephpfile.php?myunrequisitedlove=Java&amp;mywife=PHP
>
The conversion from "&" to "&amp;" is not relevant to URI encoding - it is HTML-encoding (and XML, etc). java.net.URI has no knowledge of this, and should not have. It does not know whether you're going to put the result into an HTML file or something else.
If you're writing out any literal text as part of HTML, including a URI with "&" in it, you independently need an encoding to map "&" to "&amp;", "<" to "&lt;", etc.
 Signature ss at comp dot lancs dot ac dot uk
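The separate markup-escaping step Steven describes could look something like the sketch below. The class and method names are hypothetical; it simply maps the five characters that have predefined XML entities and leaves everything else alone.

```java
// Hypothetical helper for escaping literal text written into HTML or XML.
class XmlEscapeDemo {

    // Replace the characters that have predefined XML entities.
    // This step knows nothing about URIs; it treats its input as plain text.
    static String escapeXml(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The URI itself is unchanged; only its textual form in markup differs.
        System.out.println(escapeXml("http://www.example.com/a.php?x=1&y=2"));
    }
}
```

The design point is that URI construction and markup escaping stay in separate functions, so the same URI string can be written into HTML, XML, or plain text without being rebuilt.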
François - 06 Nov 2007 13:37 GMT
Fellow Coders:
Thanks very much. Read all the fine answers, which put me on the right track.
The simple goal is to produce a sitemap.xml, with each URL properly encoded (as specified at http://sitemaps.org/protocol.php).
An example, taken from sitemaps.org:
http://www.example.com/ümlat.php&q=name
should become
http://www.example.com/%C3%BCmlat.php&amp;q=name
(If the above is encoded by the Google Groups parser: ümlat => %C3%BCmlat and & => &amp;)
The code snippet posted by Wayne works OK, but the ümlaut stays an ümlaut, since it is not part of the query.
And we can't URLEncode the whole string, since forward slashes and other valid characters would be turned into percent-escapes.
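To make that last point concrete, here is a small sketch of what happens when URLEncoder is applied to a complete URL. The class name and sample URL are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo: form-encoding a whole URL instead of one component.
class WholeUrlEncodeDemo {

    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The ':' and '/' delimiters are escaped too, so the result is no
        // longer usable as a URL: http%3A%2F%2Fwww.example.com%2Fa+b.php
        System.out.println(formEncode("http://www.example.com/a b.php"));
    }
}
```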
The brute-force way to do it (and the only way I found that works with the sitemaps.org example) is to take the initial string, walk through every single char, replace <, >, & and " with their escaped versions (&amp;, &lt;, etc.) as Steven indicated, and finally check that each remaining char is within the range \u0000 to \u007F (the Basic Latin block), encoding any char outside that range with this class, taken straight from the W3C website (http://www.w3.org/International/O-URL-code.html):
/**********************************************************/
public class URLUTF8Encoder {
  final static String[] hex = {
    "%00", "%01", "%02", "%03", "%04", "%05", "%06", "%07",
    "%08", "%09", "%0a", "%0b", "%0c", "%0d", "%0e", "%0f",
    "%10", "%11", "%12", "%13", "%14", "%15", "%16", "%17",
    "%18", "%19", "%1a", "%1b", "%1c", "%1d", "%1e", "%1f",
    "%20", "%21", "%22", "%23", "%24", "%25", "%26", "%27",
    "%28", "%29", "%2a", "%2b", "%2c", "%2d", "%2e", "%2f",
    "%30", "%31", "%32", "%33", "%34", "%35", "%36", "%37",
    "%38", "%39", "%3a", "%3b", "%3c", "%3d", "%3e", "%3f",
    "%40", "%41", "%42", "%43", "%44", "%45", "%46", "%47",
    "%48", "%49", "%4a", "%4b", "%4c", "%4d", "%4e", "%4f",
    "%50", "%51", "%52", "%53", "%54", "%55", "%56", "%57",
    "%58", "%59", "%5a", "%5b", "%5c", "%5d", "%5e", "%5f",
    "%60", "%61", "%62", "%63", "%64", "%65", "%66", "%67",
    "%68", "%69", "%6a", "%6b", "%6c", "%6d", "%6e", "%6f",
    "%70", "%71", "%72", "%73", "%74", "%75", "%76", "%77",
    "%78", "%79", "%7a", "%7b", "%7c", "%7d", "%7e", "%7f",
    "%80", "%81", "%82", "%83", "%84", "%85", "%86", "%87",
    "%88", "%89", "%8a", "%8b", "%8c", "%8d", "%8e", "%8f",
    "%90", "%91", "%92", "%93", "%94", "%95", "%96", "%97",
    "%98", "%99", "%9a", "%9b", "%9c", "%9d", "%9e", "%9f",
    "%a0", "%a1", "%a2", "%a3", "%a4", "%a5", "%a6", "%a7",
    "%a8", "%a9", "%aa", "%ab", "%ac", "%ad", "%ae", "%af",
    "%b0", "%b1", "%b2", "%b3", "%b4", "%b5", "%b6", "%b7",
    "%b8", "%b9", "%ba", "%bb", "%bc", "%bd", "%be", "%bf",
    "%c0", "%c1", "%c2", "%c3", "%c4", "%c5", "%c6", "%c7",
    "%c8", "%c9", "%ca", "%cb", "%cc", "%cd", "%ce", "%cf",
    "%d0", "%d1", "%d2", "%d3", "%d4", "%d5", "%d6", "%d7",
    "%d8", "%d9", "%da", "%db", "%dc", "%dd", "%de", "%df",
    "%e0", "%e1", "%e2", "%e3", "%e4", "%e5", "%e6", "%e7",
    "%e8", "%e9", "%ea", "%eb", "%ec", "%ed", "%ee", "%ef",
    "%f0", "%f1", "%f2", "%f3", "%f4", "%f5", "%f6", "%f7",
    "%f8", "%f9", "%fa", "%fb", "%fc", "%fd", "%fe", "%ff"
  };

  /**
   * Encode a string to the "x-www-form-urlencoded" form, enhanced
   * with the UTF-8-in-URL proposal. This is what happens:
   *
   * <ul>
   * <li><p>The ASCII characters 'a' through 'z', 'A' through 'Z',
   *        and '0' through '9' remain the same.
   *
   * <li><p>The unreserved characters - _ . ! ~ * ' ( ) remain the same.
   *
   * <li><p>The space character ' ' is converted into a plus sign '+'.
   *
   * <li><p>All other ASCII characters are converted into the
   *        3-character string "%xy", where xy is
   *        the two-digit hexadecimal representation of the character
   *        code
   *
   * <li><p>All non-ASCII characters are encoded in two steps: first
   *        to a sequence of 2 or 3 bytes, using the UTF-8 algorithm;
   *        secondly each of these bytes is encoded as "%xx".
   * </ul>
   *
   * @param s The string to be encoded
   * @return The encoded string
   */
  public static String encode(String s)
  {
    StringBuffer sbuf = new StringBuffer();
    int len = s.length();
    for (int i = 0; i < len; i++) {
      int ch = s.charAt(i);
      if ('A' <= ch && ch <= 'Z') {          // 'A'..'Z'
        sbuf.append((char)ch);
      } else if ('a' <= ch && ch <= 'z') {   // 'a'..'z'
        sbuf.append((char)ch);
      } else if ('0' <= ch && ch <= '9') {   // '0'..'9'
        sbuf.append((char)ch);
      } else if (ch == ' ') {                // space
        sbuf.append('+');
      } else if (ch == '-' || ch == '_'      // unreserved
          || ch == '.' || ch == '!'
          || ch == '~' || ch == '*'
          || ch == '\'' || ch == '('
          || ch == ')') {
        sbuf.append((char)ch);
      } else if (ch <= 0x007f) {             // other ASCII
        sbuf.append(hex[ch]);
      } else if (ch <= 0x07FF) {             // non-ASCII <= 0x7FF
        sbuf.append(hex[0xc0 | (ch >> 6)]);
        sbuf.append(hex[0x80 | (ch & 0x3F)]);
      } else {                               // 0x7FF < ch <= 0xFFFF
        sbuf.append(hex[0xe0 | (ch >> 12)]);
        sbuf.append(hex[0x80 | ((ch >> 6) & 0x3F)]);
        sbuf.append(hex[0x80 | (ch & 0x3F)]);
      }
    }
    return sbuf.toString();
  }
}
/**********************************************************/
Are we having fun yet? In this particular case (a very common case), 1 line of PHP equals over 100 lines of Java. My KLOC just went through the roof and my employer suggested I take a very long vacation, in some remote location.
Thanks again all.
Daniel Pitts - 06 Nov 2007 15:59 GMT
> Fellow Coders:
>
[quoted text clipped - 136 lines]
>
> Thanks again all.
I would bet that you could find an Apache Commons API that does this for you!
Also, if you were using JSP technology, you could use
<c:out value="${url}" escapeXml="true" />
So comparing Java code with PHP doesn't make sense. PHP is /designed/ for HTML templates; Java is not. JSP is, so it has the same functionality that you expect from PHP.
Hope this helps, Daniel.
 Signature Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
Steven Simpson - 06 Nov 2007 16:08 GMT
> An exemple, taken from this sitemaps.org:
>
[quoted text clipped - 9 lines]
> of the W3C website (http://www.w3.org/International/O-URL-code.html):
>
You should really do the %-encoding first, then the &;-encoding, for symmetry with parsing at the other end, where it will be &;-decoded first, then %-decoded.
> public class URLUTF8Encoder
>
[quoted text clipped - 5 lines]
> *
>
I think this is too much at this stage. This space->plus conversion, and its corresponding '+'->'%XX', must have already been done in order to form the URI; you can't do it to a complete URI, as the spaces that became pluses when it was formed will then become %XX.
If you already have the URI, all you're doing now is making it ASCII compatible.
> Are we having fun yet? In this particular case (a very common case),
> 1 line of PHP equals over 100 lines of Java.
No, I've just noticed URI.toASCIIString()!
 Signature ss at comp dot lancs dot ac dot uk
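Steven's closing pointer to URI.toASCIIString() can be tried against the sitemaps.org example roughly as follows. This is a sketch, not a complete sitemap writer: the class and method names are hypothetical, the host and path are passed in separately so that the four-argument URI constructor (scheme, host, path, fragment) can quote them, and the markup step handles only '&' and the apostrophe.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo of percent-encoding first, then entity-escaping.
class SitemapLocDemo {

    // Step 1: build the URI and make it ASCII-only. toASCIIString()
    // percent-encodes non-ASCII characters as UTF-8 bytes.
    static String toAsciiUri(String host, String path) {
        try {
            return new URI("http", host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Step 2: entity-escape for XML. '&' must be replaced first so the
    // ampersands introduced by later replacements are not escaped again.
    static String forXml(String asciiUri) {
        return asciiUri.replace("&", "&amp;").replace("'", "&apos;");
    }

    public static void main(String[] args) {
        // \u00FC is the u-umlaut from the sitemaps.org example.
        String ascii = toAsciiUri("www.example.com", "/\u00FCmlat.php&q=name");
        System.out.println(ascii);          // percent-encoded, '&' untouched
        System.out.println(forXml(ascii));  // ready for a <loc> element
    }
}
```

The order follows Steven's advice: percent-encoding happens while the URI is formed, and entity-escaping is applied last, when the finished URI is written into the XML file.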