Java Forum / General / March 2007
Zero Byte Terminated Strings
PurpleServerMonkey - 28 Mar 2007 04:37 GMT
Hi,
I'm writing a simple UDP server in Java; it's designed to take an initial request packet from a C-based client and perform further actions. The networking side of things is fine; however, I'm having problems dealing with a zero byte terminated string being sent from the client.
Code Snippet:
    byte[] data = new byte[1000];
    DatagramSocket serverSocket = new DatagramSocket(1025);
    DatagramPacket packet = new DatagramPacket(data, data.length);
    serverSocket.receive(packet);
The received packet then gets put onto a queue for pickup by a thread pool. It's in the thread pool that I look at processing the packet and extracting the string information (represents a filename, mode, etc). Note that the strings in this packet are zero byte terminated.
Code Snippet:
    // getData() returns the packet's existing buffer, so no separate allocation is needed
    byte[] payload = packet.getData();
What I'd like to know is, what is the best way to retrieve zero byte terminated strings from the byte array?
Thanks in advance for your assistance.
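One detail about the `packet.getData()` call above: it returns the whole backing buffer (1000 bytes here), not just the bytes that arrived. A minimal sketch, assuming the worker thread only needs the received bytes, copies the range reported by `getOffset()` and `getLength()`. The names `PacketPayload` and `payloadOf` are made up for illustration.

```java
import java.net.DatagramPacket;
import java.util.Arrays;

class PacketPayload {
    // Copies only the bytes that were actually received into a fresh array.
    static byte[] payloadOf(DatagramPacket packet) {
        int start = packet.getOffset();
        return Arrays.copyOfRange(packet.getData(), start, start + packet.getLength());
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[1000];
        byte[] sample = { 'a', '.', 't', 'x', 't', 0 };
        System.arraycopy(sample, 0, buffer, 0, sample.length);
        // Simulates a received packet: a 1000-byte buffer of which only 6 bytes are in use.
        DatagramPacket packet = new DatagramPacket(buffer, sample.length);
        System.out.println(payloadOf(packet).length); // prints 6
    }
}
```

Copying also helps if the receive buffer is reused for the next `receive()` call while earlier packets are still waiting on the queue, since otherwise the queued data could be overwritten.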
Knute Johnson - 28 Mar 2007 05:00 GMT
> Hi,
> [quoted text clipped - 23 lines]
>
> Thanks in advance for your assistance.
Actually very easy to do. Just create a String from your byte[] buffer and split it on the 0s.
public class test {
    public static void main(String[] args) throws Exception {
        byte[] buf = { 0x54, 0x48, 0x49, 0x53, 0x00, 0x49, 0x53, 0x00,
                       0x41, 0x00, 0x54, 0x45, 0x53, 0x54, 0x00 };

        String str = new String(buf);
        String[] arr = str.split("\u0000");

        for (int i = 0; i < arr.length; i++)
            System.out.println(arr[i]);
    }
}
 Signature Knute Johnson email s/nospam/knute/
Adam Maass - 28 Mar 2007 05:51 GMT
> Actually very easy to do. Just create a String from your byte[] buffer
> and split it on the 0s.
[quoted text clipped - 5 lines]
>
> String str = new String(buf);
Ahem, it will be critically important to specify the encoding to the String constructor!
String str = new String(buf, "ASCII");
> String[] arr = str.split("\u0000");
>
> for (int i=0; i<arr.length; i++)
>     System.out.println(arr[i]);
> }
> }
Knute Johnson - 28 Mar 2007 06:44 GMT
>> Actually very easy to do. Just create a String from your byte[]
>> buffer and split it on the 0s.
[quoted text clipped - 17 lines]
>> }
>> }
Only if he doesn't want his system default character set. Mine certainly doesn't default to ASCII, or as it is more correctly known ANSI_X3.4-1968. What character set does your C compiler default to?
PurpleServerMonkey - 28 Mar 2007 06:56 GMT
On Mar 28, 3:44 pm, Knute Johnson <nos...@rabbitbrush.frazmtn.com> wrote:
> >> Actually very easy to do. Just create a String from your byte[]
> >> buffer and split it on the 0s.
[quoted text clipped - 26 lines]
> Knute Johnson
> email s/nospam/knute/
Thanks guys, that has worked a treat.
The client is an old C-based application and is using ASCII encoding; the above info has solved the problem and is working well.
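Putting the two suggestions above together (split on the zero bytes, but name the charset explicitly), a minimal sketch might look like this. It assumes the payload really is ASCII, as stated, and that `length` is the number of bytes actually received rather than the size of the buffer; `SplitOnNul` and `fields` are illustrative names.

```java
import java.nio.charset.StandardCharsets;

class SplitOnNul {
    // Decodes the received bytes as US-ASCII and splits on the NUL character.
    // String.split drops trailing empty strings, so a final terminator adds no extra field.
    static String[] fields(byte[] buf, int offset, int length) {
        String text = new String(buf, offset, length, StandardCharsets.US_ASCII);
        return text.split("\0");
    }

    public static void main(String[] args) {
        // Same bytes as the earlier example: "THIS", "IS", "A", "TEST", each zero terminated.
        byte[] buf = { 0x54, 0x48, 0x49, 0x53, 0x00, 0x49, 0x53, 0x00,
                       0x41, 0x00, 0x54, 0x45, 0x53, 0x54, 0x00 };
        for (String field : fields(buf, 0, buf.length)) {
            System.out.println(field);
        }
    }
}
```

Passing the received length matters because the unused tail of a 1000-byte buffer is itself full of zero bytes.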
Chris Uppal - 28 Mar 2007 07:00 GMT
> > Ahem, it will be critically important to specify the encoding to the
> > String constructor!
[..]
> Only if he doesn't want his system default character set. Mine
> certainly doesn't default to ASCII, or as it is more correctly known
> ANSI_X3.4-1968. What character set does your C compiler default to?
But using the Java system default charset is almost always going to be a bad mistake in this situation. Or do you have a good reason to believe that the default charset of the C compiler installation where the code which generates the UDP packets was compiled will be the same[*] as the default Java charset set on the system where the UDP packets are received?
([*] Note: that is "will be the same", not "is likely to be the same").
Using the default system charset for real data, in production code, is nothing better than lazy and incompetent.
-- chris
Knute Johnson - 28 Mar 2007 07:32 GMT
>>> Ahem, it will be critically important to specify the encoding to the
>>> String constructor!
[quoted text clipped - 15 lines]
>
> -- chris
You know I don't like being called lazy and incompetent this late in the evening. The other fellow mentioned nothing about the character set he was using. Picking one out of a hat is no better than using the system default. Odds are pretty good that system defaults will be the same if used on the same computer, albeit different compilers. Specifying the wrong character set may very well cause it to not work at all. If he said gee this doesn't work for my Chinese clients, they get a bunch of ?????? then you can deal with his character set problems. Or you can force his Chinese clients to use ANSI_X3.4-1968 and they will get ?????? right off the bat.
It's late and this lazy incompetent is going to bed now.
Chris Uppal - 28 Mar 2007 07:39 GMT
[me:]
> > Using the default system charset for real data, in production code, is
> > nothing better than lazy and incompetent.
[quoted text clipped - 3 lines]
> You know I don't like being called lazy and incompetent this late in the
> evening.
You won't see this until tomorrow, and I suppose you'll like it even less then. But I'm afraid that I'm going to stick by my comment, and if -- by implication -- it applies to you, then that's unfortunate because I had meant nothing personal, but I will also stand by the implications.
-- chris
Knute Johnson - 28 Mar 2007 16:12 GMT
> [me:]
>>> Using the default system charset for real data, in production code, is
[quoted text clipped - 10 lines]
>
> -- chris
Computer programs are tools, just like any other tool. They have a cost and a benefit. You can buy a rusty box-end wrench or you can buy a gold plated spanner. They do the same job most of the time. To say that you absolutely have to use the gold plated spanner and that you are lazy and incompetent if you don't is just plain rude.
If the default character set wasn't adequate for his purposes he could easily change it. As it turns out he was happy with the solution provided and it worked just fine.
And now I'm going to take my lazy butt to town.
Chris Uppal - 28 Mar 2007 05:57 GMT
> What I'd like to know is, what is the best way to retrieve zero byte
> terminated strings from the byte array?
There is no easy way to do it. That's to say, the /code/ will be trivially simple once you know what you have to do, but finding out what you have to do will be tricky unless the C programmers who generate the input are unusually knowledgeable.
There is no equivalence between character data and binary data, so one is always turned into the other by using some character encoding or other (often called a "charset" or a "code page"). In Java, when you convert bytes to text (or vice versa) you /always/ have to tell the system what character encoding to use. (There are some "convenience" methods which use a system-default code page, but you should avoid those in most circumstances, and you should /definitely/ avoid them in this case).
So how do you find out what character set has been used by the C programmers? The first thing to do is to ask them. The chances are fairly good that they'll have no idea what you are talking about. If so, then presumably they haven't taken any steps at all to /control/ what code page is being used, and it will be either some system default (if they are generating the text themselves) or whatever character set the /real/ source of the data used.
If they are generating the data themselves, then you can probably get a decent guess as to what character set they are using by running the following little Java program on the machine where they compile their stuff.
public class Main {
    public static void main(String[] args) {
        System.out.println("file.encoding: " + System.getProperty("file.encoding"));
    }
}
That will tell you what character set Java thinks is most likely to be a sensible default for that machine, and it /may/ be correct. On my system today, that name is "Cp1252" (which cognoscenti will recognise as meaning I have a Windows box set up to use an English/Western European character set by default).
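On newer JDKs the `file.encoding` property is less informative than it was: since JDK 18 the default charset is UTF-8 by default regardless of platform, and the platform's own encoding is exposed through the `native.encoding` system property (available from JDK 17). A hedged sketch that prints all three values follows; on older runtimes `native.encoding` will simply show up as null. The class name is illustrative.

```java
import java.nio.charset.Charset;

class DefaultCharsetReport {
    // Collects what the running JVM reports about its default and native encodings.
    static String describe() {
        return "default charset: " + Charset.defaultCharset().name()
            + ", file.encoding: " + System.getProperty("file.encoding")
            + ", native.encoding: " + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

As with the program above, the result is only a guess at what the C side is producing.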
If you can't find any sensible information, then it's probably a good idea to assume that the data is pure ASCII -- which is a 7-bit encoding which (therefore) only defines 128 characters, but those 128 characters are common to all (as far as I know) encodings that your UDP packets are likely to be using. To use that character encoding use an encoding name of "US-ASCII".
Once you have decided what character set is in use, actually decoding it is trivial. Just find the start of the text data in your byte[] buffer (which you must already know how to do), loop down the buffer looking for the terminating byte which has value 0 (but see below), and then pass the resulting data into the String constructor:
    String(byte[] bytes, int offset, int length, String charsetName)
or, if you prefer:
    String(byte[] bytes, int offset, int length, Charset charset)
which will do the conversion for you.
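The loop described above can be sketched as follows. This is one possible shape, not the only one: it assumes a single-byte (or UTF-8) encoding, treats every zero byte as a terminator, and keeps any unterminated bytes at the end as a final field. `ZeroTerminated` and `decodeAll` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ZeroTerminated {
    // Decodes every zero-terminated field found in buf[offset .. offset+length).
    static List<String> decodeAll(byte[] buf, int offset, int length, Charset charset) {
        List<String> fields = new ArrayList<>();
        int end = offset + length;
        int start = offset;
        for (int i = offset; i < end; i++) {
            if (buf[i] == 0) {
                // Decode the bytes between the previous terminator and this one.
                fields.add(new String(buf, start, i - start, charset));
                start = i + 1;
            }
        }
        // Assumption: bytes after the last terminator are kept as an unterminated field.
        if (start < end) {
            fields.add(new String(buf, start, end - start, charset));
        }
        return fields;
    }

    public static void main(String[] args) {
        byte[] buf = { 'T', 'H', 'I', 'S', 0, 'I', 'S', 0, 'A', 0, 'T', 'E', 'S', 'T', 0 };
        System.out.println(decodeAll(buf, 0, buf.length, StandardCharsets.US_ASCII));
        // prints [THIS, IS, A, TEST]
    }
}
```

Unlike splitting a decoded String, this scans the raw bytes, so the charset is only applied to each field after its boundaries are known.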
(The potential gotcha about looking for the value 0 is that it assumes that the data is encoded using an 8-bit (or 7-bit) encoding like "ISO-8859-1", "UTF-8", or "Cp1252", rather than a 16-bit encoding like "UTF-16" -- but that seems a safe bet or even C programmers would know that there was a potential problem and warn you about it.)
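To make that gotcha concrete, here is a small sketch that encodes the same text in a few charsets and checks whether any zero bytes appear. The sample filename and the names `ZeroByteCheck` and `containsZeroByte` are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ZeroByteCheck {
    // Returns true if encoding the text in the given charset yields at least one zero byte.
    static boolean containsZeroByte(String text, Charset charset) {
        for (byte b : text.getBytes(charset)) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String name = "file.txt";
        Charset[] candidates = {
            StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_16BE
        };
        for (Charset cs : candidates) {
            System.out.println(cs.name() + ": " + containsZeroByte(name, cs));
        }
        // UTF-8 and ISO-8859-1 report false; UTF-16BE reports true,
        // because each ASCII character is encoded with a leading zero byte.
    }
}
```

With a 16-bit encoding, scanning for a single zero byte would cut the text off after the first character.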
If you can, I'd advise getting the C people to send a packet containing /all/ the potential 255 non-zero byte values, and then compare what you decode it as with what they expect it to look like. Needless to say, you'll have to be careful about character encoding issues when you do the comparison...
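On the Java side, a test like that could start from something along these lines: build the table of non-zero byte values and find the first value that two candidate charsets decode differently. This is a rough sketch that decodes one byte at a time, so it is only meaningful for single-byte encodings; `ByteTable` and its methods are hypothetical helpers.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteTable {
    // All byte values from 1 to 255, in order (zero is excluded as the terminator).
    static byte[] allNonZeroBytes() {
        byte[] table = new byte[255];
        for (int i = 0; i < table.length; i++) {
            table[i] = (byte) (i + 1);
        }
        return table;
    }

    // Returns the first byte value (1..255) the two charsets decode differently, or -1 if none.
    static int firstDifference(Charset a, Charset b) {
        for (int value = 1; value <= 255; value++) {
            byte[] one = { (byte) value };
            if (!new String(one, a).equals(new String(one, b))) {
                return value;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int diff = firstDifference(StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII);
        System.out.println("first difference at 0x" + Integer.toHexString(diff));
        // prints "first difference at 0x80": the two agree on every 7-bit value
    }
}
```

Comparing the decoded characters by code point, rather than by how they print on a console, sidesteps the encoding issues in the comparison itself.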
-- chris
Adam Maass - 28 Mar 2007 15:32 GMT
>> What I'd like to know is, what is the best way to retrieve zero byte
>> terminated strings from the byte array?
[quoted text clipped - 98 lines]
>
> -- chris
Thank you Chris for a thorough, thoughtful, and detailed response.
If you expect 0-byte terminated strings, you absolutely need to know the character encoding in use; some of the more exotic encodings (to those of us using Latin charsets) will contain 0-bytes that do not indicate the end of a string. If you don't specify the charset and operate on a system that defaults to one of these "exotic" encodings, then the String(byte[]) constructor will not do what you expect.
In short, when dealing with raw bytes that represent character data, you need to know what encoding was used to generate the bytes.
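One way to notice a wrong guess early, sketched here under the assumption that rejecting bad input is preferable to silently decoding it: the String constructors replace undecodable bytes with a substitute character, whereas a `CharsetDecoder` configured with `CodingErrorAction.REPORT` raises an error instead. `StrictDecode` and `decodeStrict` are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class StrictDecode {
    // Decodes the bytes, returning empty if they are not valid in the given charset.
    static Optional<String> decodeStrict(byte[] data, Charset charset) {
        try {
            return Optional.of(charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // 0xE9 is an accented e in ISO-8859-1, but is invalid here as UTF-8 and as US-ASCII.
        byte[] data = { 'c', 'a', 'f', (byte) 0xE9, '!' };
        System.out.println(decodeStrict(data, StandardCharsets.ISO_8859_1).isPresent()); // true
        System.out.println(decodeStrict(data, StandardCharsets.UTF_8).isPresent());      // false
        System.out.println(decodeStrict(data, StandardCharsets.US_ASCII).isPresent());   // false
    }
}
```

A clean decode does not prove the guess was right (most single-byte charsets accept nearly any input), but a failure does prove it was wrong.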