
Java Forum / General / September 2007

Read binary data file

Windsor.Locks@gmail.com - 29 Aug 2007 20:52 GMT
I am a C++ programmer, working on a Java program. I need to read a
binary file using Java.

Here is how I read it in C++,

struct SOME_DATA
{
unsigned long data1;
unsigned short data2;
unsigned short data3;
unsigned long data4;
};

struct SOME_DATA someData;

and read using

fread(&someData, 12, 1, inputFile);

Please give me some pointers, how do I read this using Java? Thanks.
BTW, those are not the variable names I use in my program.
Joshua Cranmer - 29 Aug 2007 21:07 GMT
> I am a C++ programmer, working on a Java program. I need to read a
> binary file using Java.

InputStream is = new FileInputStream("file/name.txt");
byte[] data = new byte[12];
is.read(data);

That reads 12 bytes of data into data. Alternatively, you can grab
byte-by-byte or use only part of the buffer. See the JavaDocs for
java.io.InputStream for more information.
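
One caveat: InputStream.read(byte[]) is allowed to return before the buffer is full, so a single call is not guaranteed to deliver all 12 bytes. A minimal sketch of a safer variant follows, using DataInputStream.readFully(). The class name ReadRecord and the in-memory stream standing in for the real file are illustrative assumptions, not part of the original post.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadRecord {
    // Assumption: one record is 12 bytes, the size passed to fread() in the question.
    static final int RECORD_SIZE = 12;

    // Fills the whole buffer, or throws EOFException if the stream ends early.
    static byte[] readRecord(InputStream in) throws IOException {
        byte[] data = new byte[RECORD_SIZE];
        new DataInputStream(in).readFully(data);
        return data;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a FileInputStream opened on the real data file.
        InputStream in = new ByteArrayInputStream(new byte[RECORD_SIZE]);
        System.out.println(readRecord(in).length + " bytes read");
    }
}
```

With a real file, the ByteArrayInputStream would be replaced by a FileInputStream; a short file then fails loudly instead of leaving part of the buffer unfilled.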

Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth

shakah - 29 Aug 2007 21:37 GMT
On Aug 29, 3:52 pm, Windsor.Lo...@gmail.com wrote:
> I am a C++ programmer, working on a Java program. I need to read a
> binary file using Java.
[quoted text clipped - 18 lines]
> Please give me some pointers, how do I read this using Java? Thanks.
> BTW, those are not the variable names I use in my program.

It's never a good idea portability-wise to write structs in binary
format (e.g. how do you deal with packing, different CPU
architectures, etc.?), but ignoring that for now you could naively do
something like the following. Note that this only works on big-endian
machines, and is probably unreliable there anyway.

jc@soyuz:~/tmp/binrw$ cat main.cpp
#include <stdio.h>

int main(int /*argc*/, char **argv) {
 struct SOME_DATA {
   unsigned long data1 ;
   unsigned short data2 ;
   unsigned short data3 ;
   unsigned long data4 ;
 } ;

 SOME_DATA someData = { 1, 2, 3, 4 } ;

 FILE *fh = fopen(argv[1], "wb") ;
 fwrite(&someData, sizeof(someData), 1, fh) ;
 fclose(fh) ;

 return 0 ;
}
jc@soyuz:~/tmp/binrw$ g++ -W -Wall -pedantic -o test main.cpp
jc@soyuz:~/tmp/binrw$ ./test test2.file
jc@soyuz:~/tmp/binrw$ cat test.java
public class test {
 public static void main(String [] args)
   throws java.io.IOException {
   java.io.DataInputStream dis
      = new java.io.DataInputStream(
          new java.io.FileInputStream(
            new java.io.File(
              args[0]
            )
          )
        ) ;
   System.out.println("data1: " + dis.readInt()) ;
   System.out.println("data2: " + dis.readShort()) ;
   System.out.println("data3: " + dis.readShort()) ;
   System.out.println("data4: " + dis.readInt()) ;
 }
}
jc@soyuz:~/tmp/binrw$ javac test.java
jc@soyuz:~/tmp/binrw$ java -classpath . test test2.file
data1: 1
data2: 2
data3: 3
data4: 4

For reference, duplicating the above on an Intel box yields:
jc@jc-ubuntu:~/tmp/binrw$ java test test.file
data1: 16777216
data2: 512
data3: 768
data4: 67108864
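
The Intel numbers are the same four values with their bytes reversed: DataInputStream always reads big-endian, while the x86 box wrote little-endian. A small sketch of undoing that with Integer.reverseBytes() and Short.reverseBytes() follows. The class and method names are illustrative, and it assumes the fields really are 32-bit and 16-bit quantities.

```java
class SwapBytes {
    // Reverse the byte order of a 32-bit value read with readInt().
    static int swap(int v) {
        return Integer.reverseBytes(v);
    }

    // Reverse the byte order of a 16-bit value read with readShort().
    static short swap(short v) {
        return Short.reverseBytes(v);
    }

    public static void main(String[] args) {
        // The four values printed by the Intel run.
        System.out.println("data1: " + swap(16777216));    // prints data1: 1
        System.out.println("data2: " + swap((short) 512)); // prints data2: 2
        System.out.println("data3: " + swap((short) 768)); // prints data3: 3
        System.out.println("data4: " + swap(67108864));    // prints data4: 4
    }
}
```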
Windsor.Locks@gmail.com - 29 Aug 2007 21:54 GMT
> On Aug 29, 3:52 pm, Windsor.Lo...@gmail.com wrote:
>
[quoted text clipped - 26 lines]
> something like the following. Note that this only works on big-endian
> machines, and is probably unreliable there anyway.

Thanks for your reply. I do not have any say in the file format or how
the file is written. My requirement is to read this file and get the data
out of it. There is nothing more I can do.
Hunter Gratzner - 29 Aug 2007 22:13 GMT
On Aug 29, 10:54 pm, Windsor.Lo...@gmail.com wrote:
> Thanks for your reply. I do not have any say in the file format or how
> the file is written. My requirement is to read this file and get the data
> out of it. There is nothing more I can do.

Then the one "defining" this data format has no f.cking clue. C/C++
structs have no well defined binary layout, except the order of
elements. C/C++ integer data types have no well defined binary
representation and no well defined size, except a minimum value range.
~kurt - 29 Aug 2007 23:54 GMT
> On Aug 29, 10:54 pm, Windsor.Lo...@gmail.com wrote:
>> Thanks for your reply. I do not have any say in the file format or how
[quoted text clipped - 5 lines]
> elements. C/C++ integer data types have no well defined binary
> representation and no well defined size, except a minimum value range.

And you are missing the point.  There are many legacy systems out there
that make plenty of assumptions, and have been working just fine for longer
than Java has even existed.

Instead of telling us the one defining the data format has no clue (which
you are wrong about), why don't you explain your solution to writing a binary
file in C/C++, FORTRAN, or whatever, that will solve all the academic issues
you have just brought up.

Reading binary files is almost always tricky, especially when you move from
one platform, OS, or language to the next.  There is no way to circumvent
this.  It is the price you pay to have the data in a binary format.  Java
does, at least, make it portable across platforms and OSs - but not languages.
If you are reading a binary file created outside of Java, then you are going
to need to create a custom reader for this data.  What is really annoying is
when you don't even know the endianness or the size of the values (16 bit,
32 bit?) and need to experiment to get it right.

I've had to do this numerous times myself.  The worst was for one application
that was written in a version of FORTRAN that would put an arbitrarily sized
(arbitrary as far as I could tell) header after each record that was written
(turning off this header was a compile time option, that I seem to remember
would make accessing the file less efficient, or something).  I wanted to
read it directly into Matlab - not an easy task.

- Kurt
Lew - 30 Aug 2007 02:22 GMT
Windsor.Locks@gmail.com wrote:
>>> I do not have any say in the file format or how
>>> the file is written. My requirement is to read this file and get the data

<http://java.sun.com/javase/6/docs/api/java/nio/ByteBuffer.html>
<http://java.sun.com/javase/6/docs/api/java/nio/ByteBuffer.html#order(java.nio.ByteOrder)>

Lew

Hunter Gratzner - 30 Aug 2007 20:23 GMT
> > On Aug 29, 10:54 pm, Windsor.Lo...@gmail.com wrote:
> >> Thanks for your reply. I do not have any say in the file format or how
[quoted text clipped - 7 lines]
>
> And you are missing the point.

No, I don't. A C struct is not a suitable, unambiguous format
specification, binary or otherwise. That's the whole point. Giving
someone just a C struct and telling him to implement it in Java is a
pointless stupid act. It indicates that the one giving this file
format "definition" has no f.cking clue what he is doing.

> Instead of telling us the one defining the data format has no clue (which
> you are wrong about), why don't you explain your solution to writing a binary
> file in C/C++, FORTRAN, or whatever, that will solve all the academic issues
> you have just brought up.

I did that previously in this same thread, but you are apparently
more interested in picking a fight.

> Reading binary files is almost always tricky, especially when you move from
> one platform, OS, or language to the next.  There is no way to circumvent
> this.

Sure there is. By having an unambiguous format specification. A C struct
is not an unambiguous format specification.

> It is the price you pay to have the data in a binary format.

No, it is the price to pay when some fuckwit thinks that writing C
structs 1:1 to memory is a good idea.

There is no difference between a binary and a text format if you need
to move between platforms. Either the format is unambiguously defined,
then it's a straightforward job to implement it, or it isn't.

> What is really annoying is
> when you don't even know the endianness or the size of the values (16 bit,
> 32 bit?) and need to experiment to get it right.

And why do you then think a C struct is a good definition of a binary
format?
Mike Schilling - 30 Aug 2007 23:38 GMT
>  C/C++ integer data types have no well defined binary
> representation and no well defined size, except a minimum value range.

And the presence or absence of between-field padding isn't always
guaranteed.  Still, if the files don't have to be cross-platform, reading
and writing structs will work just fine.  Note: the *application* can be
portable across platforms, so long as the (for example) Solaris/Sparc
version won't have to read files written by the Windows/Intel version.
~kurt - 31 Aug 2007 01:18 GMT
> No, I don't. A C struct is not a suitable, unambiguous format
> specification, binary or otherwise. That's the whole point. Giving
> someone just a C struct and telling him to implement it in Java is a
> pointless stupid act. It indicates that the one giving this file
> format "definition" has no f.cking clue what he is doing.

It is hardly pointless.  Most of the time, there is no format specification
because binary data is often not written with the intention of being used
outside of the application that writes it.  Only later does an outside user
have a need for the data, and then one has to often reverse engineer a
solution.  A C struct at least gives you an idea as to what type of data is in
the file.  Knowing what platform it was written in helps out even more.

>> Instead of telling us the one defining the data format has no clue (which
>> you are wrong about), why don't you explain your solution to writing a binary
[quoted text clipped - 3 lines]
> I did that previously in this same thread, but you are apparently
> more interested in picking a fight.

I'm put off by your attitude that what the OP has to work with is due to
someone who has no clue.  If you are saying a C structure makes a bad
ICD, then I agree with you.  But, binary files are often not written with
portability in mind, and the implementation details exist only in the code
that reads/writes the data.  There is nothing wrong with that when the
original intent of the data was for internal use only - and that is often
the case.  Then, seeing how the data is read into a C structure is invaluable.

The solution I saw you post was an example of how to read the data.  I didn't
see anything but bitching regarding the data source.

> No, it is the price to pay when some fuckwit thinks that writing C
> structs 1:1 to memory is a good idea.

It is often the only reasonable idea, depending on the original intent of
the data.  Like I said, I didn't see a better solution posted by you
on how to do this.  Creating unnecessary ICDs is a bad thing.

> And why do you then think a C struct is a good definition of a binary
> format?

It works as well as anything else for many uses.  If you write a specification
describing how many bytes a number is supposed to take up, and the endianness, and
the data is only to be used internally, then you are creating extra work for
yourself when you port the code to other platforms (of course, you want to use
sizeof when reading in the structure instead of hard coding the size).

- Kurt
Charles - 31 Aug 2007 05:41 GMT
> > No, I don't. A C struct is not a suitable, unambiguous format
> > specification, binary or otherwise. That's the whole point. Giving
[quoted text clipped - 45 lines]
>
> - Kurt

Dear Friends (when did you guys become my friends?)

Let's review what the OP stated

A struct is given in C++

Data needs to be read from a file in Java.

You have the following data types

unsigned long
unsigned short

As previously stated by other posters, the endianness of the platform
that wrote the file affects how the output file is encoded. I assume
this to be true but have not verified it.

We assume all unsigned longs and unsigned shorts will ALWAYS have the
same byte size.

The complete struct is given as

unsigned long data1;
unsigned short data2;
unsigned short data3;
unsigned long data4;

Can we also assume that the data will always be sequenced as described
in the STRUCT?
I don't see any argument why the data will be out of sequence as
defined in the STRUCT.

Does the input file get modified when it is transported from one
operating system to another?
I assume NO. This is not verified.

Are there equivalents of unsigned long and unsigned short in Java?
Are they the same byte size?
Do they encode the data the same?

Try to read in Java and verify with known data. If you don't know any
of the data values this becomes a harder task.
Lew - 31 Aug 2007 12:21 GMT
> Let's review what the OP stated
>
[quoted text clipped - 25 lines]
> I don't see any argument why the data will be out of sequence as
> defined in the STRUCT.

But we do not know the padding, and the OP doesn't know what those sizes are,
nor the endianness of their files.  They don't even know in what format the
floating-point values are stored: IEEE?  We need all that information to craft
a Java equivalent, and we don't have it.  The OP doesn't have it, by their
account.

> Does the input file get modified when it is transported from one
> operating system to another?
> I assume NO. This is not verified.

But if endianness and padding matter, the fact that it is not modified will
make it unreadable on the second system.

> Are there equivalents of unsigned long and unsigned short in Java?

No.
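
There is no unsigned integer type, but the values can still be recovered by reading the signed type of the same width and widening it with a mask. A minimal sketch follows, assuming a 16-bit unsigned short and a 32-bit unsigned long as the 12-byte struct suggests. The class and method names are illustrative.

```java
class Unsigned {
    // Interpret the 16 bits of a Java short as an unsigned value (0..65535).
    static int fromUnsignedShort(short raw) {
        return raw & 0xFFFF;
    }

    // Interpret the 32 bits of a Java int as an unsigned value (0..4294967295).
    static long fromUnsignedInt(int raw) {
        return raw & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        System.out.println(fromUnsignedShort((short) -1)); // prints 65535
        System.out.println(fromUnsignedInt(-1));           // prints 4294967295
    }
}
```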

> Are they the same byte size?

We do not know.  The OP hasn't given us enough information.

> Do they encode the data the same?

We do not know.  The OP hasn't given us enough information.

> Try to read in Java and verify with known data. If you don't know any
> of the data values this becomes a harder task.

It's already impossible based on the information given.  How much harder can
it get?

Martin Gregorie - 31 Aug 2007 13:53 GMT
> It's already impossible based on the information given.  How much harder
> can it get?

If the OP *MUST* move binary data, at least do it in a platform and
language-independent manner and use ASN.1 encoding.

martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

~kurt - 01 Sep 2007 02:18 GMT
> If the OP *MUST* move binary data, at least do it in a platform and
> language-independent manner and use ASN.1 encoding.

I understand Hunter's comments, and while I don't know much about
ASN.1 encoding, what I am pointing out is that binary files are usually
*not* intended to be used across systems.  Every binary data file I have
ever worked with was intended to be used either by the program that wrote
it, or separate applications that used the same utility libraries as the
application which wrote the data.  There is nothing wrong with simply writing
the C structure to a file, and reading it in the same way.  In this case
the code, and not some specification, drives the format of the data - and there
is *nothing* wrong with this.  The lack of a need to share the data outside of
the application is what often drives the decision to use binary data in the
first place (why not take advantage of the efficiency binary files have to
offer).

Of course, every once in a while an outside user decides they want to use this
data.  Well, then they have a choice.  Either generate it themselves, or
spend a few hours writing something that can read it in - not a big price
to pay.

- Kurt
Esmond Pitt - 01 Sep 2007 11:45 GMT
> I understand Hunter's comments, and while I don't know much about
> ASN.1 encoding, what I am pointing out is that binary files are usually
> *not* intended to be used across systems.

Except for all the ones that are, e.g. protocol dumps; databases;
interpretive pseudo-code (e.g. .class files), ...

>  Every binary data file I have
> ever worked with was intended to be used either by the program that wrote
> it, or separate applications that used the same utility libraries as the
> application which wrote the data.

Except for the ones that aren't: e.g. protocol dumps; databases;
interpretive pseudo-code (e.g. .class files), ...

>  There is nothing wrong with simply writing
> the C structure to a file, and reading it in the same way.  In this case
> the code, and not some specification, drives the format of the data - and there
> is *nothing* wrong with this.

There is plenty wrong with this. The format of binary data written
directly from a struct in memory depends on at least the following:

- the host hardware
- the compiler
- the compiler version
- the surrounding #pragmas
- the compiler options that were in effect when the binary that wrote
the file was compiled

This is too many dependencies, on too many things that can't be controlled.

The only time writing a struct from memory to a file or a network can
sanely be justified is when the target application is constructed with
the same version of the same object file that wrote it. And this is not
a guarantee that in general can be met.
Mike Schilling - 01 Sep 2007 16:55 GMT
>> I understand Hunter's comments, and while I don't know much about
>> ASN.1 encoding, what I am pointing out is that binary files are
>> usually *not* intended to be used across systems.
>
> Except for all the ones that are, e.g. protocol dumps; databases;
> interpretive pseudo-code (e.g. .class files), ...

How often do database *files* get moved from one system to another?  In my
experience, they stay on the server where the DBMS engine is running.
Arne Vajhøj - 01 Sep 2007 22:59 GMT
>>> I understand Hunter's comments, and while I don't know much about
>>> ASN.1 encoding, what I am pointing out is that binary files are
[quoted text clipped - 4 lines]
> How often do database *files* get moved from one system to another?  In my
> experience, they stay on the server where the DBMS engine is running.

It has been attempted occasionally.

It is usually not supported and often it does not work.

Arne
~kurt - 01 Sep 2007 19:12 GMT
> The only time writing a struct from memory to a file or a network can

Who is talking about writing data to a network?

> sanely be justified is when the target application is constructed with
> the same version of the same object file that wrote it. And this is not
> a guarantee that in general can be met.

Uh, this is pretty much what I just said other than I see no need for
the "guarantee" part - it is not necessary unless the *intent* is to
distribute the data externally.

As I said, my gripe is in calling the originator of the OP's data clueless.
That statement is simply clueless itself.  Yes, if the original program had
been written in Java, then maybe that statement would be true.  But this
is a C++ program.  The data files are most likely "private", only to be
used internally.  Sure, if you port the code to another platform, the
binary files between the two versions may not be compatible, but so what -
that usually isn't a problem.  The new code will create binary files that
are compatible with itself.  Creating some external specification that this
binary data must meet would be stupid because then, if you did port the
code, now you may have to modify it to be compatible with the original
specification, and this may require more processing of the data.  Suddenly,
some specification is driving internal data, and robbing some degree of
performance from the application.

Just because a bureaucrat comes along some time down the road and says
"thou shalt write a Java program (not that Java is the best solution in
this case, but because it is the 'in' thing to do) that will use Program X's
internal data files" does not mean Program X was poorly designed.

- Kurt
Mike Schilling - 01 Sep 2007 19:39 GMT
>> The only time writing a struct from memory to a file or a network can
>
[quoted text clipped - 23 lines]
> internal data, and robbing some degree of performance from the
> application.

The danger is that a different compiler (or different version of the same
compiler) would cause an incompatibility. The good news is that compiler
vendors tend not to change struct layouts for that very reason.  Still, this
needs to be kept in mind and tested for whenever that sort of change is
made.

Another point, not yet mentioned (or if it has been, I missed that post.)
Any structured data that's saved persistently should contain a version
number.  If it never changes, you've added a small amount of overhead.  When
it does change, it's now straightforward to convert older versions and
recognize new ones, which, without the explicit versioning, can be difficult
or impossible.
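
As a rough illustration of the versioning idea, here is a minimal sketch in which every record starts with a version number and the reader dispatches on it. The record layouts, the version values, and the class name VersionedRecord are all hypothetical, chosen only to show the pattern.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class VersionedRecord {
    // Hypothetical: version 1 stored a 16-bit value, version 2 stores 32 bits.
    static final int CURRENT_VERSION = 2;

    // Writes the version first, then the payload in the current layout.
    static byte[] write(int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(CURRENT_VERSION);
        out.writeInt(value);
        out.flush();
        return bytes.toByteArray();
    }

    // Reads the version, then decodes the payload with the matching layout.
    static int read(byte[] record) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
        int version = in.readInt();
        switch (version) {
            case 1:
                return in.readShort(); // older, narrower layout
            case 2:
                return in.readInt();
            default:
                throw new IOException("Unsupported record version " + version);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read(write(42))); // prints 42
    }
}
```

The cost is four bytes per record (or per file, if the version is stored once in a header); the benefit is that an unknown version is rejected explicitly instead of being misread.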
Martin Gregorie - 01 Sep 2007 23:42 GMT
> The danger is that a different compiler (or different version of the same
> compiler) would cause an incompatibility. The good news is that compiler
> vendors tend not to change struct layouts for that very reason.  Still, this
> needs to be kept in mind and tested for whenever that sort of change is
> made.

Actually, there's a more subtle way of failing that can bite an
executable that reloads data that it wrote itself: there's not
necessarily a guarantee that the chunks of data will be read back to the
same virtual memory address that it was saved from so it had better not
contain pointers that are expected to remain valid.

I've been there: I had a program that did lookups on a few hundred
million phone numbers. It used a B-tree for in-memory lookups: the same
lookup using a database wouldn't run faster than 700 lookups/second and
we needed 3000, hence the B-tree which ran at 25,000/second. BUT startup
took 40 minutes to populate the B-tree from the database, so I saved the
B-tree by simply dumping its dataspace to files that were reloaded on
startup. The B-tree grew continuously, so it was split over a number of
multi-megabyte memory chunks: each was written to a separate file.
Reloading these reduced startup time to under 5 minutes. However, the
first iteration merely crashed because the OS (a Mach-based UNIX) didn't
reload the chunks into the same places in my process's virtual memory,
so the pointers were so much junk. FWIW the fix was to replace standard
pointers with my own addressing scheme: this occupied the same space,
but replaced pointers with structs containing two fields,
chunkno:chunk_offset. This sidestepped the problem and ran acceptably fast.
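
The original was C, but the chunkno:chunk_offset idea can be sketched in Java as well. In this sketch a reference is packed into one 32-bit int, with 8 bits for the chunk number and 24 bits for the offset. That split is an assumption made for illustration (the post does not give the field widths), as are the class and method names.

```java
class ChunkRef {
    // Assumed split: 8 bits of chunk number, 24 bits of offset within the chunk.
    static final int OFFSET_BITS = 24;
    static final int OFFSET_MASK = (1 << OFFSET_BITS) - 1;

    // Pack a (chunk number, offset) pair into a single pointer-sized value.
    static int pack(int chunkNo, int offset) {
        if (chunkNo < 0 || chunkNo > 0xFF || offset < 0 || offset > OFFSET_MASK) {
            throw new IllegalArgumentException("chunk number or offset out of range");
        }
        return (chunkNo << OFFSET_BITS) | offset;
    }

    static int chunkNo(int ref) {
        return ref >>> OFFSET_BITS; // unsigned shift, so chunk 128..255 still works
    }

    static int offset(int ref) {
        return ref & OFFSET_MASK;
    }

    // Follow a reference: valid no matter where the chunks were reloaded.
    static byte resolve(byte[][] chunks, int ref) {
        return chunks[chunkNo(ref)][offset(ref)];
    }

    public static void main(String[] args) {
        byte[][] chunks = { new byte[4], {10, 20, 30} };
        int ref = pack(1, 2);
        System.out.println(resolve(chunks, ref)); // prints 30
    }
}
```

Because a reference names a chunk and a position inside it rather than an absolute address, it survives being written to disk and reloaded at a different location.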

I know this is somewhat OT for c.j.j.p but knowing about it may save
somebody's hide one of these days.

Mike Schilling - 02 Sep 2007 00:39 GMT
> I've been there: I had a program that did lookups on a few hundred
> million phone numbers. It used a B-tree for in-memory lookups: the
[quoted text clipped - 14 lines]
> chunkno:chunk_offset. This sidestepped the problem and ran acceptably
> fast.

On some OS's you could have created a memory-mapped file at whatever address
you provided, which lets you both use absolute addresses and avoid the
startup overhead by letting the file page itself in.  Yours is a nice "with
simple tools" solution.
Gordon Beaton - 02 Sep 2007 07:44 GMT
> On some OS's you could have created a memory-mapped file at whatever
> address you provided, which lets you both use absolute addresses and
> avoid the startup overhead by letting the file page itself in. Yours
> is a nice "with simple tools" solution.

There are many components that make up the address space of an
application, and there is no guarantee that the same block of
addresses will always be available to the application. A program that
depends on that particular feature of mmap() is extremely fragile and
can't be expected to work across upgrades of the software or any of
the libraries it depends on. That might be ok for hobby projects, but
I'd never ship such a beast to a customer.

/gordon

--
Mike Schilling - 02 Sep 2007 08:12 GMT
>> On some OS's you could have created a memory-mapped file at whatever
>> address you provided, which lets you both use absolute addresses and
[quoted text clipped - 8 lines]
> the libraries it depends on. That might be ok for hobby projects, but
> I'd never ship such a beast to a customer.

I'm not really familiar with mmap(); wouldn't it be possible to choose a
starting address well beyond the possible end address of the application
proper?  I was actually thinking of VMS, where the address could be in a
part of virtual memory that isn't used by the application at all.

In any case, if it's possible to allocate enough contiguous virtual memory
at some location, all that's needed is to adjust the stored addresses by the
difference [1], and you can still page the file in as needed.  If you're not
sure of contiguous memory, you effectively have the OP's solution of (chunk,
offset) pairs.

Though if you're doing this, it's more logical to store offsets to the start
of the file rather than addresses.
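
For what it's worth, the offsets-from-start-of-file approach maps directly onto Java's memory-mapped buffers. Below is a minimal sketch using FileChannel.map(): the first int in the file holds the offset of another int, and the reader follows that link. The file layout and the class name MappedOffsets are made up for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class MappedOffsets {
    // Writes a tiny sample file: a link at offset 0 pointing at a payload at offset 8.
    static File writeSample() throws IOException {
        File f = File.createTempFile("offsets", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.writeInt(8);    // offset 0: link, stored relative to the file start
            raf.writeInt(0);    // offset 4: unused filler
            raf.writeInt(1234); // offset 8: payload
        }
        return f;
    }

    // Maps the file and follows the link; no absolute addresses are involved.
    static int followLink(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            FileChannel ch = raf.getChannel();
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int target = map.getInt(0); // absolute get at a file-relative offset
            return map.getInt(target);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(followLink(writeSample())); // prints 1234
    }
}
```

Both RandomAccessFile and a freshly mapped buffer use big-endian order by default, which is why the sketch needs no explicit byte-order call.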
Martin Gregorie - 01 Sep 2007 13:40 GMT
> I understand Hunter's comments, and while I don't know much about
> ASN.1 encoding, what I am pointing out is that binary files are usually
> *not* intended to be used across systems.

I think its use is quite industry-dependent: I've never seen it used in
financial messaging (that's more likely to use SWIFT formats, which are
tagged text) but it's common in the telecommunications industry.

Telcos (both fixed line and mobile) use a lot of binary data for control
and accounting purposes, mainly because this minimizes message size and
there's a LOT of stuff flying around controlling the network in real
time and accounting for its use. Switches from large vendors, e.g.
Ericsson, tend to use proprietary, flat message formats but if the data
will be exchanged between different types of kit (e.g. roaming billing
data) they tend to use ASN.1: CCITT likes it.

ASN.1 has a lot in common with XML in that it's a tagged field protocol,
allows nesting, and uses a tag dictionary to associate meanings with
tags. Compared with XML it's a LOT more compact (tags are one byte, fixed
length fields don't have terminators, variable length fields are
preceded by a one or two byte length) and it has a number of predefined
field types as well as arrays. If you have the dictionary it's easy to
interpret on the fly though, like XML, you can also use the dictionary
to generate code to encode and decode ASN.1 records.
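
To give a feel for the tagged-field idea, here is a toy tag-length-value encoder and parser. It is deliberately not real ASN.1: it assumes a one-byte tag and a one-byte length of at most 127, with no nesting and no tag dictionary, and the class name Tlv is made up. A real system would use a proper ASN.1 library.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class Tlv {
    // Appends one field: tag byte, length byte, then the value bytes.
    static void put(ByteArrayOutputStream out, int tag, byte[] value) {
        if (tag < 0 || tag > 0xFF || value.length > 0x7F) {
            throw new IllegalArgumentException("tag or length out of range");
        }
        out.write(tag);
        out.write(value.length);
        out.write(value, 0, value.length);
    }

    // Walks the buffer field by field; unknown tags are simply kept as raw bytes.
    static Map<Integer, byte[]> parse(byte[] data) {
        Map<Integer, byte[]> fields = new LinkedHashMap<Integer, byte[]>();
        int pos = 0;
        while (pos < data.length) {
            if (pos + 2 > data.length) {
                throw new IllegalArgumentException("truncated field header");
            }
            int tag = data[pos] & 0xFF;
            int len = data[pos + 1] & 0xFF;
            pos += 2;
            if (pos + len > data.length) {
                throw new IllegalArgumentException("truncated field value");
            }
            fields.put(tag, Arrays.copyOfRange(data, pos, pos + len));
            pos += len;
        }
        return fields;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        put(out, 1, new byte[] {0, 0, 0, 42});
        put(out, 2, new byte[] {'h', 'i'});
        Map<Integer, byte[]> fields = parse(out.toByteArray());
        System.out.println(fields.size() + " fields, tag 2 has "
                + fields.get(2).length + " bytes");
    }
}
```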

> Every binary data file I have
> ever worked with was intended to be used either by the program that wrote
> it, or separate applications that used the same utility libraries as the
> application which wrote the data.

There's also a lot of binary data in large commercial systems. Formerly
it was in large serial files, then flat indexed files, now it's probably
in a database. A really good reason for using an RDBMS is that it not
only hides implementation details (like endian conventions) from the
application, but the interfaces (SQL, JDBC, ODBC, etc) typically provide
field conversion facilities.

> There is nothing wrong with simply writing
> the C structure to a file, and reading it in the same way.

I'd probably use a CSV format any place where a database would be
obvious overkill, but ymmv.

Using CSV rather than binary makes debugging easier and (said with his
*NIX hat on) it allows the data to be handled by common scripted
utilities like awk, perl and even shell scripts. Oh yeah, Java too :-)

Nigel Wade - 03 Sep 2007 10:11 GMT
>> If the OP *MUST* move binary data, at least do it in a platform and
>> language-independent manner and use ASN.1 encoding.
[quoted text clipped - 5 lines]
> it, or separate applications that used the same utility libraries as the
> application which wrote the data.  

Pretty much all scientific data I have worked with over the past 25 years has
been written in binary, and is intended to be read on just about any platform
you'd care to use. The basic principle behind being able to do this is writing
the binary data in a well structured form, in a reliable and portable way.

> There is nothing wrong with simply writing
> the C structure to a file, and reading it in the same way.  

There is everything wrong with this. This is the fundamental problem. The amount
of padding which is used internally within a struct is undefined by the
language - it is entirely up to the compiler developer. If you write a struct
in binary both the data *and the padding* will be output together, all
intermingled. Further, since the amount of padding is at the discretion of the
compiler writers they are free to change the amount they use in any release of
their compiler. So you could quite easily find that an upgrade to the compiler
causes your code, which you say is perfectly acceptable, to break even on the
same hardware and OS.

> In this case
> the code, and not some specification, drives the format of the data - and there
> is *nothing* wrong with this.  

Yes there is. Code which writes unspecified data to a binary file is bad code.
It will almost certainly break at some time in the future.

> The lack of a need to share the data outside of
> the application is what often drives the decision to use binary data in the
> first place (why not take advantage of the efficiency binary files have to
> offer).

But it is wise to know what is being written into your binary file so that you
can reliably read it back in. Otherwise it's reverse GIGO, it's GOGI - garbage
out, garbage in.

> Of course, every once in a while an outside user decides they want to use this
> data.  Well, then they have a choice.  Either generate it themselves, or
> spend a few hours writing something that can read it in - not a big price
> to pay.

But somewhat difficult if the original program's author didn't know what they
were writing into their binary files. I

Nigel Wade, System Administrator, Space Plasma Physics Group,
           University of Leicester, Leicester, LE1 7RH, UK
E-mail :    nmw@ion.le.ac.uk
Phone :     +44 (0)116 2523548, Fax : +44 (0)116 2523555

Lew - 03 Sep 2007 15:12 GMT
~kurt wrote:
>> There is nothing wrong with simply writing
>> the C structure to a file, and reading it in the same way.  

> There is everything wrong with this. This is the fundamental problem. The amount
> of padding which is used internally within a struct is undefined by the
[quoted text clipped - 5 lines]
> causes your code, which you say is perfectly acceptable, to break even on the
> same hardware and OS.

A point which has been made several times in this thread.

>> In this case
>> the code, and not some specification, drives the format of the data - and
> there
>> is *nothing* wrong with this.  

> Yes there is. Code which writes unspecified data to a binary file is bad code.
> It will almost certainly break at some time in the future.

Most emphatically.

>> The lack of a need to share the data outside of
>> the application is what often drives the decision to use binary data in the
>> first place (why not take advantage of the efficiency binary files have to
>> offer).

> But it is wise to know what is being written into your binary file so that you
> can reliably read it back in. Otherwise it's reverse GIGO, it's GOGI - garbage
> out, garbage in.

Another point which has been made several times in this thread, in various ways.

>> Of course, every once in a while an outside user decides they want to use this
>> data.  Well, then they have a choice.  Either generate it themselves, or
>> spend a few hours writing something that can read it in - not a big price
>> to pay.

> But somewhat difficult if the original program's author didn't know what they
> were writing into their binary files. I

Which is why we keep advising the OP (who seems to have lost interest in their
question) to determine exactly what format they're using, then to code to
that specification.  This point seems to have been lost repeatedly.

I would love for the OP to chime in and let us know that they've done this
step.  How 'bout it, Windsor.Locks?  Any luck with that analysis?  What did
you find?

Signature

Lew

Nigel Wade - 03 Sep 2007 17:08 GMT
> ~kurt wrote:
>>> There is nothing wrong with simply writing
[quoted text clipped - 11 lines]
>
> A point which has been made several times in this thread.

I know.

But certain posters in the thread still seem to be lacking the necessary clue.
If we continue to hit them again and again with the same clue-stick, the message
might eventually begin to sink in.

Maybe we need to introduce lines, write 1000 times (without using the cut-paste
buffer):  "I must not write C structs to binary files".

As to reading binary data, I prefer to use ByteBuffer to handle
big-/little-endian issues. Although it might not be particularly efficient for
reading large quantities of binary data it is convenient, reasonably
transparent, and it's part of the standard API so should always be available.
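
A minimal sketch of the ByteBuffer approach, assuming the file holds packed
12-byte records (4 + 2 + 2 + 4 bytes, no padding) written little-endian. Nobody
in the thread has confirmed that layout, so treat it as a guess to be checked
against the real format. The names RecordDecoder and decode are made up for
illustration, and the unsigned values are widened with masks because Java has
no unsigned types.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class RecordDecoder {
    // Hypothetical holder for one record; field names follow the thread.
    static final class SomeData {
        long data1;  // unsigned 32-bit value, widened to long
        int data2;   // unsigned 16-bit value, widened to int
        int data3;   // unsigned 16-bit value, widened to int
        long data4;  // unsigned 32-bit value, widened to long
    }

    // Decodes one packed 12-byte record using the given byte order.
    static SomeData decode(byte[] raw, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(order);
        SomeData d = new SomeData();
        d.data1 = buf.getInt() & 0xFFFFFFFFL;  // mask strips the sign extension
        d.data2 = buf.getShort() & 0xFFFF;
        d.data3 = buf.getShort() & 0xFFFF;
        d.data4 = buf.getInt() & 0xFFFFFFFFL;
        return d;
    }

    public static void main(String[] args) {
        byte[] raw = {
            (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, // data1
            0x01, 0x00,                                         // data2
            0x00, (byte) 0x80,                                  // data3
            0x02, 0x00, 0x00, 0x00                              // data4
        };
        SomeData d = decode(raw, ByteOrder.LITTLE_ENDIAN);
        // prints: 4294967295 1 32768 2
        System.out.println(d.data1 + " " + d.data2 + " " + d.data3 + " " + d.data4);
    }
}
```

If the data turns out to be big-endian, passing ByteOrder.BIG_ENDIAN is the only
change needed; padding, if any, would have to be skipped explicitly with
buf.position(...).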

Signature

Nigel Wade, System Administrator, Space Plasma Physics Group,
           University of Leicester, Leicester, LE1 7RH, UK
E-mail :    nmw@ion.le.ac.uk
Phone :     +44 (0)116 2523548, Fax : +44 (0)116 2523555

Mike Schilling - 29 Aug 2007 21:45 GMT
> I am a C++ programmer, working on a java program. I need to read a
> binary file using Java.
[quoted text clipped - 17 lines]
> Please give me some pointers, how do i read this using Java? Thanks.
> BTW, those are not the variable names I use in my program.

Java doesn't allow you to read into (or write from) a structure this way.
Say you create a Java class:

class SomeData
{
long data1;
short data2;
short data3;
long data4;
}

Unlike in C or C++, there's really no defined order for the fields, and thus
no way to issue one read that fills all of them.  You need to read into each
one individually.  See java.io.DataInputStream for how to do this.
Hunter Gratzner - 29 Aug 2007 22:06 GMT
On Aug 29, 9:52 pm, Windsor.Lo...@gmail.com wrote:
> I am a C++ programmer, working on a java program. I need to read a
> binary file using Java.
[quoted text clipped - 15 lines]
>
> fread(&someData, 12, 1, inputFile);

This is already a stupid idea in C++, since there is no guarantee that
sizeof(SOME_DATA) == 12. Since this is a Java group I'd like to
recommend that you consult some C++ resource regarding struct
alignment and padding, data type size, and (network) byte order.

In Java (assuming you have fixed your C++ problem), one would read this
e.g. with a DataInputStream:

import java.io.DataInputStream;
import java.io.IOException;
import java.math.BigInteger;

/*
* Read data using network byte-order, aka big-endian
* byte-order (MSB first), and no padding/alignment
* between the data.
*/
class Data {
  /*
   * Note, Java has no unsigned data types.
   * Therefore in this example I store the unsigned short
   * in a (signed) int, and the unsigned long in a BigInteger
   * Typically, in a carefully designed application this
   * can be avoided, but I do it here to avoid discussion
   * of using signed data types to handle unsigned types.
   */
  private BigInteger data1; // data format: unsigned long, 32 bits
  private int data2;        // data format: unsigned short, 16 bits
  private int data3;        // data format: unsigned short, 16 bits
  private BigInteger data4; // data format: unsigned long, 32 bits

  public void read(DataInputStream in) throws IOException {
     byte ulong2big[] = new byte[5];
     ulong2big[0] = 0;  // ensure MSB is always zero, so
                        // we get an unsigned interpretation
                        // of the following 4 byte data
                        // when converting the array to a
                        // BigInteger

     // Read four bytes and convert them to a BigInteger
     // In carefully designed applications a
     // data1 = in.readInt() would do.
     in.readFully(ulong2big, 1, 4); // throws EOFException on a short read
     data1 = new BigInteger(ulong2big);

     // Read the unsigned short into an int
     data2 = in.readUnsignedShort();
     // in.skipBytes(...) in case padding needs to be skipped

     data3 = in.readUnsignedShort();
     // in.skipBytes(...) in case padding needs to be skipped

     // Read four bytes and convert them to a BigInteger
     // In carefully designed applications a
     // data4 = in.readInt() would do.
     in.readFully(ulong2big, 1, 4); // throws EOFException on a short read
     data4 = new BigInteger(ulong2big);
 }
}
Roedy Green - 29 Aug 2007 22:09 GMT
>I am a C++ programmer, working on a java program. I need to read a
>binary file using Java.

see http://mindprod.com/applet/fileio.html

It will show you how to read big and little endian binary data.
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Roedy Green - 29 Aug 2007 22:11 GMT
On Wed, 29 Aug 2007 21:09:01 GMT, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>see http://mindprod.com/applet/fileio.html
>
>It will show you how to read big and little endian binary data.

If you are trying to slew through records reading only a field or two
per long record, try nio.  See http://mindprod.com/jgloss/nio.html
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Windsor.Locks@gmail.com - 30 Aug 2007 00:47 GMT
On Aug 29, 2:52 pm, Windsor.Lo...@gmail.com wrote:
> I am a C++ programmer, working on a java program. I need to read a
> binary file using Java.
[quoted text clipped - 18 lines]
> Please give me some pointers, how do i read this using Java? Thanks.
> BTW, those are not the variable names I use in my program.

Thank you to all who tried to help. I got it working and in the
interest of future programmers here is how I did it.

Of course this is my crappy program with crappy variable names etc,
which I am going to rewrite. Also, the
arr2long function is from here

http://www.captain.at/howto-java-convert-binary-data.php

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class Convert {

        public static void main(String [] args) {

            int crap = 0, doublecrap = 0, counter = 0;

            try {
                String file =  "/opt/workspace/blahblah/binary.file";
                      FileInputStream fis = new FileInputStream(file);
                DataInputStream dis = new DataInputStream(fis);

               int numberBytes = 4;
               byte data1[] = new byte[numberBytes];
               byte data2 [] = new byte[2];
               byte data3 [] = new byte[2];
               byte data4 [] = new byte[numberBytes];

               while (true) {

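                   // read() returns -1 at a clean end of file
                   int retval = dis.read();
                   if(retval == -1)
                       break;
                   data1[0] = (byte) retval;

                   // readFully() fills the whole array or throws
                   // EOFException on a truncated record
                   dis.readFully(data1, 1, numberBytes - 1);
                   dis.readFully(data2);
                   dis.readFully(data3);
                   dis.readFully(data4);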

                   long stuff = arr2long(data1, 0);
                   long stuff1 = arr2long(data4, 0);
                   System.out.println(stuff + " : " + stuff1);
                   counter ++;

               }

                dis.close();   // also closes the underlying FileInputStream
            }
            catch (IOException ioex) {
                System.err.println("Read failed: " + ioex);
            }
            finally {
                System.out.println("number of records read : " + counter);
            }
        }

        public static long arr2long (byte[] arr, int start) {
            int i = 0;
            int len = 4;
            int cnt = 0;
            byte[] tmp = new byte[len];
            for (i = start; i < (start + len); i++) {
                tmp[cnt] = arr[i];
                cnt++;
            }
            long accum = 0;
            i = 0;
            for ( int shiftBy = 0; shiftBy < 32; shiftBy += 8 ) {
                accum |= ( (long)( tmp[i] & 0xff ) ) << shiftBy;
                i++;
            }
            return accum;
        }
}
Lew - 30 Aug 2007 02:25 GMT
>> Here is how I read it in C++,
>>
[quoted text clipped - 14 lines]
>>
>> Please give me some pointers, how do i read this using Java? Thanks.
...
> public class Convert {
>
>         public static void main(String [] args) {
>
>             int crap = 0, doublecrap = 0, counter = 0;

etc.
>         }
> }

java.nio.ByteOrder will help you if you use the java.nio package as Roedy
suggested.

Please do not embed TABs in Usenet posts; it really fubars the alignment.

Signature

Lew

Roedy Green - 30 Aug 2007 04:15 GMT
>Thank you for all who tried to help. I got it working and in the
>interest of future programmers here is how I did it.

You are trying to read little-endian data.  It is a lot easier with
LEDatastream.

float f = dis.readFloat();
double d = dis.readDouble();
int i = dis.readInt();
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Windsor.Locks@gmail.com - 30 Aug 2007 15:26 GMT
On Aug 29, 10:15 pm, Roedy Green <see_webs...@mindprod.com.invalid>
wrote:
> On Wed, 29 Aug 2007 16:47:00 -0700, Windsor.Lo...@gmail.com wrote,
> quoted or indirectly quoted someone who said :
[quoted text clipped - 11 lines]
> Roedy Green Canadian Mind Products
> The Java Glossaryhttp://mindprod.com

Well, that actually does not work. See the reply above by "shakah"
shakah - 30 Aug 2007 16:22 GMT
On Aug 30, 10:26 am, Windsor.Lo...@gmail.com wrote:
> On Aug 29, 10:15 pm, Roedy Green <see_webs...@mindprod.com.invalid>
> wrote:
[quoted text clipped - 16 lines]
>
> Well, that actually does not work. See the reply above by "shakah"

He's suggesting you use his "little-endian DataInputStream" class,
where I'm guessing it would work:
 http://mindprod.com/jgloss/ledatinputstream.html
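
If pulling in an extra class is not an option, roughly the same effect can be
had with the standard library alone: DataInputStream always reads big-endian,
so each value can be read and then byte-swapped. This is only a sketch, assuming
the fields are little-endian and unsigned; LittleEndianRead and the two helper
names are made up for illustration and are not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class LittleEndianRead {
    // Reads a little-endian unsigned 32-bit value and widens it to long.
    static long readUInt32LE(DataInputStream in) throws IOException {
        return Integer.reverseBytes(in.readInt()) & 0xFFFFFFFFL;
    }

    // Reads a little-endian unsigned 16-bit value and widens it to int.
    static int readUInt16LE(DataInputStream in) throws IOException {
        return Short.reverseBytes(in.readShort()) & 0xFFFF;
    }

    public static void main(String[] args) throws IOException {
        // 0x12345678 and 0x1234, least significant byte first
        byte[] raw = { 0x78, 0x56, 0x34, 0x12, 0x34, 0x12 };
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
        System.out.println(Long.toHexString(readUInt32LE(in)));    // prints 12345678
        System.out.println(Integer.toHexString(readUInt16LE(in))); // prints 1234
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by a
FileInputStream (ideally wrapped in a BufferedInputStream).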
DRS.Usenet@sengsational.com - 31 Aug 2007 17:15 GMT
I'm not sure if this is the same issue, but I'm trying to interpret
numeric values out of a chunk of data as follows:

int    toBinary    theValue
124    1111100     3.8
63     111111      4
224    11100000    4.8
63     111111      4
63     111111      4
224    11100000    4.8
64     1000000     3.2
63     111111      4
244    11110100    5
124    1111100     3.8

I can read "int" out of my blob of data, and I ran toBinaryString on
it just to visualize it.  I manually typed "theValue" (that is what I
KNOW the test data is).  Can someone help me figure out what code to
run in order to get "theValue"?

--Dale--
Roedy Green - 01 Sep 2007 05:01 GMT
On Fri, 31 Aug 2007 09:15:55 -0700, "DRS.Usenet@sengsational.com"
<DRS.Usenet@sengsational.com> wrote, quoted or indirectly quoted
someone who said :

>int      toBinary theValue
>124    1111100    3.8
[quoted text clipped - 12 lines]
>KNOW the test data is).  Can someone help me figure out what code to
>run in order to get "theValue"?

If you get enough samples you can create a
private static final double[] translate = new double[256];
to do the translation for you.
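
A rough sketch of such a table, filled only with the five distinct samples
posted above. Every other slot is left as NaN so that an unmapped byte is
obvious instead of silently reading as 0.0. The class name ByteTranslator and
the decode helper are made up for illustration, and the real encoding is still
unknown.

```java
import java.util.Arrays;

public class ByteTranslator {
    private static final double[] translate = new double[256];

    static {
        // Mark every entry as unknown first
        Arrays.fill(translate, Double.NaN);
        // Samples taken from the posted data
        translate[124] = 3.8;
        translate[63]  = 4.0;
        translate[224] = 4.8;
        translate[64]  = 3.2;
        translate[244] = 5.0;
    }

    // Looks up the value for one raw byte (0..255); NaN means "no sample yet".
    static double decode(int raw) {
        return translate[raw & 0xFF];
    }

    public static void main(String[] args) {
        int[] samples = { 124, 63, 224, 64, 244, 0 };
        for (int s : samples) {
            System.out.println(s + " -> " + decode(s));
        }
    }
}
```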

In what context did you see this code?  It looks like it might be some
sort of sound encoding technique.  You can read up the specs on the
encoding.

see http://mindprod.com/jgloss/sound.html to help get you started.

It might also be some sort of Huffman encoding. See
http://mindprod.com/jgloss/huffman.html
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com


