Java Forum / General / August 2007
Explanation needed of binary operators
NoNeYa - 26 Jul 2007 06:03 GMT Howdy folks, I am trying to read an int from a file written in binary. After researching that archaic file structure, I have found that it is stored as little endian/ least significant first. I have seen code to read this as an int in Java but don't understand what is happening. I guess that I just don't understand what is going on with the shifting of bits and "anding" with other values. Could someone please explain this in detail? Or lead me to a good source for in-depth reading? I don't just want code... I want to understand. If this seems trivial, please excuse me... I'm only a second semester Computer Science major who likes to keep his brain busy during the summer off and would like to make a new front-end for an old program that I use at work.
Thanks!
Twisted - 26 Jul 2007 09:22 GMT > Howdy folks, > I am trying to read an int from a file written in binary. After [quoted text clipped - 10 lines] > > Thanks! Endianness is a tricky matter. A 16-bit int provides a simple example because there's only two ways around it would normally go:
highbyte lowbyte
or
lowbyte highbyte
To reconstruct it you only need to read into byte variables named "highByte" and "lowByte", the correct one into each, and then
short result = (short) (((highByte & 0xFF) << 8) | (lowByte & 0xFF));
and Bob's your uncle. The high byte, shifted left 8 bits (that is, times 256), bitwise-OR'd with the low byte is correct. (The & 0xFF masks matter because the byte type is signed in Java, though not in C with e.g. "typedef unsigned char byte;". Without them, a low byte of 0x80 or above is sign-extended when it is widened, and either adding it or ORing it in corrupts the high byte. The cast back to short is needed because Java does the arithmetic in int.)
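A minimal Java sketch of the two-byte case, for anyone who wants to try it. The class and method names are made up for illustration; the only assumption is that the two bytes have already been read into byte variables.

```java
final class ShortFromBytes {
    /** Combines a high byte and a low byte into one 16-bit value. */
    static short combine(byte highByte, byte lowByte) {
        // & 0xFF widens each byte to an int in the range 0..255,
        // dropping the sign-extension bits Java would otherwise add.
        int high = highByte & 0xFF;
        int low = lowByte & 0xFF;
        // << 8 is the same as multiplying by 256; | merges the two halves.
        return (short) ((high << 8) | low);
    }

    public static void main(String[] args) {
        // The low byte 0xF0 is negative when held in a Java byte.
        short value = combine((byte) 0x12, (byte) 0xF0);
        System.out.println(Integer.toHexString(value & 0xFFFF)); // prints 12f0
    }
}
```

For a little-endian file the first byte read is the low byte, so it goes in as the second argument.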
With Java ints and C longs (32 bits) you've got 24 possible orderings of the four bytes, because the first can be in any of four places, the second in any of the remaining three, and the third in either of the remaining two, before the fourth is forced into the only remaining place -- 4*3*2 is 24. In practice, the orders you usually see are two little-endian shorts or two big-endian shorts, with the high short either first or last (so two independent endian choices and at most four common byte-orderings).
Any order whatsoever can be dealt with by getting byte1, byte2, byte3, and byte4 to refer to the LSB, next least significant, and so forth reading them in whichever order they occur in the data stream (so you might read byte3 first, depending on the byte order in the stream). Then left shifting and oring:
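int result = ((byte4 & 0xFF) << 24) | ((byte3 & 0xFF) << 16) | ((byte2 & 0xFF) << 8) | (byte1 & 0xFF);
Note the single | (bitwise OR); || is the logical OR and in Java only accepts booleans. This works as long as sign extension is kept out of it: Java sign-extends a byte when it widens it to an int, and the & 0xFF masks undo that. (Java has no <<< operator. Its unsigned shift is >>>, a right shift that fills with zeros; C and C++ have no such operator.)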
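Here is a small runnable sketch of the four-byte version in Java. The class and method names are invented for the example, and it assumes the four bytes have already been sorted out by significance (byte1 least significant, byte4 most).

```java
final class IntFromBytes {
    /**
     * Builds an int from four bytes, where byte1 is the least significant
     * byte and byte4 the most significant, whatever order they were read in.
     */
    static int combine(byte byte4, byte byte3, byte byte2, byte byte1) {
        // Mask each byte to 0..255 first, then shift it into its slot.
        return ((byte4 & 0xFF) << 24)
             | ((byte3 & 0xFF) << 16)
             | ((byte2 & 0xFF) << 8)
             |  (byte1 & 0xFF);
    }

    public static void main(String[] args) {
        // Every one of these bytes is negative as a Java byte.
        int value = combine((byte) 0x89, (byte) 0xAB, (byte) 0xCD, (byte) 0xEF);
        System.out.println(Integer.toHexString(value)); // prints 89abcdef
    }
}
```

Dropping any of the masks on byte1 to byte3 would let a byte of 0x80 or above fill everything to its left with one bits.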
The basic explanation is that you have 32 bits in a line. The first eight are the high byte, and the last eight are the low byte of the int. (Java int here; C/C++ users must use long to ensure having 32 bits. Java long is always 64 bits, more than you need here.)
The high byte is cast to an int, which makes it an int with the eight bits we're interested in sitting in the last eight positions. We need them in the first eight, and the <<24 shifts them left 24, so the 7th bit (the 7th back from the end, or first of the interesting eight) is shifted to become bit 7+24=31, or the leftmost (as there are only 31 bits left of the last, or zeroth, bit in an int). So the shift moves the eight interesting bits into the top eight. The shifts on byte3 and byte2 make the bits in them move to the middle positions. The last one isn't shifted and stays in the lowest position. So after the shifts but before the bitwise ORs, we have changed say
file1: ZWXY
into
byte1: ...X (. = eight zero bits) byte2: ...Y byte3: ...Z byte4: ...W
(by reading byte3, byte4, byte1, and byte2 in that order)
into
temp1: ...X temp2: ..Y. temp3: .Z.. temp4: W...
and now the bitwise OR operations just combine them by copying the nonzero bits into the result at the same place, so Z or . is Z, . or W is W, etc. and we get:
result: WZYX
with the correct byte order unscrambled from the file's ZWXY order.
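The ZWXY walk-through can be followed step by step in Java. This is only a sketch: the class name, the method name and the sample byte values are invented, and the file order (byte3, byte4, byte1, byte2) is the one from the example above.

```java
final class ScrambledOrder {
    /** Decodes four bytes stored in the order byte3, byte4, byte1, byte2. */
    static int decode(byte[] file) {
        // Match each stream position to its significance.
        int byte3 = file[0] & 0xFF; // Z
        int byte4 = file[1] & 0xFF; // W
        int byte1 = file[2] & 0xFF; // X
        int byte2 = file[3] & 0xFF; // Y

        // Shift each byte into its final position.
        int temp1 = byte1;          // ...X
        int temp2 = byte2 << 8;     // ..Y.
        int temp3 = byte3 << 16;    // .Z..
        int temp4 = byte4 << 24;    // W...

        // The non-zero parts never overlap, so OR just merges them.
        return temp4 | temp3 | temp2 | temp1; // WZYX
    }

    public static void main(String[] args) {
        byte[] file = {0x22, 0x11, 0x44, 0x33}; // Z W X Y
        System.out.println(Integer.toHexString(decode(file))); // prints 11223344
    }
}
```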
Mike Schilling - 18 Aug 2007 01:32 GMT > With Java ints and C longs (32 bits) you've got 24 possible orderings > of the four bytes, because the first can be in any of four places, the [quoted text clipped - 4 lines] > either first or last (so two independent endian choices and at most > four common byte-orderings). In fact, you're very unlikely to see anything other than strict little-endian (LSB-B2-B3-MSB) or strict big-endian (MSB-B3-B2-LSB). The only exception I've ever seen was the ordering used by the PDP-11 floating-point-processor [1], which was B3-MSB-LSB-B2.
1. Which could process both floats and 32-bit integers.
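For the two strict orderings, Java's standard library can do the unscrambling for you. The sketch below uses java.nio.ByteBuffer and ByteOrder, which are real JDK classes; the wrapper class, the method name and the sample bytes are just illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderDemo {
    /** Interprets the first four bytes of the array in the given byte order. */
    static int readInt(byte[] fourBytes, ByteOrder order) {
        return ByteBuffer.wrap(fourBytes).order(order).getInt();
    }

    public static void main(String[] args) {
        byte[] raw = {0x78, 0x56, 0x34, 0x12};
        // LSB-B2-B3-MSB: the first byte in the stream is the least significant.
        System.out.println(Integer.toHexString(readInt(raw, ByteOrder.LITTLE_ENDIAN))); // prints 12345678
        // MSB-B3-B2-LSB: the first byte in the stream is the most significant.
        System.out.println(Integer.toHexString(readInt(raw, ByteOrder.BIG_ENDIAN)));    // prints 78563412
    }
}
```

A mixed order such as the PDP-11 one still has to be handled by hand, since ByteOrder only knows the two strict cases.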
Ben Phillips - 18 Aug 2007 02:29 GMT >>With Java ints and C longs (32 bits) you've got 24 possible orderings >>of the four bytes, because the first can be in any of four places, the [quoted text clipped - 9 lines] > exception I've ever seen was the ordering used by the PDP-11 > floating-point-processor [1], which was B3-MSB-LSB-B2. That's the third of the four Twisted mentioned, the other being B2-LSB-MSB-B3.
I can't recall ever seeing a byte sex other than one of those three myself, and only the two strict-endian ones seem to be used in any modern PC or server hardware architectures.
OTOH I can recall a proliferation of very incompatible systems back in the good old days -- 9- and 10-bit bytes, 7-bit bytes, even 6-bit bytes and binary-coded decimal (yuck!!), and character orderings for the basic A-Z stuff other than ASCII (EBCDIC, notably) or various bastardized forms of almost-ASCII. (Pop quiz -- which popular system's pseudo-ASCII had no {}, rearranged !@#$%^&*(), had an actual up-arrow symbol for ^, had a £ symbol in the low 127, and had control characters that represented colours? It actually let you type these in, mostly with shift-number or other-modifier-key-number.)
These days we have it *easy*, with big-endian and little-endian and cr/lf/crlf as the only two spots of low level data conversion awkwardness. At least a byte is a byte is a byte is eight bits long and character 65 (0x41; 0101; 01000001) is always 'A'! :)
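The different spellings of character 65 are easy to check in Java. A tiny sketch, with an invented class and method name; the Integer conversion methods are standard JDK calls.

```java
final class LetterA {
    /** Returns the code of c in decimal, hex, octal and binary notation. */
    static String inBases(char c) {
        int code = c;
        return code
            + " 0x" + Integer.toHexString(code)
            + " 0" + Integer.toOctalString(code)
            + " 0b" + Integer.toBinaryString(code);
    }

    public static void main(String[] args) {
        System.out.println(inBases('A')); // prints 65 0x41 0101 0b1000001
        // The same value written as Java literals in all four bases.
        System.out.println(65 == 0x41 && 65 == 0101 && 65 == 0b01000001); // prints true
    }
}
```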
(Imagine trying to write, edit, or otherwise work with C source on a system with no {} characters! It's probably no coincidence the systems with {} missing were mainly programmed in assembly, or sometimes in something icky like BASIC, and absolutely never in anything portable.)
Mike Schilling - 18 Aug 2007 02:55 GMT > (Imagine trying to write, edit, or otherwise work with C source on a > system with no {} characters! It's probably no coincidence the systems > with {} missing were mainly programmed in assembly, or sometimes in > something icky like BASIC, and absolutely never in anything portable.) Many European keyboards lacked both square and curly brackets, leading to the use of digraphs. See http://david.tribble.com/text/cdiffs.htm#C90-digraph. And yes, that's another horror we no longer contend with.
Real Gagnon - 18 Aug 2007 03:03 GMT Ben Phillips <b.phillips@a5723mailhost.net> wrote in news:fa5i4e$mpk$1 @aioe.org:
> (Pop quiz -- which popular system's pseudo-ASCII > had no {}, rearranged !@#$%^&*(), had an actual up-arrow symbol for ^, > had a £ symbol in the low 127, and had control characters that > represented colours? It actually let you type these in, mostly with > shift-number or other-modifier-key-number.) Looks like the Sinclair ZX Spectrum to me!
Bye!
 Signature Real Gagnon from Quebec, Canada * Java, Javascript, VBScript and PowerBuilder code snippets * http://www.rgagnon.com/howto.html * http://www.rgagnon.com/bigindex.html
Arne Vajhøj - 18 Aug 2007 03:07 GMT > These days we have it *easy*, with big-endian and little-endian and > cr/lf/crlf as the only two spots of low level data conversion > awkwardness. At least a byte is a byte is a byte is eight bits long and > character 65 (0x41; 0101; 01000001) is always 'A'! :) In the blue world EBCDIC is still used.
Arne
Ben Phillips - 18 Aug 2007 13:00 GMT >> These days we have it *easy*, with big-endian and little-endian and >> cr/lf/crlf as the only two spots of low level data conversion >> awkwardness. At least a byte is a byte is a byte is eight bits long >> and character 65 (0x41; 0101; 01000001) is always 'A'! :) > > In the blue world EBCDIC is still used. Meanwhile, on Earth ... :)
Martin Gregorie - 18 Aug 2007 16:11 GMT >> These days we have it *easy*, with big-endian and little-endian and >> cr/lf/crlf as the only two spots of low level data conversion >> awkwardness. At least a byte is a byte is a byte is eight bits long >> and character 65 (0x41; 0101; 01000001) is always 'A'! :) > > In the blue world EBCDIC is still used. which, on the AS/400 at least, lacked {} and used trigraphs instead.
Trivia: the reason the character ordering in EBCDIC is such a mess is that the encodings are binary representations of the IBM 029 card punch's hole patterns. That's why you get the odd gaps between I and J and between R and S. This was extended to allow for lower case. And no, I have no idea why 0-9 are F0-F9 rather than 00-09.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Roedy Green - 18 Aug 2007 23:25 GMT On Sat, 18 Aug 2007 16:11:45 +0100, Martin Gregorie <martin@see.sig.for.address> wrote, quoted or indirectly quoted someone who said :
>I have no idea why 0-9 are F0-F9 rather than 00-09. ASCII has the same strangeness. 0-9 are 30-39.
I suspect the reason was pedantry. It would be even harder to get students to understand the difference between the number 0 and the character 0 if they had the same binary representation.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Patricia Shanahan - 18 Aug 2007 23:39 GMT > On Sat, 18 Aug 2007 16:11:45 +0100, Martin Gregorie > <martin@see.sig.for.address> wrote, quoted or indirectly quoted [quoted text clipped - 8 lines] > students to understand the difference between the number 0 and the > character 0 if they had the same binary representation. For ASCII, I think there was some deference to paper tape mechanics. Treating no holes as NUL allows records to be separated by blocks of unpunched tape. Treating all holes as DEL allows anything to be overpunched into being a DEL.
Patricia
Martin Gregorie - 19 Aug 2007 17:31 GMT >> On Sat, 18 Aug 2007 16:11:45 +0100, Martin Gregorie >> <martin@see.sig.for.address> wrote, quoted or indirectly quoted [quoted text clipped - 13 lines] > unpunched tape. Treating all holes as DEL allows anything to be > overpunched into being a DEL. As a long lapsed user of the Flexowriter and the ASR-33 teletype, not to mention the manual 8 hole paper tape punch this is exactly right.
FWIW my original wonderment at non use of 00-09 was because AFAICR everybody and everything except IBM's EBCDIC sorts numerics before alphabetics and, unless I've confused what little history I ever knew, always has done it that way since Adam were a lad.
Given that EBCDIC puts capitals in zones 1-3 and lower case in zones 4-6, the only sensible place to put numerics is zone 0 because that would preserve a natural sort order. I don't agree that this would cause confusion over 00. A completely blank card column meant 'space' and a zero was a single hole punched in the zero row.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
John W. Kennedy - 21 Aug 2007 03:00 GMT > FWIW my original wonderment at non use of 00-09 was because AFAICR > everybody and everything except IBM's EBCDIC sorts numerics before > alphabetics and, unless I've confused what little history I ever knew, > always has done it that way since Adam were a lad. No, IBM equipment generally collated numerics after alphabetics long before EBCDIC, even on machines, such as the 1401, where the binary representation was the other way. EBCDIC was designed specifically so as to continue this behavior.
There are many ways of collating. For one dramatic example, US telephone directories traditionally collate space after alphanumerics, so that AAA comes before AA.
 Signature John W. Kennedy "Information is light. Information, in itself, about anything, is light." -- Tom Stoppard. "Night and Day"
Martin Gregorie - 21 Aug 2007 11:51 GMT > No, IBM equipment generally collated numerics after alphabetics long > before EBCDIC, even on machines, such as the 1401, where the binary > representation was the other way. EBCDIC was designed specifically so as > to continue this behavior. Thanks for that correction. I came in via ICL kit in the late 60s, when S/360 had largely replaced the 1400, so I never understood EBCDIC until the ICL 2900 (which used EBCDIC) replaced the 1900 around 1980. ICL 1900 mainframes used the 6 bit ISO alternate character set. This sorted in the order space, numeric, alphabetic.
> There are many ways of collating. For one dramatic example, US telephone > directories traditionally collate space after alphanumerics, so that AAA > comes before AA. 6 bit ISO, as the 1900 used it, had two shifts (IIRC you used the SI and SO characters to switch between them), so a sort had to really jump through hoops if you were using mixed case keys and mixed case lookups were a real horror which, thankfully, I managed to avoid.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Stefan Ram - 18 Aug 2007 23:41 GMT >I suspect the reason was pedantry. It would be even harder to >get students to understand the difference between the number 0 >and the character 0 if they had the same binary representation. On teletypes, one could get certain effects with the bit patterns »NUL« (»0000000«) and »DEL« (»1111111«).
DEL will punch all-holes, so it will erase any information. When the motor starts, the first characters sent might be lost, so sending some NULs at the start of a transmission will give the motor time to start.
This dictated that the blocks containing those bit patterns had to be control blocks in X3.4-1963.
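Since this thread started with binary operators: overpunching is a physical bitwise OR, because a punch can add holes but never remove them. A small Java sketch of that idea follows; the class, constants and method are invented for illustration and model a 7-bit row only.

```java
final class PaperTape {
    static final int NUL = 0x00; // no holes punched
    static final int DEL = 0x7F; // all seven holes punched

    /** Punching over an existing row can only add holes, i.e. set bits. */
    static int overpunch(int existing, int punched) {
        return (existing | punched) & 0x7F;
    }

    public static void main(String[] args) {
        // Whatever was on the tape, punching DEL over it leaves DEL.
        System.out.println(overpunch('A', DEL)); // prints 127
        // Blank tape (NUL) takes on whatever is punched into it.
        System.out.println(overpunch(NUL, 'A')); // prints 65
    }
}
```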
I am working on a German language page about X3.4-1963:
http://www.purl.org/stefan_ram/pub/ascii_1963_de
Roedy Green - 19 Aug 2007 00:18 GMT > I am working on a German language page about X3.4-1963: So that was when ASCII came out. I recall talking with Vern Dettwiler (who later founded MacDonald Dettwiler) about what character set we should use for the new IBM 7044. Back then each university devised its own character set. I remember him talking about some newfangled 7-bit code called ASCII. He was devising our 6-bit code to be as compatible as possible with it.
Back then I was using 4 and 6 bit paper tape. Punch cards were mostly 1 or 2 holes of a possible 12. Later I used TTYs. I forget how many holes wide their tape was, though I certainly remember editing programs with paper tape, where you would copy up the error, type the correction, manually space over the error, and resume the copy at perhaps the blinding speed of 15 cps. It seems amazing what I was able to accomplish with such primitive editing tools.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Martin Gregorie - 19 Aug 2007 17:38 GMT > Back then I was using 4 and 6 bit paper tape. Punch cards were mostly > 1 or 2 holes of a possible 12. Later I used TTYs. I forget how many [quoted text clipped - 3 lines] > perhaps the blinding speed of 15 cps. It seems amazing what I was > able to accomplish with such primitive editing tools. I always liked paper tape. It was less bulky than cards and you didn't need to find a card sorter or spend hours rebuilding the deck if you dropped it. Tangles? Just throw the tape out a top floor window or down the stair well (remembering to keep a grip on one end) and rewind it.
The only advantages of cards were that they were great for shopping lists and you could make a neat glider from two cards and a pencil.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Mike Schilling - 19 Aug 2007 18:13 GMT > I always liked paper tape. It was less bulky than cards and you didn't > need to find a card sorter or spend hours rebuilding the deck if you [quoted text clipped - 3 lines] > The only advantages of cards were that they were great for shopping > lists and you could make a neat glider from two cards and a pencil. If you throw a card deck from a high window, it becomes nice (if oversized) confetti.
Martin Gregorie - 19 Aug 2007 21:17 GMT > If you throw a card deck from a high window, it becomes nice (if oversized) > confetti. Quite.
And the sorter was no use unless you put sequence numbers on the deck and maintained it as well. That's why the original COBOL spec had a 6 digit sequence number at the start of every line.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Patricia Shanahan - 19 Aug 2007 18:19 GMT >> Back then I was using 4 and 6 bit paper tape. Punch cards were mostly >> 1 or 2 holes of a possible 12. Later I used TTYs. I forget how many [quoted text clipped - 11 lines] > The only advantages of cards were that they were great for shopping > lists and you could make a neat glider from two cards and a pencil. There were some other advantages:
1. Content printing on each card. I could never, even when I was handling paper tape a lot, read ASCII codes as fast as I could read printed text.
2. Ease of changes in the middle of a file. The two procedures for tape were the one Roedy described above, and physical cut-and-splice. Splicing increased the risk of mechanical problems. Contrast that with inserting and removing cards in the middle of a card deck.
Patricia
Martin Gregorie - 19 Aug 2007 22:00 GMT >> The only advantages of cards were that they were great for shopping >> lists and you could make a neat glider from two cards and a pencil. I forgot a third: the chads made great, if itchy confetti.
And a fourth: card correction by pushing chad(s) into holes before punching new ones with a 12 key hand punch. This only worked if, like us, you used optical card readers that didn't flex the cards.
> There were some other advantages: > > 1. Content printing on each card. I could never, even when I was > handling paper tape a lot, read ASCII codes as fast as I could read > printed text. I used to be able to read enough (newline, tab, space, numbers) to find the right place on a tape.
As regards cards: our programmer's standby, the 12 key hand punch, didn't print, so I learnt to read card codes at a good rate. Later we were given printing hand punches but they were like a Dymo tape punch: you had to dial the character and then hit the PUNCH bar to punch a column. They were slow as hell: we hated them and used the old 12 key punches by preference. I wish I'd had the sense to liberate one of the 12 key punches when they were phased out. They were marvelous Victorian engineering: the best ones had cast iron bodies with a riveted-on brass name plate saying "British Tabulating Machine Company". Their punches never got blunt or jammed and they never wore out.
> 2. Ease of changes in the middle of a file. The two procedures for tape > were the one Roedy described above, and physical cut-and-splice. > Splicing increased the risk of mechanical problems. Not if done right. I only used tape in anger at University to write Algol 60 for an Elliott 503, the only machine I know that was faster at floating point than integer arithmetic. Very appropriate seeing that it was a scientific machine. But I digress....
We used to leave a foot or so of runout between procedure declarations and in other suitable places, so we never had to copy & edit more than a few feet of tape and splices never overlapped punched tape. IIRC we used thin plastic heat-seal splicing tape. I don't remember having failed splices or tape wrecks due to splices.
> Contrast that with > inserting and removing cards in the middle of a card deck. Actually, we only used a large program pack once and then slung them because, even in 1968, we kept all program source on tape. Once a source had been loaded we used small decks to edit the source on tape. The programmer's overnight run started with a batch edit run that did everybody's edits. This was followed by a batch compile. After that individual test shots were run from the tape holding the compiled programs. That was on an ICL 1900. By 1970 we'd moved our sources to disk and the card decks had become individual edit/compile/test jobs for George 1. A typical job pack would be no more than 50-100 cards. You kept and reshuffled the commands, replacing the edits and test data as needed.
I may be misremembering, but I have the impression that IBM mainframe shops retained source as card decks a lot longer than we did. Certainly, when I did a job in an IBM System/3 shop in NYC in 1976 all program sources and, indeed, the master files as well were still on cards: those nasty little 96 column jobbies.
Eee, lad. Tell that to the young people of today and they'll not believe you.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Roedy Green - 20 Aug 2007 11:50 GMT On Sun, 19 Aug 2007 22:00:54 +0100, Martin Gregorie <martin@see.sig.for.address> wrote, quoted or indirectly quoted someone who said :
>I may be misremembering, but I have the impression that IBM mainframe >shops retained source as card decks a lot longer than we did Univac required mainframes to be sold with a card reader at least as late as 1976. Card readers were perfected shortly after they went obsolete. Air fanned the cards and sucked the top card off the deck. Early ones used a knife edge picker that shredded any card with a tiny burr on the edge. You had to keep reproducing entire decks to keep the edges clean.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Roedy Green - 20 Aug 2007 14:37 GMT >2. Ease of changes in the middle of a file. The two procedures for tape >were the one Roedy described above, and physical cut-and-splice. >Splicing increased the risk of mechanical problems. Contrast that with >inserting and removing cards in the middle of a card deck. The old mechanical equipment was much more impressive than today's pizza boxes. An optical paper tape reader shot tape out so fast it formed a 12 foot stream in the air. A 300 LPM printer thundered with the majesty of a Robocop. I was shocked, never having seen printing faster than about 45 CPS before. Unit record equipment made all manner of whirring and kachunking noises that would shake the building. I remember writing a device driver for a Univac OCR device. You had X milliseconds to decide what to do with the document after you read it, which pocket to direct it to. It was a strange thing made of rubber belts. On a 16K machine we did multithread lookahead i/o -- something modern Java programs still do NOT do.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Martin Gregorie - 20 Aug 2007 17:54 GMT > The old mechanical equipment was much more impressive than today's > pizza boxes. An optical paper tape reader shot tape out so fast it > formed a 12 foot stream in the air. Yep. ICL used Elliott 1200 cps paper tape readers, so it moved at 120 ins/sec. Big arcs of tape. The most impressive jam I ever saw was when a bit of sticky tape got left on the end of a reel, which caught on the drive roller. The reader pulled tape out of the bin at 120 ins/sec until the space between roller and its guard was jammed solid and the reader stalled. Even the engineers were impressed - and took forever to clear the reader.
> A 300 LPM printer thundered with > the majesty of a Robocop. I was shocked, never having seen printing > faster than about 45 CPS before. We had a 1250 lpm drum printer. It was generally noisy but when you printed a line of asterisks it made the most godawful KLANG as all 132 print hammers hit the drum simultaneously. It was a sufficiently fast printer to need a power stacker which pulled in the paper to stack it: the machine could page throw at about 3 feet a second and the stacker had to keep up.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Roedy Green - 21 Aug 2007 11:31 GMT On Mon, 20 Aug 2007 17:54:00 +0100, Martin Gregorie <martin@see.sig.for.address> wrote, quoted or indirectly quoted someone who said :
>We had a 1250 lpm drum printer. It was generally noisy but when you >printed a line of asterisks it made the most godawful KLANG as all 132 >print hammers hit the drum simultaneously. It was a sufficiently fast >printer to need a power stacker which pulled in the paper to stack it: >the machine could page throw at about 3 feet a second and the stacker >had to keep up. I presume you were an "operator" at some point in your career and had a faulty mylar tape loop that controlled the vertical tab stops on the printer, causing the paper to slew endlessly at full rate. If it happened when the covers were up you had a great arc in the air. If closed, it packed the printer cover tight as a mummy case. To stop it you stomped your foot on the input paper box to break the paper.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Martin Gregorie - 22 Aug 2007 00:47 GMT > I presume you were an "operator" at some point in your career and had > a faulty mylar tape loop that controlled the vertical tab stops on the > printer, causing the paper to slew endlessly at full rate. If it > happened when the covers were up you had a great arc in the air. If > closed, it packed the printer cover tight as a mummy case. To stop it > you stomped your foot on the input paper box to break the paper. We were a small service bureau with a 1903S to keep busy. Among the systems staff we did everything - analyzed, designed, coded and, when necessary, operated too. I was never good enough to know what George 3 wanted by listening to the control teletype, but I could tell "LP 3 FIX" when I was lining up paper from requests to, e.g. load a magnetic tape. I knew operators who could drive the system entirely off sound for an hour or so when the teletype's print head failed.
I don't remember our fast printer ever turning into a paper fountain - or the paper tape loop breaking, but we did tend to use tougher material than plain paper tape for production loops. I seem to remember that the 1900 printer would only throw about 3 feet of paper (i.e. about two pages) before timing out and stopping. I know for sure that I never broke the feed paper to stop the printer.
The 2900 printers were a nice improvement: they used a software implementation of the paper loop and as well as telling the spooler what sort of paper the job needed, you also told it what control loop to load.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
John W. Kennedy - 21 Aug 2007 02:50 GMT > On Sat, 18 Aug 2007 16:11:45 +0100, Martin Gregorie > <martin@see.sig.for.address> wrote, quoted or indirectly quoted [quoted text clipped - 8 lines] > students to understand the difference between the number 0 and the > character 0 if they had the same binary representation. ASCII was designed more for telegraphy and interchange media than for internal use.
 Signature John W. Kennedy "Information is light. Information, in itself, about anything, is light." -- Tom Stoppard. "Night and Day"
Mike Schilling - 21 Aug 2007 06:06 GMT >> On Sat, 18 Aug 2007 16:11:45 +0100, Martin Gregorie >> <martin@see.sig.for.address> wrote, quoted or indirectly quoted [quoted text clipped - 11 lines] > ASCII was designed more for telegraphy and interchange media than for > internal use. You mean it was invented for one purpose, pressed into use for another, and is still being used for the one it's not well suited for, long after the one it was designed for has more or less disappeared? Geez, how often does that happen? :-)
Lew - 21 Aug 2007 06:13 GMT Roedy Green wrote:
>>> ASCII has the same strangeness. John W. Kennedy wrote:
>> ASCII was designed more for telegraphy and interchange media than for >> internal use.
> You mean it was invented for one purpose, pressed into use for another, and > is still being used for the one it's not well suited for, long after the one > it was designed for has more or less disappeared? Geez, how often does that > happen? :-) Set to music, it's a vital tool for corporate advancement:
You gotta do some ASCII sing.
 Signature Lew
John W. Kennedy - 21 Aug 2007 02:39 GMT > And no, > I have no idea why 0-9 are F0-F9 rather than 00-09. To match the existing collating sequences. (Many pre-360 machines implemented, in their hardware, collating sequences that did not correspond to the binary values of their character encodings; EBCDIC was designed so that the 360 would not have that anomaly.)
 Signature John W. Kennedy "Never try to take over the international economy based on a radical feminist agenda if you're not sure your leader isn't a transvestite." -- David Misch: "She-Spies", "While You Were Out"
Roedy Green - 21 Aug 2007 11:34 GMT On Mon, 20 Aug 2007 21:39:59 -0400, "John W. Kennedy" <jwkenne@attglobal.net> wrote, quoted or indirectly quoted someone who said :
>To match the existing collating sequences. (Many pre-360 machines >implemented, in their hardware, collating sequences that did not >correspond to the binary values of their character encodings; EBCDIC was >designed so that the 360 would not have that anomaly.) I don't follow. EBCDIC '0' is not binary 0. Further, IIRC, the letters A-Z and a-z are not contiguous blocks of binary assignments.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Martin Gregorie - 22 Aug 2007 01:04 GMT > On Mon, 20 Aug 2007 21:39:59 -0400, "John W. Kennedy" > <jwkenne@attglobal.net> wrote, quoted or indirectly quoted someone who [quoted text clipped - 7 lines] > I don't follow. EBCDIC '0' is not binary 0. Further, IIRC, the > letters A-Z and a-z are not contiguous blocks of binary assignments. I think the approach is fairly clear: you adjust the binary code values so that sorting on ascending code value gives you the collation sequence you want. In the case of EBCDIC that's pretty weird because the gaps between I and J and between R and S are not empty: they contain a wild assortment of punctuation and other symbols.
John says that the collation sequence predates EBCDIC. I'll go further and guess that it predates computers as well. It was most likely defined by IBM's original card sorters: businesses were running card-based accounting systems in the '30s if not earlier using a room full of sorters, collators and other electro-mechanical monsters.
FWIW the Manhattan Project calculations for the plutonium bomb design were run using IBM card handling kit under the direction of Richard Feynman. It was a faster replacement for the armies of girls with hand-cranked Monroe calculators who had been doing the job. IIRC Feynman thought up the idea of using punched cards.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
John W. Kennedy - 22 Aug 2007 22:03 GMT >> On Mon, 20 Aug 2007 21:39:59 -0400, "John W. Kennedy" >> <jwkenne@attglobal.net> wrote, quoted or indirectly quoted someone who [quoted text clipped - 13 lines] > between I and J and between R and S are not empty: they contain a wild > assortment of punctuation and other symbols. They do /now/, in EBCDIC-version-of-ISO-8859-1 and the like. But all those spaces were empty in 1964. Apart from control characters and lower-case letters, the original EBCDIC had only about 64 characters. So the 64 characters of traditional BCD collated in EBCDIC more or less as they always had, but with a straight binary compare, instead of special collating-sequence hardware.
(That special hardware is why the basic-model 1401 Compare instruction could only compare equal/not-equal; numerics could be high/low/equal compared with a subtraction, but if you wanted to high/low/equal compare alphameric data, you had to both buy a hardware add-on and accept slightly reduced CPU performance.)
Mainframes are slowly moving away from EBCDIC, of course. The newest System Z machines include full support of Unicode, including opcodes to translate among UTF-8, UTF-16, and UTF-32.
> John says that the collation sequence predates EBCDIC. I'll go further > and guess that it predates computers as well. It was most likely defined > by IBM's original card sorters: businesses were running card-based > accounting systems in the '30s if not earlier using a room full of > sorters, collators and other electro-mechanical monsters. Pretty much, yes.
 Signature John W. Kennedy If Bill Gates believes in "intelligent design", why can't he apply it to Windows?
Andreas Leitgeb - 26 Jul 2007 10:52 GMT > I am trying to read an int from a file written in binary. After > researching that archaic file structure, I have found that it is stored as > little endian/ least significant first. you've got two ways from here: 1.) read it in as an integer, and then do mask&shift-magic on the integer to obtain an endian-swapped version of it. 2.) read four bytes separately, and compose them to an integer.
Anyway, you need to be aware of how the separate bits constitute the final result: in the stream you have four bytes, b1 b2 b3 b4. The integer value you want is: 0x1*b1 + 0x100*b2 + 0x10000*b3 + 0x1000000*b4 by nature of little ends :-)
Multiplication by these constants is equivalent to *left*-shifting by 0,8,16,24 bits respectively. (division would be *right*-shifting)
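A minimal sketch of that equivalence follows. The class and method names are made up for illustration, and the four arguments are assumed to already be unsigned byte values in the range 0 to 255. It composes the little-endian value once by multiplication and once by left-shifting:

```java
// Illustrative sketch only; the class and method names are hypothetical.
final class ShiftVersusMultiply {

    // Little-endian composition by multiplication, following the formula above.
    // Each argument is assumed to be an unsigned byte value in 0..255.
    static int composeByMultiply(int b1, int b2, int b3, int b4) {
        return 0x1 * b1 + 0x100 * b2 + 0x10000 * b3 + 0x1000000 * b4;
    }

    // The same composition with left shifts by 0, 8, 16 and 24 bits.
    static int composeByShift(int b1, int b2, int b3, int b4) {
        return (b1 << 0) | (b2 << 8) | (b3 << 16) | (b4 << 24);
    }

    public static void main(String[] args) {
        // Bytes in the order they appear in the little-endian stream.
        int viaMultiply = composeByMultiply(0x44, 0x33, 0x22, 0x11);
        int viaShift = composeByShift(0x44, 0x33, 0x22, 0x11);
        System.out.println(Integer.toHexString(viaMultiply)); // 11223344
        System.out.println(Integer.toHexString(viaShift));    // 11223344
    }
}
```

Because the four shifted pieces never share a bit position, adding them and OR-ing them produce the same bit pattern.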
If you read in the integer canonically from stream, you actually get this number: 0x1000000*b1 + 0x10000*b2 + 0x100*b3 + 0x1*b4 by nature of big ends.
So you'd have to do shifting, masking and finally adding to re-arrange the bit-patterns of the integer.
Sometimes, shifting does the masking for you: if you divide the whole number by 0x1000000 (>>>24; in Java the unsigned shift is needed, because >> would copy a set top bit downwards), it's obvious that only b1 remains. If you right-shift by 8 bits, then obviously 0x10000*b1 + 0x100*b2 + 0x1*b3 remains, and after masking with 0xff00, only 0x100*b2 remains, another one of the parts which your desired result consists of.
That is: to extract b1 from that wrong-endian int, you just have to *right*-shift it by 24 bits (unsigned, >>>); the other bits vanish by themselves, so no masking is necessary here. For b2, you'd shift the original number only 8 bits to the *right* (to change its factor from 0x10000 to 0x100), and then mask it with 0xff00 (adapted to the target position of b2's bits). For b3 you first mask (again the original value) with 0xff00 and then *left*-shift 8 bits, and for b4 you only need to *left*-shift the original number by 24 bits; like b1, no masking is necessary. All these separately shifted octets are then re-assembled, either with the or-operator "|", or (in this case equivalently) by adding.
I hope, it helped and wasn't itself more complicated than the original problem ;-)
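The four shift-and-mask steps from this post can be sketched in Java roughly as follows. The class and method names are invented for the example, and the input is assumed to be an int that was read with the wrong byte order:

```java
// Illustrative sketch only; the class and method names are hypothetical.
final class ManualByteSwap {

    // Reverses the four bytes of an int using the steps described in the post.
    static int swap(int wrong) {
        int b1 = wrong >>> 24;           // top byte moves to the bottom; >>> shifts in zeros
        int b2 = (wrong >> 8) & 0xff00;  // second byte moves down 8 bits, then is masked out
        int b3 = (wrong & 0xff00) << 8;  // third byte is masked out, then moves up 8 bits
        int b4 = wrong << 24;            // bottom byte moves to the top; lower bits become zero
        return b1 | b2 | b3 | b4;        // the pieces do not overlap, so | and + agree
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(swap(0x44332211))); // 11223344
    }
}
```

The unsigned shift `>>>` matters for b1: with the signed `>>`, a top byte of 0x80 or more would drag copies of the sign bit into the result.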
Lew - 26 Jul 2007 14:30 GMT >> I am trying to read an int from a file written in binary. After >> researching that archaic file structure, I have found that it is stored as [quoted text clipped - 48 lines] > I hope, it helped and wasn't itself more complicated than the > original problem ;-) You can also use a java.nio.IntBuffer, which "knows" about endianness through its java.nio.ByteOrder.
 Signature Lew
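As a hedged illustration of the java.nio suggestion above: the sketch below wraps four bytes in a ByteBuffer, sets its order to little-endian, and views it as an IntBuffer. The class name, method name and sample bytes are assumptions made for the example; in a real program the bytes would come from the file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Illustrative sketch only; the class and method names are hypothetical.
final class NioLittleEndian {

    // Interprets the first four bytes of the array as a little-endian int.
    static int readLittleEndianInt(byte[] fourBytes) {
        IntBuffer ints = ByteBuffer.wrap(fourBytes)
                .order(ByteOrder.LITTLE_ENDIAN) // a ByteBuffer is big-endian unless told otherwise
                .asIntBuffer();                 // the int view takes over the byte order set above
        return ints.get(0);
    }

    public static void main(String[] args) {
        byte[] fromFile = {0x44, 0x33, 0x22, 0x11}; // sample bytes, least significant first
        System.out.println(Integer.toHexString(readLittleEndianInt(fromFile))); // 11223344
    }
}
```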
John W. Kennedy - 02 Aug 2007 02:26 GMT >> I am trying to read an int from a file written in binary. After >> researching that archaic file structure, I have found that it is stored as [quoted text clipped - 4 lines] > on the integer to obtain an endian-swapped version of it. > 2.) read four bytes separately, and compose them to an integer. 3.) in Java 1.5 and up, read it as an integer and then use Short.reverseBytes(), Integer.reverseBytes(), or Long.reverseBytes(), as appropriate.
 Signature John W. Kennedy "The first effect of not believing in God is to believe in anything...." -- Emile Cammaerts, "The Laughing Prophet"
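Option 3 above might look like the following sketch. It reads the int with DataInputStream, which always assumes big-endian order, and then corrects it with Integer.reverseBytes(). The class name, method name and the in-memory sample bytes are stand-ins chosen for the example; a real program would wrap a FileInputStream instead.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only; the class and method names are hypothetical.
final class ReverseBytesDemo {

    // Reads one int big-endian (the only order DataInputStream knows),
    // then reverses its bytes to recover the little-endian value.
    static int readLittleEndianInt(byte[] fileBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileBytes))) {
            return Integer.reverseBytes(in.readInt());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] fileBytes = {0x44, 0x33, 0x22, 0x11}; // sample bytes, least significant first
        System.out.println(Integer.toHexString(readLittleEndianInt(fileBytes))); // 11223344
    }
}
```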
Andreas Leitgeb - 05 Aug 2007 00:24 GMT > 3.) in Java 1.5 and up, read it as an integer and then use > Short.reverseBytes(), Integer.reverseBytes(), or Long.reverseBytes(), as > appropriate. You're of course right, but I (perhaps mis-)understood the original poster that he wanted to understand the details of bit-shifting used for byte-reversing an int.
Roedy Green - 26 Jul 2007 14:21 GMT > have seen code to read this as an >int in Java but don't understand what is happening. I guess that I just >don't understand what is going on with the shifting of bits and "anding" >with other values. see http://mindprod.com/jgloss/endian.html
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Roedy Green - 26 Jul 2007 14:35 GMT > shifting of bits and "anding" for general background on bit fiddling see http://mindprod.com/jgloss/binary.html
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Nigel Wade - 26 Jul 2007 14:48 GMT > Howdy folks, > I am trying to read an int from a file written in binary. After > researching that archaic file structure, I have found that it is stored as > little endian/ least significant first. An interesting viewpoint, binary data as an "archaic" file structure. I wonder how you would store your data in a non-binary form. Don't forget that a "text" file is simply binary interpreted in a very specific way, and one person's (ASCII) "text" file may be another person's (EBCDIC) binary garbage.
> I have seen code to read this as an > int in Java but don't understand what is happening. I guess that I just [quoted text clipped - 7 lines] > > Thanks! I will do my best to explain - without using any code.
Endian-ness is fun. It adds excitement and joy to the otherwise tedious task of developing portable code to read arbitrary binary data formats. Java has made the life of the data processor much less interesting by taking this task and wrapping it up in the ByteBuffer class. However, for the purposes of learning it is a good thing to understand what is going on behind the scenes.
There are [essentially] two types of endianness, big-endian and little-endian. Big-endian hardware stores bytes in memory in their "natural" format, with the "big" end on the "left" (lower memory address). Little-endian hardware was designed to do the opposite, just to be awkward.
Let's assume we have 3 variables, containing a char, a 16bit int ("short") and a 32bit int ("long"). We'll assign the hex values of 0x11, 0x1122 and 0x11223344 to these variables respectively. If these variables occupied consecutive memory addresses (or were output to a binary file in sequence) on big-endian hardware the contents of memory would be 0x11, 0x11, 0x22, 0x11, 0x22, 0x33, 0x44. On little-endian hardware the values would be 0x11, 0x22, 0x11, 0x44, 0x33, 0x22, 0x11. As you can see, little-endian hardware has reversed the bytes of each value. (NOTE: If you write binary data from Java with DataOutputStream it is *always* output in big-endian order, whatever the hardware).
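The two byte sequences above can be reproduced with a short experiment. This is only a sketch: the class and method names are made up, and a ByteBuffer stands in for "memory" or "the file", with its byte order switched between the two conventions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only; the class and method names are hypothetical.
final class ByteLayoutDemo {

    // Writes 0x11, 0x1122 and 0x11223344 one after another in the given
    // byte order and returns the resulting seven bytes as hex text.
    static String layout(ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(7).order(order);
        buf.put((byte) 0x11);           // single byte: order is irrelevant
        buf.putShort((short) 0x1122);   // 16-bit value
        buf.putInt(0x11223344);         // 32-bit value
        StringBuilder hex = new StringBuilder();
        for (byte b : buf.array()) {
            hex.append(String.format("%02x ", b));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(layout(ByteOrder.BIG_ENDIAN));    // 11 11 22 11 22 33 44
        System.out.println(layout(ByteOrder.LITTLE_ENDIAN)); // 11 22 11 44 33 22 11
    }
}
```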
If you write the data to a file and read it back on the same hardware using the same variable types [and the same language] then there is no problem. The bytes will be stored in the correct locations and the variables will have the correct contents. The fun comes when you read the data as a byte array, or attempt to read it on the other type of hardware or use a language which makes different assumptions about the type of data.
To see how it all goes horribly wrong let's try to read the little-endian data file (written by some language other than Java) into Java. Remember, the order of the bytes in the little-endian binary file is 11221144332211. So we read the first byte and treat it as a byte, and this is ok. Next we read the two bytes 0x22 and 0x11 and get the short integer 0x2211, not what we wanted at all. The situation is the same for the "long" integer which will contain 0x44332211. This is where bit shifting and masking become necessary (if you don't use Java or don't use ByteBuffer in Java): the contents of the "short" and "long" integers have to be reversed.
You can do this more easily by reading into a byte array and extracting the correct bytes. For example, for the 4-byte "long" integer, reading the bytes into a byte array you will get array[0]=0x44, array[1]=0x33, array[2]=0x22 and array[3]=0x11. To construct the correct integer (0x11223344) you need to shift array[3] left 24 places so it becomes 0x11000000, combine that with array[2] left shifted 16 bits (0x220000) etc. How you write the code to do this is up to you, I said I wouldn't use any code.
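For readers who do want to see code, the byte-array approach might be sketched like this. The class and method names are invented for the example. The `& 0xff` masks are an addition that the description above leaves out: Java's byte type is signed, so without them a byte of 0x80 or more would be sign-extended before the shift.

```java
// Illustrative sketch only; the class and method names are hypothetical.
final class ComposeFromBytes {

    // Builds an int from four bytes stored least significant first.
    static int littleEndianToInt(byte[] array) {
        return  (array[0] & 0xff)          // lowest byte stays in place
             | ((array[1] & 0xff) << 8)
             | ((array[2] & 0xff) << 16)
             | ((array[3] & 0xff) << 24);  // highest byte, e.g. 0x11 becomes 0x11000000
    }

    public static void main(String[] args) {
        byte[] array = {0x44, 0x33, 0x22, 0x11};
        System.out.println(Integer.toHexString(littleEndianToInt(array))); // 11223344
    }
}
```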
 Signature Nigel Wade, System Administrator, Space Plasma Physics Group, University of Leicester, Leicester, LE1 7RH, UK E-mail : nmw@ion.le.ac.uk Phone : +44 (0)116 2523548, Fax : +44 (0)116 2523555
NoNeYa - 28 Jul 2007 04:40 GMT >> Howdy folks, >> I am trying to read an int from a file written in binary. After [quoted text clipped - 8 lines] > file is simply binary interpreted in a very specific way, and one persons > (ASCII) "text" file may be another persons (EBCDIC) binary garbage. I think I may have worded that a little odd. I meant to say "The *.DBF file structure itself is archaic, but it does use binary storage in its header".
>> I have seen code to read this as an >> int in Java but don't understand what is happening. I guess that I just [quoted text clipped - 91 lines] > up to > you, I said I wouldn't use any code.
Mark Space - 30 Jul 2007 20:33 GMT > I think I may have worded that a little odd. I meant to say "The *.DBF file So did any of these explanations help you out?
NoNeYa - 30 Jul 2007 21:37 GMT >> I think I may have worded that a little odd. I meant to say "The *.DBF >> file > > So did any of these explanations help you out? Some have clarified the situation "some-what". I do realize that I am in "way over my head" for my level of education in programming. I am continuing to learn more elsewhere and have asked one of my professors for additional sources of reading. I thank all that responded. To directly answer your question, all of the replies have educated me somewhat, but I am still underwater looking up at the top. I am using other sources to learn more and have purchased another book to read. I just wish I could find a book that deals with reading binary and using binary operators, in a "baby step" process with great explanation and code examples. The problem isn't resolved.... but I ain't givin' up yet!
Thanks.