In languages like C and C++, there are a variety of integer sizes: char, short, int, long. (char isn't really an integer, but you can use it like one, and most people use it in C for really small integers.) On most 32 bit systems, these correspond to 1 byte, 2 byte, 4 byte, and 4 byte numbers, with an 8 byte long long for good measure. But note, do not make the mistake of assuming this is true everywhere; the C standard only guarantees minimum sizes. Java, on the other hand, since it is designed to be a portable platform, guarantees that no matter what platform you are running on, a 'byte' is 1 byte, a 'short' is 2 bytes, an 'int' is 4 bytes, and a 'long' is 8 bytes. However, C also provides 'unsigned' versions of each of its integer types, which Java does not. I find it incredibly annoying that Java doesn't deal with unsigned types, considering the huge number of hardware interfaces, network protocols, and file formats that use them.
(The only exception is that Java does provide the 'char' type, which is a 2 byte representation of Unicode, instead of the C 'char' type, which is 1 byte ASCII. Java's 'char' can also be used as an unsigned short, i.e. it represents numbers from 0 up to 2^16 - 1. The only gotchas are: weird things will happen if you try to assign it to a short, and if you try to print it, you'll get the Unicode character it represents instead of the integer value. If you need to print the value of a char, cast it to int first.)
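To illustrate that last gotcha, here's a minimal sketch (the class name is my own) showing what printing a char does versus printing its integer value:

```java
public class CharDemo {
    public static void main(String[] args) {
        char c = 65535;                // the maximum value a char can hold (2^16 - 1)
        System.out.println((int) c);   // prints 65535 -- cast to int to get the number
        System.out.println('A');       // prints the character A, not its numeric value
        System.out.println((int) 'A'); // prints 65
    }
}
```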
The answer is, you use the signed type that is larger than the original unsigned type. That is, use a short to hold an unsigned byte, and use a long to hold an unsigned int. (And use a char to hold an unsigned short.) Yeah, this kinda sucks because now you're using twice as much memory, but there really is no other solution. (Also bear in mind, access to longs is not guaranteed to be atomic - although if you're using multiple threads, you really should be using synchronization anyway.)
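To see why the wider type is necessary, here's a small sketch (class name my own) of what happens when an unsigned byte value over 127 lands in Java's signed byte:

```java
public class SignDemo {
    public static void main(String[] args) {
        byte b = (byte) 200;          // the unsigned byte value 200
        System.out.println(b);        // prints -56: Java interprets the high bit as a sign

        short s = (short) (b & 0xFF); // mask and widen to recover the unsigned value
        System.out.println(s);        // prints 200
    }
}
```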
One issue that I really need to go into here is byte order aka endianness, but for the moment we're going to assume (or hope) that whatever you're trying to read is in what is referred to as 'network byte order' aka 'big endian' aka Java's standard endianness.
short anUnsignedByte = 0;
char anUnsignedShort = 0;
long anUnsignedInt = 0;
int firstByte = 0;
int secondByte = 0;
int thirdByte = 0;
int fourthByte = 0;
byte[] buf = getMeSomeData();

// Check to make sure we have enough bytes
if (buf.length < (1 + 2 + 4))
    doSomeErrorHandling();

int index = 0;

// unsigned byte -> short
firstByte = (0x000000FF & ((int) buf[index]));
index++;
anUnsignedByte = (short) firstByte;

// unsigned short -> char
firstByte  = (0x000000FF & ((int) buf[index]));
secondByte = (0x000000FF & ((int) buf[index + 1]));
index = index + 2;
anUnsignedShort = (char) (firstByte << 8 | secondByte);

// unsigned int -> long
firstByte  = (0x000000FF & ((int) buf[index]));
secondByte = (0x000000FF & ((int) buf[index + 1]));
thirdByte  = (0x000000FF & ((int) buf[index + 2]));
fourthByte = (0x000000FF & ((int) buf[index + 3]));
index = index + 4;
anUnsignedInt = ((long) (firstByte << 24 | secondByte << 16 | thirdByte << 8 | fourthByte)) & 0xFFFFFFFFL;
OK, now that all looks a little complicated. But really it is straightforward. First off, you see a lot of
0x000000FF & (int)buf[index]
What is going on there is that we are promoting a (signed) byte to int, and then doing a bitwise AND operation on it to wipe out everything but the first 8 bits. Because Java treats the byte as signed, if its unsigned value is above 127, the sign bit will be set (strictly speaking it is not "the sign bit" since numbers are encoded in two's complement) and it will appear to Java to be negative. When it gets promoted to int, bits 0 through 7 will be the same as the byte, and bits 8 through 31 will be set to 1. So the bitwise AND with 0x000000FF clears out all of those bits. Note that this could have been written more compactly as:
0xFF & buf[index]

Java assumes the leading zeros for 0xFF, and the bitwise & operator automatically promotes the byte to int. But I wanted to be a tad more explicit about it.
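Here's a small sketch (class name my own) of the sign extension and the masking side by side:

```java
public class MaskDemo {
    public static void main(String[] args) {
        byte b = (byte) 0xFF;  // all eight bits set; as a signed byte, this is -1
        int promoted = b;      // sign extension: bits 8 through 31 all become 1
        System.out.println(Integer.toHexString(promoted)); // prints ffffffff

        int masked = 0xFF & b; // the AND wipes out the sign-extended bits
        System.out.println(masked);                        // prints 255
    }
}
```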
The next thing you'll see a lot of is <<, the bitwise shift left operator. It shifts the bit pattern of the left int operand left by as many bits as you specify in the right operand. So, if you have some int foo = 0x000000FF, then (foo << 8) == 0x0000FF00, and (foo << 16) == 0x00FF0000.
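That example can be checked directly (class name my own):

```java
public class ShiftDemo {
    public static void main(String[] args) {
        int foo = 0x000000FF;
        // Each shift by 8 moves the byte one position to the left
        System.out.println(Integer.toHexString(foo << 8));  // prints ff00
        System.out.println(Integer.toHexString(foo << 16)); // prints ff0000
    }
}
```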
The last piece of the puzzle is |, the bitwise OR operator. Assume you've loaded both bytes of an unsigned short into separate integers, so you have 0x00000012 and 0x00000034. Now you shift one of the bytes by 8 bits to the left, so you have 0x00001200 and 0x00000034, but you still need to stick them together. So you bitwise OR them, and you have 0x00001200 | 0x00000034 = 0x00001234. This is then stored into Java's 'char' type.
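The shift-and-OR combination from that paragraph looks like this in code (class name my own):

```java
public class OrDemo {
    public static void main(String[] args) {
        int high = 0x00000012;  // the more significant byte
        int low  = 0x00000034;  // the less significant byte

        // Shift the high byte into position, then OR the two together
        char combined = (char) (high << 8 | low);
        System.out.println(Integer.toHexString(combined)); // prints 1234
    }
}
```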
That's basically it, except that in the case of the unsigned int, you have to now store it into the long, and you're back up against that sign extension problem we started with. No problem, just cast your int to long, then do the bitwise AND with 0xFFFFFFFFL. (Note the trailing L to tell Java this is a literal of type 'long' integer.)
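Here's a sketch (class name my own) of that last step, using a value whose top bit is set so the sign extension problem actually shows up:

```java
public class LongMaskDemo {
    public static void main(String[] args) {
        // Four bytes that assemble to 0xFF000001, which overflows a signed int
        int assembled = 0xFF << 24 | 0x00 << 16 | 0x00 << 8 | 0x01;
        System.out.println(assembled);             // prints -16777215

        // Cast to long, then AND with 0xFFFFFFFFL to clear the extended sign bits
        long unsignedValue = ((long) assembled) & 0xFFFFFFFFL;
        System.out.println(unsignedValue);         // prints 4278190081
    }
}
```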
// Going the other way: break the values back down into bytes
buf[0] = (byte) ((anUnsignedInt & 0xFF000000L) >> 24);
buf[1] = (byte) ((anUnsignedInt & 0x00FF0000L) >> 16);
buf[2] = (byte) ((anUnsignedInt & 0x0000FF00L) >> 8);
buf[3] = (byte)  (anUnsignedInt & 0x000000FFL);

buf[4] = (byte) ((anUnsignedShort & 0xFF00) >> 8);
buf[5] = (byte)  (anUnsignedShort & 0x00FF);

buf[6] = (byte)  (anUnsignedByte & 0xFF);
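The two halves fit together; here's a minimal round-trip sketch (class and variable names are my own) that packs an unsigned short into big endian bytes and reads it back:

```java
public class RoundTrip {
    public static void main(String[] args) {
        char anUnsignedShort = 54321;   // an unsigned 16-bit value, i.e. 0xD431

        // Encode: mask out each byte and shift it down into position
        byte[] buf = new byte[2];
        buf[0] = (byte) ((anUnsignedShort & 0xFF00) >> 8);
        buf[1] = (byte)  (anUnsignedShort & 0x00FF);

        // Decode: promote, mask, shift, and OR -- exactly as above
        int firstByte  = 0xFF & buf[0];
        int secondByte = 0xFF & buf[1];
        char decoded = (char) (firstByte << 8 | secondByte);
        System.out.println((int) decoded);  // prints 54321
    }
}
```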
You care about this because if you assume data is written in say, 'big endian' format, and write your code accordingly, and it turns out the data is in 'little endian' format, you're going to get garbage. And vice versa for assuming 'little endian' for data written in 'big endian'.
Every number, whether expressed as digits, i.e. the number 500,000,007 or as bytes, i.e. the four byte integer (as hexadecimal) 0x1DCD6507, can be thought of as a string of digits. And this string of digits can be thought of as having a starting, or left, 'end', and a finishing, or 'right' end. In English, the first digit in a number is always the biggest (or most significant) digit - i.e. the 5 in 500,000,007 actually represents the value 500,000,000. The last digit in a number is always the littlest (or least significant) digit - i.e. the 7 in 500,000,007 represents the value 7.
When we talk about Endianness, or byte order, we are talking about the order in which we write down the digits. Do we write the biggest (most significant) digit first, and then the next biggest, etc etc, until we reach the littlest (least significant) digit? Or do we start with the littlest digit first? In English we always write the biggest one first, hence English could be described as "big endian". (There are human languages where they do it the other way around. My friend Raichle suggests Hebrew as an example of one such language, but she's not really sure.)
In the example above, the value 500,000,007, in hexadecimal, is 0x1DCD6507. And if we break it up into four separate bytes, the byte values are 0x1D, 0xCD, 0x65, and 0x07. In decimal those four bytes are (respectively) 29, 205, 101, and 7. The biggest byte is the first one, 29, which represents the value 29 * 256 * 256 * 256 = 486539264. The second biggest is the second byte, 205, which represents the value 205 * 256 * 256 = 13434880. The third biggest byte is 101, which represents the value 101 * 256 = 25856, and the littlest byte is 7, which represents the value 7 * 1 = 7. And 486539264 + 13434880 + 25856 + 7 = 500,000,007.
When the computer stores those four bytes into four locations in memory, let's say into memory addresses 2056, 2057, 2058, and 2059, the question is, what order does it store them in? It might put 29 in 2056, 205 in 2057, 101 in 2058, and 7 in 2059, just like you'd write down the number in English. If it did, then we would say that computer is "big endian". However, a different computer architecture might just as easily store those bytes by putting 7 in 2056, 101 in 2057, 205 in 2058, and 29 in 2059. If the computer stored the bytes in that manner then we would say that it was "little endian".
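To see the difference concretely, here's a sketch (class name my own) that decodes those same four bytes under both byte orders:

```java
public class EndianDemo {
    public static void main(String[] args) {
        byte[] buf = { 0x1D, (byte) 0xCD, 0x65, 0x07 };

        // Read as big endian: first byte is the most significant
        long bigEndian = ((long) ((0xFF & buf[0]) << 24 | (0xFF & buf[1]) << 16
                | (0xFF & buf[2]) << 8 | (0xFF & buf[3]))) & 0xFFFFFFFFL;
        System.out.println(bigEndian);     // prints 500000007

        // Read as little endian: first byte is the least significant
        long littleEndian = ((long) ((0xFF & buf[3]) << 24 | (0xFF & buf[2]) << 16
                | (0xFF & buf[1]) << 8 | (0xFF & buf[0]))) & 0xFFFFFFFFL;
        System.out.println(littleEndian);  // prints 124112157 -- garbage, if you
                                           // expected the big endian value
    }
}
```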
Note that this also applies to how the computer stores 2 byte shorts and 8 byte longs. Also note that the "biggest" byte is also referred to as the "Most Significant Byte" (MSB) and the "littlest" byte is also referred to as the "Least Significant Byte" (LSB), so you will oftentimes see those two phrases or acronyms popping up.
But what happens when you're dealing with data generated by other languages? Well, now you have to pay attention to things a bit. You have to make sure that you decode the bytes in the same order as they were originally encoded in, and likewise make sure that you encode the bytes in the same order as they will be decoded in. If you're lucky this will be specified somewhere in the spec or API for whatever protocol or file format you're dealing with. If you're unlucky... well. Good luck.
The biggest problem is simply remembering what your byte order is, and knowing what the byte order of the data you are reading in is. If they're not the same, you need to make sure you re-order them correctly, or in the case of dealing with unsigned numbers like above, you need to make sure you put the correct bytes into the correct parts of the integer/short/long.
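It's worth noting that the standard library's java.nio.ByteBuffer (available since Java 1.4) will do this byte-order bookkeeping for you; a small sketch, with the class name my own:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteBufferDemo {
    public static void main(String[] args) {
        byte[] buf = { 0x1D, (byte) 0xCD, 0x65, 0x07 };

        // ByteBuffer defaults to big endian (network byte order)
        int big = ByteBuffer.wrap(buf).getInt();
        System.out.println(0xFFFFFFFFL & big);    // prints 500000007

        // Ask for little endian explicitly and the same bytes decode differently
        int little = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).getInt();
        System.out.println(0xFFFFFFFFL & little); // prints 124112157
    }
}
```

You still need the & 0xFFFFFFFFL trick at the end, since getInt() hands you a signed int either way.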
> Q: Programmers often talk about the advantages and disadvantages of
> programming in a "simple language." What does that phrase mean to
> you, and is [C/C++/Java] a simple language in your view?
>
> Ritchie: [deleted for brevity]
>
> Stroustrup: [deleted for brevity]
>
> Gosling: For me as a language designer, which I don't really count
> myself as these days, what "simple" really ended up meaning was could
> I expect J. Random Developer to hold the spec in his head. That
> definition says that, for instance, Java isn't -- and in fact a lot of
> these languages end up with a lot of corner cases, things that nobody
> really understands. Quiz any C developer about unsigned, and pretty
> soon you discover that almost no C developers actually understand what
> goes on with unsigned, what unsigned arithmetic is. Things like that
> made C complex. The language part of Java is, I think, pretty
> simple. The libraries you have to look up.
On the other hand.... According to http://www.artima.com/weblogs/viewpost.jsp?thread=7555
> Once Upon an Oak ...
> by Heinz Kabutz
> July 15, 2003
> ...
> Trying to fill my gaps of Java's history, I started digging around on
> Sun's website, and eventually stumbled across the Oak Language
> Specification for Oak version 0.2. Oak was the original name of what
> is now commonly known as Java, and this manual is the oldest manual
> available for Oak (i.e. Java). ...
> Unsigned integer values (Section 3.1)
>
> The specification says: "The four integer types of widths of 8, 16, 32
> and 64 bits, and are signed unless prefixed by the unsigned modifier."
>
> In the sidebar it says: "unsigned isn't implemented yet; it might
> never be." How right you were.

The Oak Language Specification for Oak version 0.2 can be downloaded as PostScript from https://duke.dev.java.net/green/OakSpec0.2.ps or as a zipped PDF from http://www.me.umn.edu/~shivane/blogs/cafefeed/resources/14-jun-2007/OakSpec0.2.zip (assuming these links haven't broken...)