Survey of Floating-Point Formats

   This page gives a very brief summary of floating-point formats that have been implemented in hardware and software over the years. They are listed in order of increasing range (a function of exponent size) rather than by precision or chronologically.
  
range (overflow value) | precision | bits | B | We | Wm | what
9.22×10^18 = 2^(2^6−1) | 4.8 | 24 | 2 | 7 | 16 | 3-byte excess-64 [20]
1.84×10^19 = 2^(2^6) | 6.9 | 30 | 2 | 7 | 23 | AMD 9511 (1979) [5]
9.90×10^27 = 8^(8^2/2−1) | 5.1 | 24 | 8 | 6 | 17 | Octal excess-32 [12]
1.70×10^38 = 2^(2^7−1) | 8.1 | 36 | 2 | 8 | 27 | Digital PDP-10 [1,18], VAX (F and D formats) [1]; Honeywell 600, 6000 [1,16]; Univac 110x single [1]; IBM 709x, 704x [1]
3.40×10^38 = 2^(2^7) | 7.2 | 32 | 2 | 8 | 1+23 | IEEE 754 single
3.40×10^38 = 2^(2^7) | 7.2 | 32 | 2 | 8 | 1+23 | Digital PDP-11 [19], PDP 16 [6], VAX
9.99×10^49 = 10^(10^2/2) | 8.0 | 44 | 10 | 2d | 8d | Burroughs B220 [7]
4.31×10^68 = 8^76 | 11.7 | ? | 8 | 7 | 39 | Burroughs 5700, 6700, 7700 single [1,14,16,17]
7.24×10^75 = 16^63 | 7.2 | 32 | 16 | 7 | 24 | IBM 360, 370 [6]; Amdahl [1]; DG Eclipse M/600 [1]
7.24×10^75 = 16^63 | 16.8 | 64 | 16 | 7 | 56 | IBM 360 double [15]
5.79×10^76 = 2^255 | 7.2 | ? | 2 | 9 | 24 | Burroughs 1700 single [16]
1.16×10^77 = 16^64 | 7.2 | 32 | 16 | 7 | 24 | HP 3000 [1]
9.99×10^96 = 10^(3×2^5+1) | 7.0 | 32 | 10 | 8- | 7d | IEEE 754r decimal32 [3,4]
9.99×10^99 = 10^(10^2) | 10.0 | ? | 10 | 2d | 10d | Most scientific calculators
4.9×10^114 = 8^127 | 12.0 | 48 | 8 | 8 | 40 | Burroughs 7700 [6]
8.9×10^307 = 2^(2^10−1) | 14.7 | 60 | 2 | 11 | 1+48 | CDC 6000, 6600 [6], 7000 CYBER
8.9×10^307 = 2^(2^10−1) | ? | ? | ? | ? | ? | DEC VAX G format; UNIVAC 110x double [1]
1.8×10^308 = 2^(2^10) | 15.9 | 64 | 2 | 11 | 1+52 | IEEE 754 double
1.27×10^322 = 2^1070 | ? | ? | ? | ? | ? | CDC 6x00, 7x00, Cyber [1]
9.99×10^384 = 10^(3×2^7+1) | 16.0 | 64 | 10 | 10- | 16d | IEEE 754r decimal64 [3,4]
9.99×10^499 = 10^(10^3/2) | 12.0 | ? | 10 | 3d | 12d | HP 71B [13], 85 [1] calculators
9.99×10^999 = 10^(10^3) | 12.0 | ? | 10 | 3d | 12d | Texas Instruments 85, 92 calculators
9.99×10^999 = 10^(10^3) | 14.0 | ? | 10 | 3d | 14d | Texas Instruments 89 calculator [13]
9.99×10^999 = 10^(10^3) | 17.0 | 82 | 10 | 3d | 17d | 68881 Packed Decimal Real (3 BCD digits for exponent, 17 for mantissa, and two sign bits)
1.4×10^2465 = 2^(2^13−3) | 7.2 | 38? | 2 | 14 | 24 | Cray C90 half [8]
1.4×10^2465 = 2^(2^13−3) | 14.1 | 61? | 2 | 14 | 47 | Cray C90 single [8]
1.4×10^2465 = 2^(2^13−3) | 28.8 | 110? | 2 | 14 | 96 | Cray C90 double [8]
1.1×10^2466 = 2^(2^13) | ? | ? | ? | ? | ? | Cray 1 [1]
5.9×10^4931 = 2^(2^14−1) | ? | ? | ? | ? | ? | DEC VAX H format [1]
1.2×10^4932 = 2^(2^14) | 19.2 | 80 | 2 | 15 | 64 | The minimum IEEE 754 double extended size (Pentium; HP/Intel Itanium; Motorola 68040, 68881, 88110)
1.2×10^4932 = 2^(2^14) | 34.0 | 128 | 2 | 15 | 1+112 | IEEE 754r quad [2,3] (DEC Alpha [9]; IBM S/390 G5 [10])
9.99×10^6144 = 10^(3×2^11+1) | 34.0 | 128 | 10 | 14- | 34d | IEEE 754r decimal128 [3,4]
5.2×10^9824 = 2^(2^15−131) | 16.0 | ? | 2 | 16 | 47 | PRIME 50 [16]
1.9×10^29603 = 8^(2^15+12) | ? | ? | 8 | 16 | ? | Burroughs 6700, 7700 double [1,16]
4.3×10^2525222 = 2^(2^23) | ? | ? | 2 | 24 | ? | PARI
1.4×10^323228010 = 2^(2^30−1616) | ? | ? | 2 | 31 | ? | Mathematica®
≅10^2147483646 = 10^(2^31−2) | ? | ? | ? | ? | ? | Maple®


   Legend:
   B : Base of exponent. This is the factor by which your floating-point number is multiplied if you raise its exponent by 1. Modern formats like IEEE 754 all use base 2, so B is 2, and increasing the exponent field by 1 amounts to multiplying the number by 2. Older formats used base 8, 10 or 16.
   We : Width of exponent. If B is 2, 8 or 16, this is the number of bits (binary digits) in the exponent field. For the specific case of B=2, We is equal to K+1 in the equation 1 − 2^K < e < 2^K specifying the bounds of the excess-(2^K−1) exponent in an IEEE 754 representation (see below). When B is 10, there are two cases: "2d" indicates an exponent stored as 2 base-10 digits, and the letter d is included to make this clear; "8-" indicates an IEEE binary decimal format, using 2 bits in the combination field and 6 bits in the following exponent field, which together can hold only 3/4 of the values such a width would imply (because the high 2 bits cannot both be 1), so the legal values are e such that 0 ≤ e < 3×2^6.
   Wm : Width of mantissa. For binary formats with "hidden" or "implied" leading mantissa bits, this is given as "1+N", such as "1+23": the "1+" refers to the leading 1 bit, and this plus 23 actual bits gives a total of 24 bits of precision. For decimal formats the letter "d" is shown to make it clear the precision is in decimal digits.
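As a rough cross-check of how the range and precision columns follow from B, We and Wm, here is a sketch in Python. The function names are mine, and the overflow estimate deliberately ignores the off-by-one (or off-by-a-few) variations between machines that the table's equalities show:

```python
import math

def overflow_estimate(B, We):
    """Rough overflow threshold: the largest exponent is about half of
    the exponent field's range, i.e. 2**We / 2 (real machines differ
    from this by small offsets, as the table shows)."""
    return B ** (2 ** We // 2)        # exact integer; can be huge

def decimal_precision(mantissa_digits, base):
    """Equivalent number of decimal digits of precision."""
    return mantissa_digits * math.log10(base)

# IEEE 754 single: B=2, We=8, Wm=1+23 -> 24 bits of mantissa
print(math.log10(overflow_estimate(2, 8)))   # ≈ 38.5, i.e. ~3.4e38
print(decimal_precision(24, 2))              # ≈ 7.2
```

The same two calls reproduce the IEEE double row (We=11 gives about 10^308, and 53 mantissa bits give 15.9 digits).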
  

IEEE 754 Single Representation
   This is worth describing in a bit more detail because it is so prevalent in the hardware used today, and it is probably what you'll be looking at when you try to decipher a floating-point value from its "raw binary".
   First a warning: Although the "normal" values are what you see when your program is working with real data, proper handling of the rest of the values (denorms, NANs, etc.) is vitally important; otherwise you'll get all sorts of horrible results that are difficult to understand, and usually impossible to fix.
   So, for the normal values (which in this case means not including the zeros, denorms, NANs, and infinities) the value being represented can be expressed in the following form:
  
value = s × 2^(k+1−N) × n

where the sign s is -1 or 1, and k and n are integers that fall within the ranges given by:
  
1 − 2^K < k < 2^K    and    2^(N−1) − 1 < n < 2^N

for two integers K and N. If you look at the ranges of k and n you can see that k can have exactly 2^(K+1) − 2 values and n can have exactly 2^(N−1) values, and therefore exactly K+1 bits can be used to store the exponent (including two unused values discussed below) and N−1 bits to store the mantissa. To give a specific example, for IEEE 754 single precision, as the above table shows there are We=8 bits for the exponent and Wm=23 bits for the mantissa, so K is 7 and N is 24.
   The exponent is stored in "excess 2^K−1" format, which means the binary value you see is 2^K−1 bigger than the actual value of k being represented. For example, when K is 7 the excess is 127: if the stored value 253 is seen, k is 126, and the value being represented is s × 2^(127−N) × n. This is only true for the normal values just described, not for denorms.
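To make this concrete, here is a sketch in Python (the function name is mine) that pulls s, k and n out of a single-precision value's raw bits, using the excess-127 stored exponent:

```python
import struct

def decode_single(x):
    """Split an IEEE 754 single into sign s, exponent k, and integer
    mantissa n, so that x == s * 2.0**(k + 1 - N) * n with N = 24.
    Normal values only (no zeros, denorms, NANs, or infinities)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = -1 if bits >> 31 else 1
    field = (bits >> 23) & 0xFF          # stored excess-127 exponent
    n = (bits & 0x7FFFFF) | 0x800000     # restore the hidden leading 1
    k = field - 127                      # remove the excess
    return s, k, n

s, k, n = decode_single(3.0)
print(s, k, n)                       # 1 1 12582912
print(s * 2.0 ** (k + 1 - 24) * n)   # 3.0
```

The last line reconstructs the original value from the three integers, confirming the formula above.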
  

The next set of values to understand is the denormalized values (or "denorms"): very small values for which
  
k = 2 − 2^K    and    0 < n < 2^(N−1)

using the same definitions as above. These values use one of the "unused" exponent values, namely the one that is all 0 bits. They are very important because they make underflow work better: instead of jumping suddenly to 0, you lose precision gradually as you approach 0.
   In addition to making the underflow case a little less severe by losing precision gradually instead of suddenly, denormalized values eliminate a lot of strange bugs that would otherwise occur. For example, without denorms the tests "if x>y" and "if x-y>0" can yield different results; with denorms they always agree.
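Python's built-in floats are IEEE 754 doubles, which include denorms, so gradual underflow can be observed directly. A sketch, using the double-precision analogues of the single-precision bounds discussed above:

```python
# 2**-1022 is the smallest normalized double; below it lie the denorms,
# which reach down to 2**-1074.
x = 2.0 ** -1022 * 1.5     # a small normalized value
y = 2.0 ** -1022           # the smallest normalized value

print(x > y)                        # True
print(x - y > 0)                    # True: with denorms the two tests agree
print(x - y == 2.0 ** -1023)        # True: the difference is itself a denorm
print(2.0 ** -1074)                 # 5e-324, the smallest denorm
```

On a system that flushes underflow to zero, x - y would be 0 even though x > y, which is exactly the bug described above.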
  

All of the various values are arranged in such a way that hardware or software can perform comparisons treating the data as signed-magnitude integers, and as long as neither argument is a NAN the proper answer will result. Such comparisons even properly handle the infinities and negative zero. (A signed-magnitude integer is a sign bit followed by an unsigned expression of its magnitude; this is not the normal signed integer format, which is called "2's complement signed integer". As with floats, there are two ways to express 0 as a signed-magnitude integer.)
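One common way software exploits this ordering (a sketch, not from the original text: the key-mapping trick often used for radix-sorting floats) is to map each value's bits to an unsigned integer whose ordering matches numeric ordering. Negative values get all their bits inverted; positive values just get the sign bit set:

```python
import struct

def ordered_key(x):
    """Map a double's raw bits to an unsigned integer that sorts in
    numeric order (works for everything except NANs; -0 sorts just
    below +0, and the infinities land at the two ends)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    if bits >> 63:                           # negative: invert all bits
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | 0x8000000000000000         # positive: set the sign bit

vals = [3.0, -2.0, 0.5, -0.0, 0.0, float("inf"), -1e-310]
print(sorted(vals, key=ordered_key))
# [-2.0, -1e-310, -0.0, 0.0, 0.5, 3.0, inf]
```

Note that the denormalized value -1e-310 and the two zeros fall in exactly the right places, as the text promises.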
  

Here are some sample values with their binary representation. The binary digits are broken into groups of 4 to help with interpreting a value in hexadecimal. They are shown in order from largest to smallest, with the non-numbers in the places they would fall if they were sorted by their bit patterns.
  
s exponent  mantissa                     value(s)
0 111.1111.1 111.1111.1111.1111.1111.1111 Quiet NANs
0 111.1111.1 100.0000.0000.0000.0000.0000 Indeterminate
0 111.1111.1 0xx.xxxx.xxxx.xxxx.xxxx.xxxx Signaling NANs (at least one x nonzero)
0 111.1111.1 000.0000.0000.0000.0000.0000 Infinity
0 111.1111.0 111.1111.1111.1111.1111.1111 3.402×10^38
0 100.0000.1 000.0000.0000.0000.0000.0000 4.0
0 100.0000.0 100.0000.0000.0000.0000.0000 3.0
0 100.0000.0 000.0000.0000.0000.0000.0000 2.0
0 011.1111.1 000.0000.0000.0000.0000.0000 1.0
0 011.1111.0 000.0000.0000.0000.0000.0000 0.5
0 000.0000.1 000.0000.0000.0000.0000.0000 1.175×10^-38 (Smallest normalized value)
0 000.0000.0 111.1111.1111.1111.1111.1111 1.175×10^-38 (Largest denormalized value)
0 000.0000.0 000.0000.0000.0000.0000.0001 1.401×10^-45 (Smallest denormalized value)
0 000.0000.0 000.0000.0000.0000.0000.0000 0
1 000.0000.0 000.0000.0000.0000.0000.0000 -0
1 000.0000.0 000.0000.0000.0000.0000.0001 -1.401×10^-45 (Smallest denormalized value)
1 000.0000.0 111.1111.1111.1111.1111.1111 -1.175×10^-38 (Largest denormalized value)
1 000.0000.1 000.0000.0000.0000.0000.0000 -1.175×10^-38 (Smallest normalized value)
1 011.1111.0 000.0000.0000.0000.0000.0000 -0.5
1 011.1111.1 000.0000.0000.0000.0000.0000 -1.0
1 100.0000.0 000.0000.0000.0000.0000.0000 -2.0
1 100.0000.0 100.0000.0000.0000.0000.0000 -3.0
1 100.0000.1 000.0000.0000.0000.0000.0000 -4.0
1 111.1111.0 111.1111.1111.1111.1111.1111 -3.402×10^38
1 111.1111.1 000.0000.0000.0000.0000.0000 Negative infinity
1 111.1111.1 0xx.xxxx.xxxx.xxxx.xxxx.xxxx Signaling NANs (at least one x nonzero)
1 111.1111.1 100.0000.0000.0000.0000.0000 Indeterminate
1 111.1111.1 111.1111.1111.1111.1111.1111 Quiet NANs
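Rows like these can be regenerated with a few lines of Python (a sketch; struct is used to get at the raw bits, and the dot grouping is omitted for simplicity):

```python
import struct

def single_bits(x):
    """Return (sign, exponent, mantissa) of an IEEE 754 single as bit
    strings, matching the columns of the table above."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    b = format(raw, "032b")
    return b[0], b[1:9], b[9:]

print(single_bits(2.0))    # ('0', '10000000', '00000000000000000000000')
print(single_bits(-3.0))   # ('1', '10000000', '10000000000000000000000')
```

Comparing the output against the 2.0 and -3.0 rows is a quick way to convince yourself that the table is right.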
  


  

IEEE 754r Decimal Formats
   The decimal32, decimal64 and decimal128 formats defined in the proposed revision IEEE 754r are interesting largely because of their innovative packing of 3 decimal digits into 10 binary digits. Decimal formats are still useful because they can store decimal fractions (like 0.01) exactly. Normal BCD (binary-coded decimal) uses 4 binary digits for each decimal digit, wasting about 17% of the information capacity of the bits. The 1000 combinations of 3 decimal digits fit nearly perfectly into the 1024 combinations of 10 binary digits. In addition to the space efficiency, groups of 3 work well for formatting and printing, which typically use a thousands separator (such as "," or a blank space) between groups of 3 digits. At first, however, prospects for easy encoding and decoding seemed bleak. In 1975 Chen and Ho published the first such system, but it had some drawbacks. The Cowlishaw encoding [4], used by IEEE 754r, is remarkable because it manages to achieve all of the following desirable goals:
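The 17% figure and the near-perfect fit of 1000 into 1024 can be checked directly. This sketch covers only the capacity argument; it says nothing about the actual Chen-Ho or Cowlishaw bit layouts, whose point is that each digit can also be unpacked with a few gates of logic:

```python
import math

# A decimal digit carries log2(10) ≈ 3.32 bits of information.
info = math.log2(10)

bcd_waste    = 1 - info / 4          # plain BCD: 4 bits per digit
declet_waste = 1 - 3 * info / 10     # 3 digits packed into 10 bits

print(round(bcd_waste * 100, 1))     # 17.0  (percent: the "about 17%")
print(round(declet_waste * 100, 2))  # 0.34  (the "nearly perfect" fit)
```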
  




  


   Footnotes and Sources
  
1 : http://http.cs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf W. Kahan, "Why do we need a floating-point arithmetic standard?", 1981
  
2 : http://http.cs.berkeley.edu/~wkahan/ieee754status/Names.pdf W. Kahan, "Names for Standardized Floating-Point Formats", 2002 (work in progress)
  
3 : http://754r.ucbtest.org/ "Some Proposals for Revising ANSI/IEEE Std 754-1985"
  
4 : http://www2.hursley.ibm.com/decimal/DPDecimal.html "A Summary of Densely Packed Decimal encoding" (web page)
  
5 : http://www3.sk.sympatico.ca/jbayko/cpu1.html
  
6 : http://twins.pmf.ukim.edu.mk/predava/DSM/procesor/float.htm
  
7 : http://www.cc.gatech.edu/gvu/people/randy.carpenter/folklore/v5n2.html
  
8 : http://www.usm.uni-muenchen.de/people/puls/f77to90/cray.html
  
9 : http://www.usm.uni-muenchen.de/people/puls/f77to90/alpha.html
  
10 : http://www.research.ibm.com/journal/rd/435/schwarz.html and http://www.research.ibm.com/journal/rd/435/schwa1.gif
  
11 : http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html
  
12 : One source gave 8^31 as the range for the Burroughs B5500. (I forgot to save my source for this. I have sources for other Burroughs systems, giving 8^76 as the highest value (and 8^-50 as the lowest, for a field width of 7 bits); and Cowlishaw [13] describes its (unrelated) decimal floating-point format.) I might have inferred it from http://www.cs.science.cmu.ac.th/panutson/433.htm which only gives a field width of 6 bits, and no bias. The Burroughs 5000 manual says the mantissa is 39 bits, but does not talk about exponent range. Did some models have a 6-bit exponent field? Since these are the folks who simplified things by storing all integers as floating-point numbers with an exponent of 0 [17], I suspect anything is possible.
  
13 : http://www2.hursley.ibm.com/decimal/IEEE-cowlishaw-arith16.pdf Michael F. Cowlishaw, "Decimal Floating-Point: Algorism for Computers", Proceedings of the 16th IEEE Symposium on Computer Arithmetic, 2003; ISSN 1063-6889/03
  
14 : http://research.microsoft.com/users/GBell/Computer_Structures_Principles_and_Examples/csp0146.htm D. Siewiorek, C. Gordon Bell and Allen Newell, "Computer Structures: Principles and Examples", 1982, p. 130
  
15 : http://research.microsoft.com/~gbell/Computer_Structures__Readings_and_Examples/00000612.htm C. Gordon Bell and Allen Newell, "Computer Structures: Readings and Examples", 1971, p. 592
  
16 : http://www.csit.fsu.edu/~burkardt/f_src/slap/slap.f90 FORTRAN-90 implementation of a linear algebra package, which curiously begins with a table of machine floating-point register parameters for lots of old mainframes. See also http://interval.louisiana.edu/pub/intervalmath/Fortran90_software/d1i1mach.for , which gives actual binary values of the smallest and largest values for many systems.
  
17 : http://grouper.ieee.org/groups/754/meeting-minutes/02-04-18.html Includes this brief description of the key design feature of the Burroughs B5500: "ints and floats with the same value have the same strings in registers and memory. The octal point at the right, zero exponent." This shows why the exponent range is quoted as 8^-50 (or 8^-51) to 8^76: The exponent ranged from 8^-63 to 8^63, and the (for floating-point, always normalized) 13-digit mantissa held any value from 8^12 up to nearly 8^13, shifting both ends of the range up by that amount.
  
18 : http://www.inwap.com/pdp10/hbaker/pdp-10/Floating-Point.html
  
19 : http://nssdc.gsfc.nasa.gov/nssdc/formats/PDP-11.htm
  
20 : This format would be easy to implement on an 8-bit microprocessor. It has the sign and exponent in one byte, plus a 16-bit mantissa with an explicit leading 1 bit (if the leading 1 is hidden/implied, we get twice the range). With only 4-5 decimal digits it isn't too useful, but it's what you could expect to see on a really small early home computer.

© 1996-2004 Robert P. Munafo.
 