Survey of Floating-Point Formats

   This page gives a very brief summary of floating-point formats that have been used over the years. Most have been implemented in hardware and/or software and used for real work; a few (notably the small ones at the beginning) are just for lecture and homework examples. They are listed in order of increasing range (a function of exponent size) rather than by precision or chronologically.
  
range (overflow value) | precision | bits | B | We | Wm | what
14 | 0.6 | 6 | 2 | 3 | 2 | Used in university courses[21,22]
240 | 0.9 | 8 | 2 | 4 | 3 | Used in university courses[21,22]
65504 = 2^15×(2-2^-10) | 3.3 | 16 | 2 | 5 | 10 | 2-byte excess-15[24,25,27], nVidia NV3x GPUs. Also called "half", "s10e5" or "fp16"; largest minifloat. Can approximate any 16-bit unsigned integer or its reciprocal to 3 decimal places.
2.81×10^14 = 8^16 | 3.6 | 18 | 8 | 5 | 12 | excess-15 octal, 4-digit mantissa. A fairly decent radix-8 format in an 18-bit PDP-10 halfword
9.22×10^18 = 2^(2^6-1) | 4.8 | 24 | 2 | 7 | 16 | 3-byte excess-63[17], ATI R3x0 and Rv350 GPUs. Also called "s16e7" or "fp24".
1.84×10^19 = 2^(2^6) | 6.9 | 30 | 2 | 7 | 23 | AMD 9511 (1979)[5]
9.90×10^27 = 8^31 | 5.1 | 24 | 8 | 6 | 17 | Octal excess-32[12]
1.70×10^38 = 2^(2^7-1) | 8.1 | 36 | 2 | 8 | 27 | Digital PDP-10[1,18], VAX (F and D formats)[1]; Honeywell 600, 6000[1,16]; Univac 110x single[1]; IBM 709x, 704x[1]
3.40×10^38 = 2^(2^7) | 7.2 | 32 | 2 | 8 | 1+23 | IEEE 754 single
3.40×10^38 = 2^(2^7) | 7.2 | 32 | 2 | 8 | 1+23 | Digital PDP-11[19,6], VAX
9.99×10^49 = 10^(10^2/2) | 8.0 | 44 | 10 | 2d | 8d | Burroughs B220[7]
4.31×10^68 = 8^76 | 11.7 | ? | 8 | 7 | 39 | Burroughs 5700, 6700, 7700 single[1,14,16,17]
7.24×10^75 = 16^63 | 7.2 | 32 | 16 | 7 | 24 | IBM 360, 370[6]; Amdahl[1]; DG Eclipse M/600[1]
7.24×10^75 = 16^63 | 16.8 | 64 | 16 | 7 | 56 | IBM 360 double[15]
5.79×10^76 = 2^255 | 7.2 | ? | 2 | 9 | 24 | Burroughs 1700 single[16]
1.16×10^77 = 16^64 | 7.2 | 32 | 16 | 7 | 24 | HP 3000[1]
9.99×10^96 = 10^(3×2^5+1) | 7.0 | 32 | 10 | 8- | 7d | IEEE 754r decimal32[3,4]
9.99×10^99 = 10^(10^2) | 10.0 | ? | 10 | 2d | 10d | Most scientific calculators
4.9×10^114 = 8^127 | 12.0 | 48 | 8 | 8 | 40 | Burroughs 7700[6]
8.9×10^307 = 2^(2^10-1) | 14.7 | 60 | 2 | 11 | 1+48 | CDC 6000, 6600[6], 7000 CYBER
8.9×10^307 = 2^(2^10-1) | ? | ? | ? | ? | ? | DEC VAX G format; Univac 110x double[1]
1.8×10^308 = 2^(2^10) | 15.9 | 64 | 2 | 11 | 1+52 | IEEE 754 double
1.27×10^322 = 2^1070 | ? | ? | ? | ? | ? | CDC 6x00, 7x00, Cyber[1]
9.99×10^384 = 10^(3×2^7+1) | 16.0 | 64 | 10 | 10- | 16d | IEEE 754r decimal64[3,4]
9.99×10^499 = 10^(10^3/2) | 12.0 | ? | 10 | 3d | 12d | HP 71B[13], 85[1] calculators
9.99×10^999 = 10^(10^3) | 12.0 | ? | 10 | 3d | 12d | Texas Instruments 85, 92 calculators
9.99×10^999 = 10^(10^3) | 14.0 | ? | 10 | 3d | 14d | Texas Instruments 89 calculator[13]
9.99×10^999 = 10^(10^3) | 17.0 | 82 | 10 | 3d | 17d | 68881 Packed Decimal Real (3 BCD digits for exponent, 17 for mantissa, and two sign bits)
1.4×10^2465 = 2^(2^13-3) | 7.2 | 38? | 2 | 14 | 24 | Cray C90 half[8]
1.4×10^2465 = 2^(2^13-3) | 14.1 | 61? | 2 | 14 | 47 | Cray C90 single[8]
1.4×10^2465 = 2^(2^13-3) | 28.8 | 110? | 2 | 14 | 96 | Cray C90 double[8]
1.1×10^2466 = 2^(2^13) | ? | ? | ? | ? | ? | Cray 1[1]
5.9×10^4931 = 2^(2^14-1) | ? | ? | ? | ? | ? | DEC VAX H format[1]
1.2×10^4932 = 2^(2^14) | 19.2 | 80 | 2 | 15 | 64 | The minimum IEEE 754 double extended size (Pentium; HP/Intel Itanium; Motorola 68040, 68881, 88110)
1.2×10^4932 = 2^(2^14) | 34.0 | 128 | 2 | 15 | 1+112 | IEEE 754r quad[2,3] (DEC Alpha[9]; IBM S/390 G5[10])
9.99×10^6144 = 10^(3×2^11+1) | 34.0 | 128 | 10 | 14- | 34d | IEEE 754r decimal128[3,4]
5.2×10^9824 = 2^(2^15-131) | 16.0 | ? | 2 | 16 | 47 | PRIME 50[16]
1.9×10^29603 = 8^(2^15+12) | ? | ? | 8 | 16 | ? | Burroughs 6700, 7700 double[1,16]
4.3×10^2525222 = 2^(2^23) | ? | ? | 2 | 24 | ? | PARI
1.4×10^323228010 = 2^(2^30-1616) | ? | ? | 2 | 31 | ? | Mathematica®
≅10^2147483646 = 10^(2^31-2) | ? | ? | ? | ? | ? | Maple®


   Legend:
   B : Base of exponent. This is the factor by which your floating-point number gets multiplied if you raise its exponent by 1. Modern formats like IEEE 754 all use base 2, so B is 2, and increasing the exponent field by 1 amounts to multiplying the number by 2. Older formats used base 8, 10 or 16.
   We : Width of exponent. If B is 2, 8 or 16, this is the number of bits (binary digits) in the exponent field. For the specific case of B=2, We is equal to K+1 in the equation 1-2^K < e < 2^K specifying the bounds of the (excess-encoded) exponent e in an IEEE 754 representation (see below). When B is 10, there are two cases: "2d" indicates an exponent stored as (two) base-10 digits, and the letter d is included to make this clear; "8-" indicates an IEEE binary decimal format, using 2 bits in the combination field and 6 bits in the following exponent field, which together can hold only 3/4 of the values such a width would imply (because the high 2 bits cannot both be 1), thus the legal values are e such that 0 ≤ e < 3×2^6.
   Wm : Width of mantissa. For binary formats with "hidden" or "implied" leading mantissa bits, this is given as "1+N", such as "1+23": the "1+" refers to the implied leading 1 bit, which together with the 23 actual bits gives a total of 24 bits of precision. For decimal formats the letter "d" is shown to make it clear the precision is in decimal digits.
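   To make these columns concrete, here is a small Python sketch (my own illustration, not from any of the cited sources) deriving the precision column from Wm and B, and the overflow value from We for a simple binary format with half of the exponent values covering magnitudes above 1:

    import math

    def decimal_precision(wm, base=2):
        # A Wm-digit base-B mantissa carries about Wm*log10(B) decimal digits.
        return wm * math.log10(base)

    def overflow_value(we, base=2):
        # Raising the exponent field by 1 multiplies the value by B;
        # an exponent of 2^(We-1) overflows in an IEEE-style split.
        return float(base) ** (2 ** (we - 1))

    # IEEE 754 single: Wm = 1+23 = 24, We = 8
    print(round(decimal_precision(24), 1))   # 7.2 decimal digits
    print(overflow_value(8))                 # 3.4e+38, i.e. 2^(2^7)
    # IBM 360 double: Wm = 56 bits
    print(round(decimal_precision(56), 1))   # 16.9 (the table truncates to 16.8)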
  

IEEE 754 Single Representation
   This is worth describing in a bit more detail because it is so prevalent in the hardware used today, and it is probably what you'll be looking at when you try to decipher a floating-point value from its "raw binary".
   First a warning: Although the "normal" values are what you see when your program is working with real data, proper handling of the rest of the values (denorms, NANs, etc.) is vitally important; otherwise you'll get all sorts of horrible results that are difficult to understand, and usually impossible to fix.
   So, for the normal values (which in this case means excluding the zeros, denorms, NANs, and infinities) the value being represented can be expressed in the following form:
  
value = s × 2^(k+1-N) × n

where the sign s is -1 or 1, and k and n are integers that fall within the ranges given by:
  
2-2^K ≤ k ≤ 2^K-1    and    2^(N-1)-1 < n < 2^N

for two integers K and N. If you look at the range of k and n you can see that k can have exactly 2^(K+1)-2 values and n can have exactly 2^(N-1) values, and therefore exactly K+1 bits can be used to store the exponent (including two unused values discussed below) and N-1 bits to store the mantissa. To give a specific example, for IEEE 754 single precision, as the above table shows there are We=8 bits for the exponent and Wm=23 bits for the mantissa, so K is 7 and N is 24.
   The exponent is stored in "excess 2^K-1" format, which means the binary value you see is 2^K-1 bigger than the actual value of k being represented. For example, when K is 7, if the value 254 is seen, k is 127, and the value being represented is s × 2^(128-N) × n. This is only true for the normal values just described, not for denorms.
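   To see this concretely, here is a short Python sketch (mine, for illustration) that decodes a normal single-precision bit pattern using the s × 2^(k+1-N) × n form:

    import struct

    def decode_normal_single(pattern):
        # Decode a 32-bit pattern as value = s * 2^(k+1-N) * n, with
        # K = 7 and N = 24 as described above. Normal values only.
        K, N = 7, 24
        s = -1 if pattern >> 31 else 1
        stored = (pattern >> 23) & 0xFF    # the excess-(2^K - 1) exponent field
        assert 0 < stored < 255, "zero, denorm, infinity or NAN"
        k = stored - (2**K - 1)            # k = stored - 127
        n = 2**(N - 1) + (pattern & 0x7FFFFF)  # put back the hidden leading 1 bit
        return s * 2.0**(k + 1 - N) * n

    print(decode_normal_single(0x40490FDB))                   # 3.1415927... (pi)
    print(struct.unpack('>f', bytes.fromhex('40490FDB'))[0])  # cross-check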
  

The next set of values to understand is the denormalized values (or "denorms"), very small values for which
  
k = 2-2^K    and    0 < n < 2^(N-1)

using the same definitions as above. These values use one of the "unused" exponent values, namely the one that is all 0 bits. They are very important because they make underflow work better: instead of jumping suddenly to 0, you lose precision gradually as you go towards 0.
   In addition to making the underflow case a little less severe by losing precision gradually instead of suddenly, denormalized values eliminate a lot of strange bugs that would otherwise occur. For example, without denorms the tests "if x>y" and "if x-y>0" could yield different results for the same x and y.
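   Here is a small Python demonstration of gradual underflow at the bottom of the single-precision range (Python floats are doubles, so the example rounds through single precision with struct):

    import struct

    def single_bits(x):
        # Round x to single precision and return the raw 32-bit pattern.
        return struct.unpack('>I', struct.pack('>f', x))[0]

    x = 1.5 * 2.0**-126        # a normal number just above the denorm range
    y = 1.0 * 2.0**-126        # the smallest normalized single
    d = x - y                  # 2^-127, representable only as a denorm
    print(hex(single_bits(d))) # 0x400000: exponent field all 0 bits, a denorm
    print(x > y, d > 0)        # True True -- the two tests agree, thanks
                               # to gradual underflow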
  

All of the various values are arranged in such a way that hardware or software can perform comparisons treating the data as signed-magnitude integers, and as long as neither argument is a NAN the proper answer will result. Such comparisons even properly handle the infinities and negative zero. (A signed-magnitude integer is a sign bit followed by an unsigned expression of its magnitude — this is not the normal signed integer format, which is called "2's complement signed integer". As with floats, there are two ways to express 0 as a signed-magnitude integer.)
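   The following Python sketch illustrates the idea; it maps each bit pattern to an unsigned integer with the same ordering, which is equivalent to the signed-magnitude comparison just described:

    import struct

    def sort_key(x):
        # Map a single-precision bit pattern to an unsigned integer that
        # orders the same way as the floats themselves (NANs excluded):
        # negative values get all bits flipped, positive values just the
        # sign bit, turning signed-magnitude order into unsigned order.
        b = struct.unpack('>I', struct.pack('>f', x))[0]
        return b ^ 0xFFFFFFFF if b & 0x80000000 else b | 0x80000000

    vals = [3.0, -0.5, float('inf'), -3.0, 0.0, -0.0, 1e-40]   # 1e-40: a denorm
    print(sorted(vals, key=sort_key))
    # [-3.0, -0.5, -0.0, 0.0, 1e-40, 3.0, inf]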
  

Here are some sample values with their binary representation. The binary digits are broken into groups of 4 to help with interpreting a value in hexadecimal. They are shown in order from largest to smallest, with the non-numbers in the places they would fall if they were sorted by their bit patterns.
  
s exponent mantissa value(s)
0 111.1111.1 111.1111.1111.1111.1111.1111 Quiet NANs
0 111.1111.1 100.0000.0000.0000.0000.0000 Indeterminate
0 111.1111.1 0xx.xxxx.xxxx.xxxx.xxxx.xxxx Signaling NANs
0 111.1111.1 000.0000.0000.0000.0000.0000 Infinity
0 111.1111.0 111.1111.1111.1111.1111.1111 3.402×10^38
0 100.0000.1 000.0000.0000.0000.0000.0000 4.0
0 100.0000.0 100.0000.0000.0000.0000.0000 3.0
0 100.0000.0 000.0000.0000.0000.0000.0000 2.0
0 011.1111.1 000.0000.0000.0000.0000.0000 1.0
0 011.1111.0 000.0000.0000.0000.0000.0000 0.5
0 000.0000.1 000.0000.0000.0000.0000.0000 1.175×10^-38 (Smallest normalized value)
0 000.0000.0 111.1111.1111.1111.1111.1111 1.175×10^-38 (Largest denormalized value)
0 000.0000.0 000.0000.0000.0000.0000.0001 1.401×10^-45 (Smallest denormalized value)
0 000.0000.0 000.0000.0000.0000.0000.0000 0
1 000.0000.0 000.0000.0000.0000.0000.0000 -0
1 000.0000.0 000.0000.0000.0000.0000.0001 -1.401×10^-45 (Smallest denormalized value)
1 000.0000.0 111.1111.1111.1111.1111.1111 -1.175×10^-38 (Largest denormalized value)
1 000.0000.1 000.0000.0000.0000.0000.0000 -1.175×10^-38 (Smallest normalized value)
1 011.1111.0 000.0000.0000.0000.0000.0000 -0.5
1 011.1111.1 000.0000.0000.0000.0000.0000 -1.0
1 100.0000.0 000.0000.0000.0000.0000.0000 -2.0
1 100.0000.0 100.0000.0000.0000.0000.0000 -3.0
1 100.0000.1 000.0000.0000.0000.0000.0000 -4.0
1 111.1111.0 111.1111.1111.1111.1111.1111 -3.402×10^38
1 111.1111.1 000.0000.0000.0000.0000.0000 Negative infinity
1 111.1111.1 0xx.xxxx.xxxx.xxxx.xxxx.xxxx Signaling NANs
1 111.1111.1 100.0000.0000.0000.0000.0000 Indeterminate
1 111.1111.1 111.1111.1111.1111.1111.1111 Quiet NANs
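   You can regenerate rows of this table with a few lines of Python (the dotted grouping of the fields is omitted for brevity):

    import struct

    def fields(x):
        # Break a single-precision pattern into sign, exponent and mantissa.
        b = struct.unpack('>I', struct.pack('>f', x))[0]
        return f"{b >> 31} {(b >> 23) & 0xFF:08b} {b & 0x7FFFFF:023b}"

    for v in [4.0, 3.0, 2.0, 1.0, 0.5, 2.0**-126, float('inf'), 0.0]:
        print(fields(v), v)
    # e.g. "0 10000001 00000000000000000000000 4.0" matches the 4.0 row above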
  


  

IEEE 754r Decimal Formats
   The decimal32, decimal64 and decimal128 formats defined in the proposed standard IEEE 754r are interesting largely because of their innovative packing of 3 decimal digits into 10 binary digits. Decimal formats are still useful because they can store decimal fractions (like 0.01) exactly. Normal BCD (binary-coded decimal) uses 4 binary digits for each decimal digit, wasting about 17% of the information capacity of the bits; by contrast, the 1000 combinations of 3 decimal digits fit nearly perfectly into the 1024 combinations of 10 binary digits. In addition to the space efficiency, groups of 3 work well for formatting and printing, which typically use a thousands separator (such as "," or a blank space) between groups of 3 digits. At first glance, though, the prospects for easy encoding and decoding seem bleak. In 1975 Chen and Ho published the first such system, but it had some drawbacks. The Cowlishaw encoding[4], used by IEEE 754r, is remarkable because it manages to achieve all of the following desirable goals:

 - it packs 3 decimal digits into 10 bits, leaving only 24 of the 1024 available bit patterns unused;
 - encoding and decoding require only a few simple Boolean operations (two or three gate delays), with no arithmetic;
 - the values 0 through 79 are encoded exactly as they would be in ordinary BCD.
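   To make the packing concrete, here is a Python sketch of the digit-to-bit mapping, transcribed from the published DPD encoding table[4] (the names follow the convention that a, e, i are the high BCD bits of the three digits); decoding is done here by brute-force inversion rather than by the Boolean logic a hardware implementation would use:

    def bcd_bits(d):
        # The four BCD bits of one digit, high bit first.
        return (d >> 3) & 1, (d >> 2) & 1, (d >> 1) & 1, d & 1

    def dpd_encode(d2, d1, d0):
        # Pack three decimal digits into 10 bits (Densely Packed Decimal).
        # Digits 8 and 9 are "large": their three high BCD bits are 100,
        # so only the low bit (d, h or m) needs to be kept.
        (a, b, c, d), (e, f, g, h), (i, j, k, m) = map(bcd_bits, (d2, d1, d0))
        table = {
            (0, 0, 0): (b, c, d, f, g, h, 0, j, k, m),   # all digits small
            (0, 0, 1): (b, c, d, f, g, h, 1, 0, 0, m),   # d0 large
            (0, 1, 0): (b, c, d, j, k, h, 1, 0, 1, m),   # d1 large
            (1, 0, 0): (j, k, d, f, g, h, 1, 1, 0, m),   # d2 large
            (1, 1, 0): (j, k, d, 0, 0, h, 1, 1, 1, m),   # d2, d1 large
            (1, 0, 1): (f, g, d, 0, 1, h, 1, 1, 1, m),   # d2, d0 large
            (0, 1, 1): (b, c, d, 1, 0, h, 1, 1, 1, m),   # d1, d0 large
            (1, 1, 1): (0, 0, d, 1, 1, h, 1, 1, 1, m),   # all digits large
        }
        bits = 0
        for bit in table[(a, e, i)]:
            bits = (bits << 1) | bit
        return bits

    # Brute-force inverse: the canonical encodings of all 1000 triples.
    DECODE = {dpd_encode(x, y, z): (x, y, z)
              for x in range(10) for y in range(10) for z in range(10)}
    assert len(DECODE) == 1000      # one-to-one; 24 of 1024 codes unused

    # Values 0 through 79 encode exactly as plain BCD:
    assert all(dpd_encode(0, v // 10, v % 10) == v // 10 * 16 + v % 10
               for v in range(80))
    print(dpd_encode(9, 9, 9), bin(dpd_encode(9, 9, 9)))   # 255 0b11111111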
  




  


   Minifloats and Microfloats: Excessively Small Floating-Point Formats
   Although they do not have much practical value as a universal format for computation, very small floating-point formats are of interest for other reasons.
   I refer to a format using 16 bits or less as a minifloat. Of these, the most popular by far is 1.5.10, a format using 1 sign bit, a 5-bit excess-15 exponent, 10 mantissa bits (with an implied 1 bit) and all the standard IEEE rules. The smallest and largest positive representable values are 5.96×10^-8 and 65504.
  

s exponent mantissa value(s)
0 111.11 xx.xxxx.xxxx various NANs
0 111.11 00.0000.0000 Infinity
0 111.10 11.1111.1111 65504 (Largest finite value)
0 100.11 10.1100.0000 27.0
0 100.01 11.0000.0000 7.0
0 100.00 10.0000.0000 3.0
0 011.11 00.0000.0000 1.0
0 011.10 00.0000.0000 0.5
0 000.01 00.0000.0000 6.104×10^-5 (Smallest normalized value)
0 000.00 11.1111.1111 6.098×10^-5 (Largest denormalized value)
0 000.00 00.0000.0001 6×10^-8 (Smallest denormalized value)
0 000.00 00.0000.0000 0
1 011.11 00.0000.0000 -1.0 (other negative values are analogous)
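   Here is a Python sketch of a decoder for this format following the standard IEEE rules described above; the three test patterns reproduce rows of the table:

    def decode_half(pattern):
        # Decode a 1.5.10 bit pattern: excess-15 exponent, hidden leading
        # 1 bit, denorms in exponent field 0, Inf/NAN in field 31.
        s = -1.0 if pattern >> 15 else 1.0
        e = (pattern >> 10) & 0x1F
        m = pattern & 0x3FF
        if e == 31:
            return s * float('inf') if m == 0 else float('nan')
        if e == 0:
            return s * m * 2.0**-24            # denorm: 0.m x 2^-14
        return s * (1024 + m) * 2.0**(e - 25)  # normal: 1.m x 2^(e-15)

    print(decode_half(0b0_10011_1011000000))   # 27.0
    print(decode_half(0b0_11110_1111111111))   # 65504.0  (largest finite value)
    print(decode_half(0b0_00000_0000000001))   # 5.96e-08 (smallest denorm)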


This format is supported in hardware by the nVidia GeForce FX and Quadro FX 3D graphics cards (they call it fp16), and is used by Industrial Light and Magic (as part of their OpenEXR standard) and Pixar as the native format for raw rendered output frames (prior to conversion to a compressed format like DVD, HDTV, or imaging on photographic film for exhibition in a theater). When compared to 32-bit floating-point, it presents quite a few advantages beyond the obvious one of requiring half the memory space to store the value. Compared to IEEE binary32, an operation (such as addition or multiplication) takes less than half the time (as measured in gate delays) and uses about 1/4 as many transistors. This is very important when you are expected to perform trillions of such operations to render a frame. nVidia graphics cards implement the fastest floating-point available to consumers, somewhere around 40 billion operations per second, as compared to 12 billion for a 3-GHz Pentium. By the time you read this, ATI (which uses the 24-bit 1.7.16 format) might surpass it, as the two companies are always leapfrogging each other.
   The computer-graphics industry has long recognized the value of floating-point to represent pixels, because a pixel expresses (essentially) a light level. Light levels can vary over a very wide range — for example, the ratio between broad daylight and a clear night under a full moon is 2.512^14 ≅ 400,000. The ratio of brightnesses in nighttime environments with bright lights (such as when driving at night, or in a candlelit room) is similar. Such scenes have "high-contrast" lighting. The human eye can handle this range easily. A standard 8-bit format for pixel values (typically 8 bits for each of the three components red, green and blue) doesn't even come close. Doubling the pixel width to 16 bits produces the 48-bit format (common in the industry) but does little to improve the situation for high-contrast lighting — for pixel values near the bottom of the range, roundoff error is terrible. But using the 1.5.10 float format increases the range to over 10^9, and retains the equivalent of 3 decimal digits of precision over the entire range.
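   A back-of-the-envelope Python comparison (my own figures, using the 400,000:1 contrast ratio mentioned above) shows why fixed-point pixels fail here:

    # Compare worst-case relative quantization error at a "moonlight" level
    # that is 400,000 times dimmer than the brightest representable value.
    bright = 65535.0                 # top of a 16-bit integer pixel scale
    dim = bright / 400000.0          # about 0.16 of one integer step!

    int_rel_error = 0.5 / dim        # half a count, relative to the signal
    half_rel_error = 2.0**-11        # half an ULP of a 1.5.10 float, any scale
    print(f"16-bit integer at the dim level: {int_rel_error:.0%} error")
    print(f"1.5.10 float at any level:       {half_rel_error:.3%} error")
    # 16-bit integer at the dim level: 305% error
    # 1.5.10 float at any level:       0.049% error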
  
A floating-point format using 8 bits or less fits in a byte; I call this a microfloat. These are the best for learning, particularly when you have to convert to/from floating-point using pencil and paper. I am not alone in thinking they are useful as an educational tool for learning about and practicing the implementation of floating-point algorithms — I have found courses at no fewer than 11 colleges and universities that use them in lectures[21,22,23].
   But surprisingly, such small representations even have use in the real world — sort of. Some encodings used for data compression of waveforms and other time-variable analog data (such as the mu-law coding used for audio) very closely resemble a floating-point encoding with a small number of exponent and mantissa bits. Such codes usually store the logarithm of a value, plus a sign, and have a special value for zero. This is not the same as a true floating-point format, but it has a similar range and precision.
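   For example, the continuous mu-law curve used in audio compression (shown here in Python; mu = 255 is the North American telephone standard) is essentially a sign plus a scaled logarithm of the magnitude, just as described:

    import math

    def mu_law(x, mu=255.0):
        # Compress a sample in [-1, 1] to [-1, 1]: sign plus scaled log
        # of the magnitude -- close kin to a tiny floating-point format.
        return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

    for x in (1.0, 0.1, 0.01, 0.001):
        print(f"{x:6} -> {mu_law(x):.3f}")
    # Large signals are spaced logarithmically, like stepping a float's
    # exponent field; very small ones nearly linearly, like denorms.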
   At the extreme bottom end is the 1.2.2 format, using 1 sign bit, 2 exponent bits and 2 mantissa bits (plus an implied leading 1 bit for a mantissa precision of 3 bits). If the exponent is treated as "excess -2", all representable values are integers and the range is -24 to 24. This gives the format just slightly greater range than what you'd get if you treated the 5 bits as a normal two's complement integer (which has a range of -16 to 15). Any smaller format (such as the 4-bit format 1.2.1) has a smaller or equal range when compared to the same number of bits as an integer, so at that point the value of floating-point disappears entirely.
   Here is a table presenting the smaller entries from the main table in a somewhat different format, along with the integer-only formats that bias the exponent so that the smallest denorm is 1.
  

s.e.m | excess | range | comments
1.2.2 | -2 | 1 to 24 | The smallest format worth considering; anything narrower has a range no greater than the same number of bits interpreted as a signed-magnitude integer
1.3.2 | 3 | 0.0625 to 14 | Used in university courses[21,22]
1.4.3 | -3 | 1 to 229376 | About the best compromise for a 1-byte format
1.4.3 | 7 | 0.002 to 240 | Used in university courses[21,22]
1.4.7 | -7 | 1 to 4161536 | One option for 12 bits
1.5.6 | -6 | 1 to 1.35×10^11 | Another option for 12 bits
1.5.10 | -10 | 1 to 2.20×10^12 | Largest unbalanced format; range exceeds 32-bit unsigned
1.5.10 | 15 | 0.000061 to 65504 | 2-byte excess-15[24,25,27], aka "fp16", "s10e5", "half". Can approximate any 16-bit unsigned integer or its reciprocal to 3 decimal places.
1.5.12 | 15 | 1/M to 2.81×10^14 | A fairly decent radix-8 format in an 18-bit PDP-10 halfword
1.7.16 | 63 | 1/M to 9.22×10^18 | 3-byte excess-63[17], aka "fp24", "s16e7"
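   The integer-only rows can be checked with a few lines of Python. One reading that reproduces their ranges (1 to 24 for 1.2.2, 1 to 229376 for 1.4.3, and so on) is to take the mantissa as a plain integer, with no hidden bit, scaled by 2 to the stored exponent field; note that the excess column then equals -Wm, which is exactly what makes the smallest denorm equal 1. Treat this as my reconstruction rather than a definitive reading of these formats:

    def integer_minifloat_range(we, wm):
        # Positive values of a 1.we.wm format read as value = m * 2^e,
        # m = nonzero wm-bit mantissa (no hidden bit), e = stored exponent.
        vals = sorted({m * 2**e for e in range(2**we) for m in range(1, 2**wm)})
        return vals[0], vals[-1], vals

    lo, hi, vals = integer_minifloat_range(2, 2)       # the 1.2.2 format
    print(lo, hi)      # 1 24
    print(vals)        # [1, 2, 3, 4, 6, 8, 12, 16, 24] -- all integers
    print(integer_minifloat_range(4, 3)[:2])           # (1, 229376) for 1.4.3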


  


   Footnotes and Sources
  
1 : http://http.cs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf W. Kahan, "Why do we need a floating-point arithmetic standard?", 1981
  
2 : http://http.cs.berkeley.edu/~wkahan/ieee754status/Names.pdf W. Kahan, "Names for Standardized Floating-Point Formats", 2002 (work in progress)
  
3 : http://754r.ucbtest.org/ "Some Proposals for Revising ANSI/IEEE Std 754-1985"
  
4 : http://www2.hursley.ibm.com/decimal/DPDecimal.html "A Summary of Densely Packed Decimal encoding" (web page)
  
5 : http://www3.sk.sympatico.ca/jbayko/cpu1.html
  
6 : http://twins.pmf.ukim.edu.mk/predava/DSM/procesor/float.htm
  
7 : http://www.cc.gatech.edu/gvu/people/randy.carpenter/folklore/v5n2.html
  
8 : http://www.usm.uni-muenchen.de/people/puls/f77to90/cray.html
  
9 : http://www.usm.uni-muenchen.de/people/puls/f77to90/alpha.html
  
10 : http://www.research.ibm.com/journal/rd/435/schwarz.html and http://www.research.ibm.com/journal/rd/435/schwa1.gif
  
11 : http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html
  
12 : One source gave 8^31 as the range for the Burroughs B5500. (I forgot to save my source for this.) I have sources for other Burroughs systems, giving 8^76 as the highest value (and 8^-50 as the lowest, for a field width of 7 bits); and Cowlishaw[13] describes its (unrelated) decimal floating-point format. I might have inferred it from http://www.cs.science.cmu.ac.th/panutson/433.htm which only gives a field width of 6 bits, and no bias. The Burroughs 5000 manual says the mantissa is 39 bits, but does not talk about exponent range. Did some models have a 6-bit exponent field? Since these are the folks who simplified things by storing all integers as floating-point numbers with an exponent of 0[17], I suspect anything is possible.
  
13 : http://www2.hursley.ibm.com/decimal/IEEE-cowlishaw-arith16.pdf Michael F. Cowlishaw, "Decimal Floating-Point: Algorism for Computers", Proceedings of the 16th IEEE Symposium on Computer Arithmetic, 2003; ISSN 1063-6889/03
  
14 : http://research.microsoft.com/users/GBell/Computer_Structures_Principles_and_Examples/csp0146.htm D. Siewiorek, C. Gordon Bell and Allen Newell, "Computer Structures: Principles and Examples", 1982, p. 130
  
15 : http://research.microsoft.com/~gbell/Computer_Structures__Readings_and_Examples/00000612.htm C. Gordon Bell and Allen Newell, "Computer Structures: Readings and Examples", 1971, p. 592
  
16 : http://www.csit.fsu.edu/~burkardt/f_src/slap/slap.f90 FORTRAN-90 implementation of a linear algebra package, which curiously begins with a table of machine floating-point register parameters for lots of old mainframes. See also http://interval.louisiana.edu/pub/intervalmath/Fortran90_software/d1i1mach.for , which gives actual binary values of the smallest and largest values for many systems.
  
17 : http://grouper.ieee.org/groups/754/meeting-minutes/02-04-18.html Includes this brief description of the key design feature of the Burroughs B5500: "ints and floats with the same value have the same strings in registers and memory. The octal point at the right, zero exponent." This shows why the exponent range is quoted as 8^-50 (or 8^-51) to 8^76: The exponent ranged from 8^-63 to 8^63, and the (for floating-point, always normalized) 13-digit mantissa held any value from 8^12 up to nearly 8^13, shifting both ends of the range up by that amount.
  
18 : http://www.inwap.com/pdp10/hbaker/pdp-10/Floating-Point.html
  
19 : http://nssdc.gsfc.nasa.gov/nssdc/formats/PDP-11.htm
  
20 : This format would be easy to implement on an 8-bit microprocessor. It has the sign and exponent in one byte, followed by a 16-bit mantissa with an explicit leading 1 bit (if the leading 1 is hidden/implied, we get twice the range). With only 4-5 decimal digits it isn't too useful, but it's what you could expect to see on a really small early home computer.
  
21 : http://turing.cs.plymouth.edu/~wjt/Architecture/CS-APP/L05-FloatingPoint.pdf This lecture presentation (or a variation of it) appears at clarkson.edu, plymouth.edu, sc.edu, ucar.edu, umd.edu, umn.edu, utah.edu, utexas.edu and vancouver.wsu.edu. Good discussion of floating-point representations, subnormals, rounding modes and various other issues. Pages 14-16 use the 1.4.3 microfloat format as an example to illustrate in a very concrete way how the subnormals, normals and NANs are related; pages 17-18 use the even smaller 1.3.2 format to show the range of representable values on a number line. Make sure to see page 30 — this alone is worth the effort of downloading and viewing the document!
  
22 : http://www-2.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f98/H3/H3.pdf Homework assignment that uses the microfloat formats 1.4.3 and 1.3.2.
  
23 : http://www.arl.wustl.edu/~lockwood/class/cse306-s04/lecture/l11.html Lecture notes that use the 1.3.5 minifloat format for in-class examples.
  
24 : http://developer.nvidia.com/docs/IO/8230/D3DTutorial1_Shaders.ppt nVidia presentation describing their fp16 format (starting on slide 75).
  
25 : http://developer.nvidia.com/attach/6655 nVidia language specification including definition of fp16 format (page 175).
  
26 : http://www.cs.unc.edu/Events/Conferences/GP2/slides/hanrahan.pdf describes the nVidia GeForce 6800 and ATI Radeon 9800 graphics cards as general-purpose pipelined vector floating-point processors, and shows a rough design for a supercomputer employing 16384 of the GPU chips to achieve a theoretical throughput of 2 petaflops (2×10^15 floating-point operations per second).
  
27 : http://www.digit-life.com/articles2/ps-precision/ This is the only source I have found that describes all of the current hardware standard formats, from IEEE binary128 all the way down to nVIDIA s10e5.
  


© 1996-2004 Robert P. Munafo.