Vocademy

Floating Point Numbers

What about decimal fractions? That's fairly simple. Extra bits are added to represent the position of the decimal point. The only problem is that the numbers to the right of a decimal point represent a decimal fraction. This has to be converted to a binary fraction. For example, let's take the number 0.5. The binary equivalent of 5 is 101. However, 0.101 doesn't represent 1/2 in binary as 0.5 does in decimal. The quantity of 1/2 is represented by 0.1 in binary notation. Therefore, the number 1.5 in decimal becomes 1.1 in binary. Likewise, 1.25 becomes 1.01, and 1.125 becomes 1.001. Because these are binary numbers, we can't call the point a decimal point. It becomes a binary point.
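The conversion described above can be sketched in Python by repeatedly doubling the fractional part and peeling off the digit that crosses the point. The function name fraction_to_binary and its max_bits parameter are made up for this illustration, and the sketch only handles non-negative numbers.

```python
def fraction_to_binary(value, max_bits=24):
    """Return the binary notation of a non-negative number as a string."""
    whole = int(value)
    frac = value - whole
    digits = []
    # Doubling moves the next binary digit to the left of the point.
    while frac and len(digits) < max_bits:
        frac *= 2
        bit = int(frac)
        digits.append(str(bit))
        frac -= bit
    result = bin(whole)[2:]
    if digits:
        result += "." + "".join(digits)
    return result


print(fraction_to_binary(0.5))    # 0.1
print(fraction_to_binary(1.5))    # 1.1
print(fraction_to_binary(1.25))   # 1.01
print(fraction_to_binary(1.125))  # 1.001
```

The max_bits limit is there because many decimal fractions never terminate in binary, so the loop needs a stopping point.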

Now, moving the binary point to the left divides the number by two, and moving it to the right multiplies the number by two. This is equivalent to moving a decimal point left and right, which divides and multiplies a decimal number by 10.
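As a small illustration of that rule, Python's math.ldexp(x, n) computes x times 2 to the power n, which amounts to moving the binary point n places (right for positive n, left for negative n). The starting value 10.01 is just an example chosen here.

```python
import math

x = 0b1001 / 2**2           # binary 10.01, which is decimal 2.25

right = math.ldexp(x, 1)    # point moved one place right: binary 100.1
left = math.ldexp(x, -1)    # point moved one place left:  binary 1.001

print(right)  # 4.5
print(left)   # 1.125
```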

Now let's make a floating point binary number from the decimal number 4.125. Using an 8-bit binary number, 4.125 is 0100.0010. How do we represent that binary point? A binary system can represent two things. With binary numbers those two things are 1 and 0. Adding a binary point makes three things. There is no facility to represent the binary point, so we need to use extra bits to say where to put it. Let's just add an extra byte for that. Let's use one byte for the bits representing the number in question (the significand[1]) without the binary point, and another representing how many places from the right to place the binary point. In this example, the two bytes could be 01000010 00000100.

01000010   00000100

Left byte: bits representing the significand. Right byte: bits saying to place the binary point four places from the right.

The decimal number 4.125, or binary 0100.0010, in floating-point binary notation. The left byte represents the bits of the number 4.125, and the right byte says to put the binary point four places from the right.

In practice, the byte representing the position of the binary point precedes the byte representing the number, so let's swap the bytes to follow that convention. This would be seen as 00000100 01000010.
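The two-byte example format can be sketched as follows. The names encode_toy and decode_toy are hypothetical helpers for this toy format only; they assume the position byte counts places from the right end of the significand, as in the 4.125 example.

```python
def encode_toy(value, point_pos):
    """Return (position byte, significand byte) for the toy format."""
    # Shifting the point point_pos places right turns the value into an integer.
    significand = round(value * 2**point_pos)
    if not 0 <= significand <= 0xFF:
        raise ValueError("value does not fit in one significand byte")
    return point_pos, significand


def decode_toy(point_pos, significand):
    """Put the binary point back point_pos places from the right."""
    return significand / 2**point_pos


pos, sig = encode_toy(4.125, 4)
print(f"{pos:08b} {sig:08b}")  # 00000100 01000010
print(decode_toy(pos, sig))    # 4.125
```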

In practice, floating point numbers are represented in base-2 scientific notation. If for no other reason, this removes leading zeros that just use up precious bits that could be representing digits. For example, using a 32-bit version of our example format, 4.125 becomes 00010100 00000000 00000000 01000010. This uses 24 bits to represent the significand, with the first eight bits (decimal 20) saying to place the binary point 20 places from the left end of the significand, which is the same spot as four places from the right. Using binary scientific notation, we remove all the zeros to the left of the most significant non-zero bit, then we assume that the binary point sits immediately to the right of that bit. Now our number is represented with 00000010 10000100 00000000 00000000. The binary point is assumed to be after the first bit of the significand, so this specific series of bits says the number being represented is 1.00001 (ignoring trailing zeros) but with the binary point moved two places to the right, making the actual number 100.001.

00000010 10000100 00000000 00000000

4.125 represented in binary scientific notation. The first byte (green) represents how many positions to move the binary point. The following 24 bits represent 1.00001. The first byte is +2, saying to move the binary point two places to the right (a negative value would move it to the left). This makes the final number 100.001.
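The normalization step can be checked with Python's math.frexp, which splits a number into a significand and a power of two. frexp returns a significand between 0.5 and 1, so the sketch below rescales it to the 1.xxxx form used here. The helper name normalize is made up for this example, and it assumes a positive, non-zero input.

```python
import math


def normalize(value):
    """Return (significand, exponent) with 1 <= significand < 2."""
    m, e = math.frexp(value)   # value == m * 2**e, with 0.5 <= m < 1
    return m * 2, e - 1        # rescale so the leading bit is left of the point


sig, exp = normalize(4.125)
print(sig, exp)  # 1.03125 2
```

Decimal 1.03125 is binary 1.00001, and the exponent 2 says the binary point moves two places to the right.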

This method works perfectly if the fraction can be written with a power of two as its denominator (1/2, 1/4, 3/8, and so on). Other decimal fractions produce binary fractions that never terminate. For example, 0.23 begins 0.001110101110000101000 in binary (the first 21 fractional bits) and then continues in a repeating pattern, so no finite number of bits can hold the complete number. Therefore, converting decimal fractions into binary fractions can result in rounding errors. This can be avoided by using binary-coded decimal (discussed later), but most systems represent fractions as binary fractions.
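The rounding can be seen directly in Python, whose float type is a 64-bit binary fraction on typical systems. Fraction(x) gives the exact value a float actually holds, so comparing it with the intended decimal fraction shows whether anything was lost. The variable names are only for illustration.

```python
from fractions import Fraction

exact = Fraction(4.125)    # 4.125 is a sum of powers of two, so nothing is lost
stored = Fraction(0.23)    # the binary fraction actually stored for 0.23

print(exact)                          # 33/8
print(stored == Fraction(23, 100))    # False
print(0.1 + 0.2 == 0.3)               # False
```

The stored value always has a power of two as its denominator, which is why 23/100 can only be approximated.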

IEEE 754

Finally, let's look at a real-world format for representing floating point binary numbers. This is the IEEE 754 standard that is used in most computer systems. Let's examine the 32-bit version of this standard.

IEEE 754 stores the significand as an unsigned magnitude, so a separate bit, the first bit, represents the sign of the number. This is followed by the exponent in excess-127 notation, then the significand. Since the number is represented in base-2 scientific notation, and the first bit of any significand (except zero) in base-2 scientific notation is always 1, the leading one is not stored and is just assumed to exist. This frees an extra bit for the significand: 23 stored bits give 24 bits of precision.

Let's look at the number 4.125 represented as a 32-bit number in the IEEE 754 format.

0 10000001 00001000000000000000000

4.125, represented in the IEEE 754 format.

The first bit (red) represents the sign of the significand. Zero is positive and one is negative.

The next 8 bits (green) represent the exponent or the location of the binary point expressed in excess 127 notation (some math operations produce ambiguous results with excess 128 notation). Here the actual number is 129, which is 127 + 2, meaning to move the binary point two places to the right of its assumed location.

The final 23 bits represent the significand without the leading one or the binary point. The actual number is 1.00001000000000000000000.

The exponent says to move the binary point two places to the right, so the final number is 100.001 (ignoring the trailing zeros), which represents decimal 4.125.
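The three fields can be pulled out of a real 32-bit float with Python's struct module. This is a minimal sketch: the function name float32_fields is made up here, and the reconstruction at the end only covers ordinary (normal) numbers, not zero, subnormals, infinities, or NaN.

```python
import struct


def float32_fields(value):
    """Return (sign, exponent, fraction) of value as an IEEE 754 32-bit float."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, excess 127
    fraction = bits & 0x7FFFFF        # 23 bits, leading one not stored
    return sign, exponent, fraction


sign, exponent, fraction = float32_fields(4.125)
print(f"{sign:01b} {exponent:08b} {fraction:023b}")
# 0 10000001 00001000000000000000000

# Restore the assumed leading one, then move the binary point.
value = (-1) ** sign * (1 + fraction / 2**23) * 2 ** (exponent - 127)
print(exponent - 127, value)  # 2 4.125
```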

—————————
[1] The significand may also be called the mantissa, fraction, argument, or coefficient. William Kahan, the primary architect of the IEEE 754 standard, prefers significand.