Understanding floating point values in a Delphi environment

Many novice programmers, and even some experienced ones, are unfamiliar with the ins and outs of floating point values and their use, which often leads to unexpected program behaviour or strange output. This article focuses on this topic in order to make you aware of the particular points of interest with regard to handling, storing and maintaining floating point values in Borland Delphi.

The first step is to look at the way binary floating point values (FPV) represent numbers. In fact, there is not one, but a series of different possible representations, depending on the precision you need for your purposes. The most important formats for storing and handling floating point values are the single, double and extended types. These formats are based on the IEEE 754 standard and are directly supported by the CPU's Floating Point Unit (FPU) hardware, which is based on the x87 (i387) architecture. By contrast, the real48 format is not native to the Intel family of processors. It therefore has to be manipulated in software, which makes it considerably slower. Programmers should avoid the non-native, 6-byte wide real48 type(1) as much as possible. If, for compatibility purposes, it is necessary to use the real48 format, then you should limit its use by converting to one of the other formats immediately after obtaining the real48 value and converting back just before storage. We will focus our attention here on the native formats: single, double and extended.

Single (32 bits)

bits 0-22 = Mantissa
bits 23-30 = Exponent
bit 31 = Sign

Double (64 bits)

bits 0-51 = Mantissa
bits 52-62 = Exponent
bit 63 = Sign

Extended (80 bits)

bits 0-63 = Mantissa
bits 64-78 = Exponent
bit 79 = Sign

Figure 1: The native floating point formats.
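
The widths in the figure can be checked with SizeOf. The following is only a minimal console sketch (the program name is made up for illustration). It assumes a 32-bit Delphi compiler; on other targets, such as 64-bit Windows, Extended may be mapped to a narrower type, so the last line may not report 80 bits there.

```pascal
program FloatSizes;
{$APPTYPE CONSOLE}
begin
  // SizeOf returns the size in bytes; multiply by 8 to get bits.
  WriteLn('Single  : ', SizeOf(Single) * 8, ' bits');
  WriteLn('Double  : ', SizeOf(Double) * 8, ' bits');
  // Assumption: 80 bits on 32-bit Delphi; may differ on other targets.
  WriteLn('Extended: ', SizeOf(Extended) * 8, ' bits');
end.
```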

Let's have a closer look at the single format. As the table above shows, it represents the stored value as follows:

bits 0-22 = Mantissa; bits 23-30 = Exponent; bit 31 = Sign

The mantissa is in fact not 23, but 24 bits wide. So where is the 24th bit then? The answer is simple, but at the same time ingenious: the value is stored normalised. This means that the digit before the binary point (which can only be 0 or 1 in our binary system, of course) is always 1. By adjusting the exponent, the value can be scaled so that the digit before the binary point equals 1. Because this leading digit of the mantissa is always one, there is no need to store it! In this format, the leading digit is implicitly known to be one.

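The exponent is biased: to get the real exponent, you have to subtract 127 from the stored value. Thus, when the stored exponent is less than 127, the real exponent is negative and the magnitude of the value is less than 1. The stored exponent value 255 (all bits set) is reserved: it indicates infinity when all mantissa bits are zero, and NaN (Not a Number) otherwise. The stored exponent value 0 is reserved as well, for zero and for very small denormalised numbers, which do not carry the implicit leading 1. The sign bit indicates the sign, with 0 meaning positive and 1 negative.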

An example: a single variable with content $42996CE8

= 0 10000101 00110010110110011101000

Mantissa = 00110010110110011101000
         = 2^-3 + 2^-4 + 2^-7 + 2^-9 + 2^-10 + 2^-12 + 2^-13 + 2^-16 + 2^-17 + 2^-18 + 2^-20
         = 0.1986360549927

Add the implicit 1:

         = 1.1986360549927

Exponent = 10000101
         = 133

Subtract bias 127 
         = 6

Sign = 0 = positive

The stored value is thus: 

   1.1986360549927 * 2^6
=  1.1986360549927 * 64 
= 76.7127075195 (rounded to 10 decimal places)
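
The decoding steps above can be sketched in a few lines of Delphi. This is a minimal console sketch rather than a definitive implementation: the program and variable names are made up for illustration, and Move is used only to copy the raw 32 bits into a Single variable.

```pascal
program DecodeSingle;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  Bits, SignBit, StoredExp, Fraction: Cardinal;
  Value: Single;
begin
  Bits := $42996CE8;
  // Reinterpret the 32 raw bits as a Single (no numeric conversion).
  Move(Bits, Value, SizeOf(Value));

  SignBit   := Bits shr 31;            // bit 31
  StoredExp := (Bits shr 23) and $FF;  // bits 23-30
  Fraction  := Bits and $007FFFFF;     // bits 0-22

  WriteLn('Sign bit        : ', SignBit);                  // 0
  WriteLn('Stored exponent : ', StoredExp);                // 133
  WriteLn('Real exponent   : ', Integer(StoredExp) - 127); // 6
  WriteLn('Fraction bits   : $', IntToHex(Fraction, 6));   // $196CE8
  // Add the implicit 1; 8388608 = 2^23 scales the 23 fraction bits.
  WriteLn('Mantissa        : ', FloatToStr(1 + Fraction / 8388608));
  // Should print a value close to 76.7127.
  WriteLn('Value           : ', FloatToStr(Value));
end.
```

The decimal separator in the last two lines depends on the regional settings of the machine.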

The double format follows the same rules, but its mantissa and exponent parts are bigger, so it can store numbers with greater precision and range. The extended format, however, differs slightly from its single and double counterparts in that the integer part is stored explicitly, in bit 63. This integer bit absorbs any carry values, and the 64-bit mantissa gives a precision of up to 19 significant digits. By contrast, in the single and double formats, the integer part is implicit and always one. You should also realise that the FPU internally always works on extended types (also known as the temporary real format). This means that every time you load a single or double value into the FPU, it is converted to the extended format.

As one can see, the floating point format has a major disadvantage: if the value cannot be written as a finite and exact sum of powers of the base number (2 for our binary system), the stored value will only be an approximation of the desired value. Only numbers that are a sum of powers of 2 fitting within the width of the mantissa (such as 0.25, 0.75 or 3) can be represented exactly.

For instance, the value 0.25 can be exactly represented in the single format as:

   0 01111101 00000000000000000000000
 = $3E800000

Mantissa = 00000000000000000000000
         = 0

Add the implicit 1 
         = 1

Exponent = 01111101 
         = 125

Subtract bias 127 
         = -2

Sign = 0 = positive

The value is thus: 

    1 * 2^-2
  = 0.25
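
To see the difference between an exact and an approximated value, one can print the stored bit pattern. The sketch below is only an illustration: SingleBits is a hypothetical helper, not part of the Delphi run-time library, and 0.1 is chosen as an example of a value that has no finite binary representation.

```pascal
program ExactOrNot;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: returns the raw 32 bits of a Single.
function SingleBits(const Value: Single): Cardinal;
begin
  Move(Value, Result, SizeOf(Result));
end;

var
  Quarter, Tenth: Single;
begin
  Quarter := 0.25;  // 2^-2, exactly representable
  Tenth   := 0.1;   // no finite binary expansion, stored rounded

  WriteLn('0.25 is stored as $', IntToHex(SingleBits(Quarter), 8)); // $3E800000
  WriteLn('0.1  is stored as $', IntToHex(SingleBits(Tenth), 8));   // $3DCCCCCD

  // With 15 significant digits the rounding of 0.1 becomes visible:
  // the second line should show roughly 0.100000001490116, not 0.1.
  WriteLn(FloatToStrF(Quarter, ffGeneral, 15, 0));
  WriteLn(FloatToStrF(Tenth, ffGeneral, 15, 0));
end.
```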

Now it is easier to understand that floating point numbers often do not represent exact values, but just close approximations of the desired value. You will have to take this into account when using FPV in your code. A major pitfall is testing for equality when using floating point values. Programmers occasionally want to do something like this:

if MyVar=2.1 then DoThis else DoThat;

If MyVar is an FPV, the DoThis part would most likely never be executed, simply because the key value of 2.1 cannot be exactly represented in binary floating point format: MyVar holds only an approximation, whose precision depends on its type, and that approximation need not match the one the compiler uses for the constant. Furthermore, in many cases, the contents of MyVar will be the result of several earlier calculations. That would mean that even if the key value could be exactly represented in the binary floating point format, the calculation would involve other numbers that might be approximations instead of exact representations, so that the final result would only be close to the desired value. Hence, the algorithm would fail. Possible techniques one could use to overcome this problem are to truncate or round the FPV to an integer:

if Round(MyVar)=2 then ...

You could easily scale the variable if you need precision involving one or more digits after the decimal separator:

if Round(MyVar * 10)=21 then ...

You could also use smaller than/greater than instead of comparing for equality:

if MyVar>2.1 then ...

In case you have to compare two floating point values, do not compare them for equality, but compare their difference against a threshold:

if Abs(MyVarOne-MyVarTwo)<0.0001 then DoEqual else DoDifferent;
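
The techniques above can be tried side by side in a small console program. This is a sketch under stated assumptions: MyVar is assumed to be a Single, NearlyEqual is a hypothetical helper written for this example (it is not part of the Delphi run-time library), and the threshold of 0.0001 is simply the one used above.

```pascal
program CompareDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: True when A and B differ by less than Tolerance.
function NearlyEqual(const A, B, Tolerance: Extended): Boolean;
begin
  Result := Abs(A - B) < Tolerance;
end;

var
  MyVar: Single;
begin
  MyVar := 2.1;  // stored as the nearest Single, slightly below 2.1

  // Direct comparison: the Single approximation is compared with a
  // more precise constant, so this is typically FALSE.
  WriteLn('MyVar = 2.1            : ', MyVar = 2.1);

  // Scaling and rounding to an integer hides the tiny difference.
  WriteLn('Round(MyVar * 10) = 21 : ', Round(MyVar * 10) = 21);

  // Comparing the difference against a threshold.
  WriteLn('NearlyEqual(MyVar, 2.1): ', NearlyEqual(MyVar, 2.1, 0.0001));
end.
```

The right threshold depends on the magnitude of the values involved; a fixed 0.0001 is too coarse for very small numbers and too strict for very large ones. Delphi 6 and later also provide a comparable function, SameValue, in the Math unit.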

Often, the best solution is to use (scaled) integers where possible. In Delphi, the Currency type is in fact a scaled 64-bit integer that represents numbers with up to four digits after the decimal separator. So, the stored integer value 65481 in fact designates 6.5481. When performing additions and subtractions on such integers, the resulting integer will be an exact representation of the result. Needless to say, integer handling in the CPU is more efficient and much faster than using the FPU. Do note, however, that the Currency type, although it is an integer, is not handled in the CPU but as an integer in the FPU.
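
The scaling can be made visible by copying the raw contents of a Currency variable into an Int64. The following is a minimal sketch with made-up variable names; Move is used only to look at the stored integer and is not something you would normally need in application code.

```pascal
program CurrencyDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  Amount, Step, Total: Currency;
  Raw: Int64;
  I: Integer;
begin
  Amount := 6.5481;
  // Copy the 8 raw bytes of the Currency value into an Int64.
  Move(Amount, Raw, SizeOf(Raw));
  WriteLn('6.5481 is stored as the integer ', Raw);  // 65481

  // Adding Currency values is plain integer addition on the scaled
  // values, so no rounding error accumulates.
  Step := 0.1;   // stored as 1000
  Total := 0;
  for I := 1 to 10 do
    Total := Total + Step;
  Move(Total, Raw, SizeOf(Raw));
  WriteLn('Ten times 0.1 is stored as ', Raw);       // 10000
  WriteLn('Displayed value: ', CurrToStr(Total));
end.
```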

(1) Before D4, the non-native format was called Real instead of Real48, while Real now in fact indicates a Double type.

Further reading

“What Every Computer Scientist Should Know About Floating-Point Arithmetic”, by David Goldberg
IEEE 754: Standard for Binary Floating-Point Arithmetic


This page was last updated 16 May 2006.