Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. JavaScript uses a 64-bit floating point representation, which is the same as Java's double. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.
For 0.1 in the standard binary64 format, the representation can be written exactly as
- 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
- 0x1.999999999999ap-4 in C99 hexfloat notation.
In contrast, the rational number 0.1, which is 1/10, can be written exactly as
- 0.1 in decimal, or
- 0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.
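The stored value of the double 0.1 can be inspected with a short Python sketch. It assumes CPython, whose float is an IEEE 754 binary64 double on common platforms; the variable names are illustrative only.

```python
from decimal import Decimal
from fractions import Fraction

x = 0.1  # the literal is rounded to the nearest binary64 value

# Converting a float to Decimal is exact, so this shows every digit
# of the value that is actually stored.
exact_decimal = Decimal(x)
print(exact_decimal)

# float.hex() gives the C99-style hexfloat form of the same value.
hex_form = x.hex()
print(hex_form)

# Fraction(x) is exact as well: the stored value is an integer divided
# by a power of two, so it cannot be equal to 1/10.
stored = Fraction(x)
denominator_is_power_of_two = stored.denominator & (stored.denominator - 1) == 0
print(denominator_is_power_of_two, stored == Fraction(1, 10))
```

The two printed representations should match the decimal and hexfloat forms given for binary64 above.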
The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
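The direction of each approximation can be checked with exact rational arithmetic. This is a minimal Python sketch, assuming binary64 floats; Fraction(float) converts the stored double without any further rounding, and the names are illustrative.

```python
from fractions import Fraction

tenth = Fraction(1, 10)
fifth = Fraction(1, 5)
three_tenths = Fraction(3, 10)

# Compare each stored double with the rational number it approximates.
double_01_is_larger = Fraction(0.1) > tenth
double_02_is_larger = Fraction(0.2) > fifth
double_03_is_smaller = Fraction(0.3) < three_tenths
print(double_01_is_larger, double_02_is_larger, double_03_is_smaller)

# The rounded sum lands above 3/10, while the constant 0.3 sits below it.
total = 0.1 + 0.2
sum_is_larger = Fraction(total) > three_tenths
print(repr(total), total == 0.3, sum_is_larger)
```

Because the sum and the constant fall on opposite sides of 3/10, they are two different doubles and the equality test fails.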
A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.
Answer 2:
A Hardware Designer's Perspective
I believe I should add a hardware designer's perspective to this, since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in the software, and ultimately, I hope this helps explain why floating point errors happen and seem to accumulate over time.
1. Overview
From an engineering perspective, most floating point operations will have some element of error, since the hardware that does the floating point computations is only required to have an error of less than one half of one unit in the last place. Therefore, much hardware will stop at a precision that is just sufficient to yield an error of less than one half of one unit in the last place for a single operation, which is especially problematic in floating point division. What constitutes a single operation depends upon how many operands the unit takes. For most, it is two, but some units take 3 or more operands. Because of this, there is no guarantee that repeated operations will result in a desirable error, since the errors add up over time.
2. Standards
Most processors follow the IEEE-754 standard, but some use denormalized numbers or different standards. For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following, however, will cover the normalized mode of IEEE-754, which is the typical mode of operation.
In the IEEE-754 standard, hardware designers are allowed any value of error/epsilon as long as it's less than one half of one unit in the last place, and the result only has to be less than one half of one unit in the last place for one operation. This explains why when there are repeated operations, the errors add up. For IEEE-754 double precision, this is the 54th bit, since 53 bits are used to represent the numeric part (normalized), also called the mantissa, of the floating point number (e.g. the 5.3 in 5.3e5). The next sections go into more detail on the causes of hardware error on various floating point operations.
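The half-unit bound for a single operation can be illustrated in Python. This is only a sketch, assuming binary64 floats and the default round-to-nearest mode; ulp_of is a hypothetical helper written for this example, valid for normal, nonzero values.

```python
import math
import sys
from fractions import Fraction

def ulp_of(x):
    """Size of one unit in the last place of x (normal, nonzero doubles only)."""
    _, exponent = math.frexp(x)  # x = m * 2**exponent with 0.5 <= |m| < 1
    return Fraction(2) ** (exponent - sys.float_info.mant_dig)

print(sys.float_info.mant_dig)  # number of significand bits in a double

a, b = 0.1, 0.2
computed = a + b                    # one hardware addition, rounded once
exact = Fraction(a) + Fraction(b)   # exact sum of the two stored doubles
error = abs(Fraction(computed) - exact)
ulp = ulp_of(computed)

within_half_ulp = error <= ulp / 2
print(within_half_ulp)
```

Note that the error here is measured against the exact sum of the two stored operands, not against the decimal constants typed in the source code.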
3. Cause of Rounding Error in Division
The main cause of the error in floating point division is the division algorithm used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, i.e. Z = X/Y is computed as Z = X * (1/Y). Division is computed iteratively, i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in slow division, and the size in bits of the quotient selection table is usually the width of the radix, or number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k >= 2. So, for example, a typical quotient selection table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2 + 2 = 4 bits (plus a few optional bits).
3.1 Division Rounding Error: Approximation of Reciprocal
Which reciprocals are in the quotient selection table depends on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, and fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.
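The idea of dividing by repeated multiplication with an approximate reciprocal can be sketched in software. The following Python function is a simplified, Goldschmidt-style iteration written for illustration only: it uses ordinary double arithmetic instead of a lookup table and guard bits, handles only positive divisors, and does not model any particular hardware divider. The function name and the fixed iteration count are assumptions of this sketch.

```python
import math

def goldschmidt_divide(x, y, iterations=6):
    """Approximate x / y for y > 0 by repeated multiplication."""
    if y <= 0:
        raise ValueError("this sketch only handles a positive divisor")
    # Scale the divisor into [0.5, 1) and scale the dividend by the same
    # power of two, which leaves the quotient unchanged.
    mantissa, exponent = math.frexp(y)
    n = math.ldexp(x, -exponent)
    d = mantissa
    for _ in range(iterations):
        f = 2.0 - d  # correction factor: a rough approximation of 1/d
        n *= f       # each multiplication is rounded,
        d *= f       # so a small error can enter at every step
    return n         # d has converged towards 1, so n approximates x / y

approx = goldschmidt_divide(1.0, 3.0)
reference = 1.0 / 3.0
print(approx, reference, abs(approx - reference))
```

Since every step multiplies by an approximation and rounds the product, the final value is not guaranteed to match the correctly rounded quotient in the last bit unless extra precision is carried, which is the role the guard bits play in hardware.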
4. Rounding Errors in Other Operations: Truncation
Another cause of the rounding errors in all operations is the set of rounding modes for the final answer that IEEE-754 allows: truncation (round-towards-zero), round-to-nearest (the default), round-down, and round-up. Round-to-nearest introduces an error of at most one half of one unit in the last place for a single operation; the other modes can introduce an error of up to one unit in the last place. Over time and repeated operations, this rounding also adds cumulatively to the resultant error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.
5. Repeated Operations
Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is the reason that in computations that require a bounded error, mathematicians use methods such as rounding to the nearest even digit in the last place of IEEE-754, because over time the errors are more likely to cancel each other out, and Interval Arithmetic combined with variations of the IEEE 754 rounding modes to predict rounding errors and correct them. Because of its low relative error compared to other rounding modes, round-to-nearest-even digit (in the last place) is the default rounding mode of IEEE-754.
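The growth of error over repeated operations is easy to observe. This is a minimal Python sketch, assuming binary64 floats. math.fsum is not one of the techniques named above; it is used here only as a reference that rounds once at the end instead of after every addition.

```python
import math

# Add 0.1 ten times; every += rounds the running total to a double.
total = 0.0
for _ in range(10):
    total += 0.1

print(total)         # 0.9999999999999999
print(total == 1.0)  # False

# Summing the same ten doubles with a single final rounding.
single_rounding = math.fsum([0.1] * 10)
print(single_rounding == 1.0)  # True
```

Each individual addition is within the per-operation bound, yet the ten roundings together leave the result one unit in the last place away from 1.0.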
Note that the default rounding mode, round-to-nearest even digit in the last place, guarantees an error of less than one half of one unit in the last place for one operation. Using truncation, round-up, or round-down alone may result in an error that is greater than one half of one unit in the last place, but less than one unit in the last place, so these modes are not recommended unless they are used in Interval Arithmetic.
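The difference between the rounding modes can be illustrated with Python's decimal module. This is a decimal analogue, not binary IEEE-754 hardware: it assumes a toy precision of 4 significant digits so that one unit in the last place of 2/3 is 0.0001, and the variable names are illustrative.

```python
from decimal import (Context, Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)
from fractions import Fraction

exact = Fraction(2, 3)
ulp = Fraction(1, 10 ** 4)  # unit in the last place at 4 digits, for values in [0.1, 1)

errors = {}
for mode in (ROUND_HALF_EVEN, ROUND_DOWN, ROUND_FLOOR, ROUND_CEILING):
    ctx = Context(prec=4, rounding=mode)
    quotient = ctx.divide(Decimal(2), Decimal(3))    # one rounded operation
    errors[mode] = abs(Fraction(quotient) - exact)   # exact rounding error
    print(mode, quotient, "within half a unit:", errors[mode] <= ulp / 2)
```

For this quotient, round-to-nearest-even stays within half a unit in the last place, while truncation and round-down land between half a unit and one unit away. Round-up happens to agree with round-to-nearest here, but in general it is only bounded by one unit.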
6. Summary
In short, the fundamental reason for the errors in floating point operations is a combination of the truncation in hardware, and the truncation of a reciprocal in the case of division. Since the IEEE-754 standard only requires an error of less than one half of one unit in the last place for a single operation, the floating point errors over repeated operations will add up unless corrected.