Updated: 2024-10-21 19:09:18

Decompose floating-point number

Given a floating-point number, I would like to separate it into a sum of parts, each with a given number of bits. For example, given 3.1415926535 and told to separate it into base-10 parts of 4 digits each, it would return 3.141 + 5.926E-4 + 5.350E-8. Actually, I want to separate a double (which has 53 bits of precision, 52 of them stored explicitly) into three parts with 18 bits of precision each, but it was easier to explain with a base-10 example. I am not necessarily averse to tricks that use the internal representation of a standard double-precision IEEE float, but I would really prefer a solution that stays purely in the floating-point realm, so as to avoid any issues with endianness or non-standard floating-point representations.

No, this is not a homework problem, and, yes, this has a practical use. If you want to ensure that floating point multiplications are exact, you need to make sure that any two numbers you multiply will never have more than half the digits that you have space for in your floating point type. Starting from this kind of decomposition, then multiplying all the parts and convolving, is one way to do that. Yes, I could also use an arbitrary-precision floating-point library, but this approach is likely to be faster when only a few parts are involved, and it will definitely be lighter-weight.

Best answer

If you want to ensure that floating point multiplications are exact, you need to make sure that any two numbers you multiply will never have more than half the digits that you have space for in your floating point type.

Exactly. This technique can be found in Veltkamp/Dekker multiplication. While accessing the bits of the representation, as in other answers, is a possibility, you can also do it with only floating-point operations. There is one instance in this blog post. The part you are interested in is:

    Input: f; coef is 1 + 2^N
    p = f * coef;
    q = f - p;
    h = p + q; // h contains the 53-N highest bits of f
    l = f - h; // l contains the N lowest bits of f
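The pseudocode above could be written in C roughly as follows. This is only a sketch under stated assumptions: the function name veltkamp_split and its signature are made up for illustration, double is assumed to be IEEE 754 binary64 with round-to-nearest, and f * coef is assumed not to overflow. The volatile qualifier is one possible way to keep the compiler from fusing the multiplication and the subtraction into a single operation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split f into h + l using only floating-point
 * operations. h keeps the 53 - n highest significand bits of f and
 * l keeps the remaining low bits, so that f == h + l exactly.
 * Assumes IEEE 754 binary64, round-to-nearest, no overflow in f * coef. */
void veltkamp_split(double f, int n, double *h, double *l)
{
    assert(n >= 2 && n <= 51);

    /* coef = 1 + 2^n; both the conversion and the addition are exact. */
    const double coef = (double)((uint64_t)1 << n) + 1.0;

    /* volatile forces the product to be rounded to double and prevents
     * it from being contracted with the subtraction below. */
    volatile double p = f * coef;
    double q = f - p;

    *h = p + q;   /* the 53 - n highest bits of f */
    *l = f - *h;  /* the low bits of f */
}
```

With n = 27 this gives the two 26-bit halves used in Dekker's multiplication; with n = 35 the high part has 18 bits.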

*, -, and + must be exactly the IEEE 754 operations at the precision of f for this to work. On Intel architectures, these operations are provided by the SSE2 instruction set. Visual C sets the precision of the historical (x87) FPU to 53 bits in the prelude of the C programs it compiles, which also helps.
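For the three-part case in the question, one option is to apply the same split twice with N = 35, since 53 - 35 = 18. The sketch below is an illustration rather than a definitive implementation: the names high18 and decompose3 are hypothetical, and it assumes IEEE 754 binary64 doubles, round-to-nearest, and values far from the overflow and subnormal ranges.

```c
#include <assert.h>

/* Hypothetical helper: round x to its 18 most significant bits using
 * only floating-point operations (the split above with N = 35). */
double high18(double x)
{
    const double coef = 34359738369.0;  /* 1 + 2^35 */
    /* volatile keeps the product rounded to double and stops the
     * compiler from fusing it with the subtraction below. */
    volatile double p = x * coef;
    double q = x - p;
    return p + q;
}

/* Hypothetical helper: write f as parts[0] + parts[1] + parts[2],
 * where each part has at most 18 significant bits. */
void decompose3(double f, double parts[3])
{
    parts[0] = high18(f);
    double rest = f - parts[0];   /* exact; fits in at most 35 bits */
    parts[1] = high18(rest);
    parts[2] = rest - parts[1];   /* exact; the leftover low bits */

    /* Sanity check: the last part is unchanged by rounding to 18 bits. */
    assert(high18(parts[2]) == parts[2]);
}
```

Because each part has at most 18 significant bits, the product of any two parts needs at most 36 bits and is therefore exact in a double, barring overflow and underflow.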

Published: 2023-07-05 07:00:00

Article link: https://www.elefans.com/category/jswz/34/1034423.html