Scala compiler optimization for immutability



Does the Scala compiler optimize for memory usage by removing references to vals used only once within a block?

Imagine an object holding, in aggregate, some huge amount of data, at a size where cloning the data or derivatives of it may well approach the maximum memory available to the JVM or the machine.

A minimal code example, but imagine a longer chain of data transforms:

val huge: HugeObjectType
val derivative1 = huge.map(_.x)
val derivative2 = derivative1.groupBy(....)

Will the compiler, for example, leave huge eligible for garbage collection after derivative1 has been computed? Or will it keep huge alive until the wrapping block is exited?

Immutability is nice in theory, and I personally find it addictive. But for big data objects that cannot be stream-processed item by item on current-day operating systems, I would claim that immutability is inherently impedance-mismatched with reasonable memory utilization in a big data application on the JVM, unless compilers optimize for cases such as this one.

Accepted answer


First of all: the actual freeing of unused memory happens whenever the JVM GC deems it necessary. So there is nothing scalac can do about this.

The only thing that scalac could do would be to set references to null not just when they go out of scope, but as soon as they are no longer used.

Basically, the transformation would look like this (hypothetical code inserted by scalac; reassigning a val is not legal source Scala, so this could only happen at the bytecode level):

val huge: HugeObjectType
val derivative1 = huge.map(_.x)
huge = null // inserted by scalac
val derivative2 = derivative1.groupBy(....)
derivative1 = null // inserted by scalac

According to this thread on scala-internals, scalac currently does not do this, nor does the latest HotSpot JVM pick up the slack. See the post by scalac hacker Grzegorz Kossakowski and the rest of that thread.

For a method that is being optimized by the JVM JIT compiler, the JIT compiler will null out references as soon as possible. However, a main method that is executed only once will never be fully optimized by the JVM.
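
A minimal, self-contained sketch of the practical consequence (this is not code from the thread, and the names are illustrative): moving the transformation chain into its own method gives the JIT compiler a frame it can optimize, whereas a once-executed main body never gets that treatment.

// Hypothetical helper: once this method is JIT-compiled, the reference
// held in `data` can be treated as dead right after its last use.
def process(data: Vector[Int]): Map[Int, Vector[Int]] = {
  val doubled = data.map(_ * 2) // last use of `data`
  doubled.groupBy(_ % 10)       // last use of `doubled`
}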

The thread linked above contains a pretty detailed discussion of the topic and all the tradeoffs.

Note that in typical big data computing frameworks such as Apache Spark, the values you work with are not direct references to the data. So in these frameworks the lifetime of references is usually not a problem.
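
A hedged sketch of what this means in Spark terms (the input path and app name are placeholders): an RDD value is a lazy lineage description, not the data itself, so holding on to intermediate RDD references does not pin the underlying records in memory.

import org.apache.spark.{SparkConf, SparkContext}

val sc      = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
val lines   = sc.textFile("/path/to/input") // nothing is read yet
val lengths = lines.map(_.length)           // still only a plan
val total   = lengths.count()               // data streams through only here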

For the example given above, all intermediate values are used exactly once, so an easy solution is to just define all the intermediate results as defs. Since a def is re-evaluated at each use site rather than stored in a local, nothing holds on to an intermediate collection once the next transform has consumed it.

def huge: HugeObjectType
def derivative1 = huge.map(_.x)
def derivative2 = derivative1.groupBy(....)
val result = derivative2.<some other transform>
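
A concrete, hedged rendering of the def-chain idea (Record, the data, and the grouping key are illustrative stand-ins for the elided types above):

final case class Record(x: Int)

def huge: Vector[Record] = Vector.tabulate(1000000)(Record(_))
def derivative1: Vector[Int] = huge.map(_.x)
def derivative2: Map[Int, Vector[Int]] = derivative1.groupBy(_ % 10)

// Each def body runs exactly once along this chain, and no val ever
// pins an intermediate collection. Note that referencing derivative1
// twice would rebuild huge twice, so this only pays off when each
// step really is used exactly once.
val result = derivative2.size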

A different yet very potent approach is to use iterators! Chaining functions such as map and filter over an iterator processes elements one by one, so no intermediate collection is ever materialized, which fits the scenario very well. This will not help with functions like groupBy, which need to see the whole collection, but it may significantly reduce memory allocation for the preceding functions and similar ones. Credits to Simon Schafer from the thread mentioned above.
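
A minimal sketch of the iterator approach (the transforms are illustrative): each element flows through the whole chain one at a time, and only the final step allocates a collection.

val source: Iterator[Int] = Iterator.range(0, 1000000)

val result = source
  .map(_ * 2)         // applied lazily, one element at a time
  .filter(_ % 3 == 0) // also lazy, fused with the map above
  .toVector           // the only step that materializes a collection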
