GC Overhead Limit Exceeded for Large CSV Reads in Scala



So. I'm using Scala, and I'm relatively new to it (mostly a Python guy). I'm compiling and running my code via sbt. I'm on an Ubuntu box, currently running Java 6. I have two CSVs; I need to take them, process them, then manipulate them. Each CSV is ~250MB; if this works I'm likely to repeat this process with much larger CSVs.

I've defined a function that reads in a CSV and writes each row into the data structure I need. I call this function on each CSV in series. Problem is: it returns perfectly (and very quickly) for the first CSV, but the second one always throws a java.lang.OutOfMemoryError: GC overhead limit exceeded error.

I've tried rather a number of things. My build.sbt defines javaOptions += "-Xmx20480m -XX:+HeapDumpOnOutOfMemoryError"; I've tried using -XX:-UseGCOverheadLimit as well, but that doesn't seem to help anything. According to the Java docs I've been reading, that error indicates that a huge amount of system resource is being spent on garbage collection -- but I'm frankly unclear what it's garbage collecting, or how to trim it down. I assume my function must be... leaking memory somewhere, or I must be mis-using Scala, but I can't see how.

Here's my function:

import scala.collection.mutable
import scala.io.Source

def readAndProcessData(path: String) = {
  // Read the file, skipping the header row
  val fileLines = Source.fromFile(path).getLines.drop(1)
  val ret = mutable.Map[String, List[Tuple2[String, String]]]()

  // Append (col1, col2) to the list keyed by col0
  def addRowToRet(row: String) = {
    val rowArray = row.split(",")
    if (!(ret contains rowArray(0))) {
      ret.update(rowArray(0), List[Tuple2[String, String]]())
    }
    ret(rowArray(0)) = Tuple2(rowArray(1), rowArray(2)) :: ret(rowArray(0))
  }

  for (row <- fileLines) {
    addRowToRet(row)
  }

  // Return a new map with each value list sorted
  ret.map { tup => (tup._1 -> tup._2.sorted) }
}

Thanks!

Accepted Answer


First, if you're not forking to run, either enable forking or up the memory limit for sbt itself and remove the javaOptions setting (javaOptions only takes effect for a forked JVM). Forking may be a good idea here so you are not intermixing the memory usage behavior of your program with that of sbt.
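For reference, a minimal build.sbt sketch of that kind of setup (the heap size here is just an illustrative value, and each JVM flag goes in as its own element of the sequence):

fork := true

javaOptions ++= Seq(
  "-Xmx4g",                          // illustrative heap size; tune to your machine
  "-XX:+HeapDumpOnOutOfMemoryError"
)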

You also should close the Source object you are creating to make sure its resources are released.
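A minimal sketch of one way to do that, using the same scala.io.Source as in the question (the lines are materialized before closing, since getLines is lazy):

val source = Source.fromFile(path)
val fileLines =
  try source.getLines.drop(1).toList   // force reading before the file is closed
  finally source.close()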

Is it crashing in a consistent place, e.g. when sorting? Or does the crash occur at pretty random spots in the code?

I assume that the files you are reading are in an encoding such as ASCII or UTF8 where most to all of the characters are represented with 8 bits. Java uses 16 bits per character, so keep in mind that you are more than doubling the size (the "more than" is due to other overheads) by reading it into Java strings. That in itself shouldn't push you over, but it means by the time you have two 250MB files loaded you'll probably be consuming more than 1GB of memory for the data.
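As a rough back-of-envelope figure (assuming an 8-bit on-disk encoding, UTF-16 strings on the heap, and ignoring per-object and per-String overhead):

2 files x 250 MB x 2 bytes per char ≈ 1 GB of character data alone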

How distributed are your keys relative to the number of lines in your files? In other words, is there an entry in your map for almost every line, for about half the lines, a quarter, etc.? You potentially could have a pretty big map (in terms of entries), and when you perform the "map" operation on it to sort the values you'll end up with two of them in memory until the function returns and the old one becomes collectible. You also might want to try using an immutable map or a wrapper around a Java mutable map. Sometimes Scala's mutable data structures aren't as robust as their immutable counterparts.
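As a sketch of the Java-map-wrapper idea (JavaConverters ships with Scala 2.8.1+; the in-place transform at the end is one way to avoid holding a second full map just to sort the values):

import java.util.{HashMap => JHashMap}
import scala.collection.JavaConverters._
import scala.collection.mutable

val ret: mutable.Map[String, List[(String, String)]] =
  new JHashMap[String, List[(String, String)]]().asScala  // mutable Scala view backed by the Java map

// ... fill ret exactly as in the original function ...

ret.transform((key, values) => values.sorted)  // sort each value list in place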

Also, I've never had good luck with scala.io.Source. If it's still failing once you're fairly certain you actually have enough memory allocated, you might want to try dropping down to using Java's IO libraries.
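If you do drop down to java.io, a minimal sketch (assuming the same layout with one header line to skip):

import java.io.{BufferedReader, FileReader}

val reader = new BufferedReader(new FileReader(path))
try {
  reader.readLine()                  // skip the header row
  var line = reader.readLine()
  while (line != null) {
    // handle the row here, e.g. addRowToRet(line)
    line = reader.readLine()
  }
} finally {
  reader.close()
}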

Finally, if checking a few settings and poking at it a bit doesn't work, you should hook up a memory profiler to it such as VisualVM. That way you have a shot at figuring out where your problem really is rather than doing guess-and-check with modifications.
