How can I tell when my dataset in R is going to be too large?

I am going to be undertaking some logfile analyses in R (unless I can't do it in R), and I understand that my data needs to fit in RAM (unless I use some kind of fix like an interface to a keyval store, maybe?). So I am wondering how to tell ahead of time how much room my data is going to take up in RAM, and whether I will have enough. I know how much RAM I have (not a huge amount - 3GB under XP), and I know how many rows and cols my logfile will end up as and what data types the col entries ought to be (which presumably I need to check as it reads).
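
Since the row count, column count, and column types are known up front, a back-of-envelope check is just rows x cols x bytes per value. A minimal sketch in R, where the counts are made-up placeholders rather than real logfile dimensions:

    # Rough RAM estimate: rows x cols x bytes per element.
    # Numeric (double) columns take 8 bytes each, integers 4.
    n_rows <- 5e6                    # placeholder: lines in the logfile
    n_cols <- 10                     # placeholder: parsed fields per line
    bytes_per_value <- 8             # assume numeric throughout
    n_rows * n_cols * bytes_per_value / 2^30   # ~0.37 GB in this example

A common rule of thumb is to want two to three times that much free RAM, since many R operations copy their arguments.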

How do I put this together into a go/nogo decision for undertaking the analysis in R? (Presumably R needs to be able to have some RAM to do operations, as well as holding the data!) My immediate required output is a bunch of simple summary stats, frequencies, contingencies, etc, and so I could probably write some kind of parser/tabulator that will give me the output I need short term, but I also want to play around with lots of different approaches to this data as a next step, so am looking at feasibility of using R.

I have seen lots of useful advice about large datasets in R here, which I have read and will reread, but for now I would like to understand better how to figure out whether I should (a) go there at all, (b) go there but expect to have to do some extra stuff to make it manageable, or (c) run away before it's too late and do something in some other language/environment (suggestions welcome...!). thanks!

Accepted answer

R is well suited for big datasets, either using out-of-the-box solutions like bigmemory or the ff package (especially read.csv.ffdf) or by processing your stuff in chunks using your own scripts. In almost all cases a little programming makes processing large datasets (>> memory, say 100 Gb) very possible. Doing this kind of programming yourself takes some time to learn (I don't know your level), but it makes you really flexible. Whether this is your cup of tea, or whether you need to run away, depends on the time you want to invest in learning these skills. But once you have them, they will make your life as a data analyst much easier.
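
As a concrete sketch of the "own scripts" route, the following base-R loop reads a csv-style logfile through an open connection in fixed-size chunks, keeping only a running tally in memory; the filename and the status column are assumptions for illustration:

    # An open connection lets read.csv resume where it left off,
    # so the whole file never has to fit in memory at once.
    con <- file("access.log.csv", open = "r")          # hypothetical file
    hdr <- strsplit(readLines(con, n = 1), ",")[[1]]   # header line
    status_counts <- c()                               # running tally
    repeat {
      chunk <- tryCatch(
        read.csv(con, header = FALSE, col.names = hdr, nrows = 10000),
        error = function(e) NULL)                      # NULL at end of input
      if (is.null(chunk)) break
      tab <- table(chunk$status)                       # hypothetical column
      for (s in names(tab))
        status_counts[s] <- sum(status_counts[s], tab[s], na.rm = TRUE)
    }
    close(con)
    status_counts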

In regard to analyzing logfiles, I know that the stats pages generated from Call of Duty 4 (a multiplayer computer game) work by parsing the log file iteratively into a database, and then retrieving the statistics per user from the database. See here for an example of the interface. The iterative (in-chunks) approach means that logfile size is (almost) unlimited. However, getting good performance is not trivial.
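
A minimal sketch of that parse-into-a-database pattern, assuming the DBI and RSQLite packages and a whitespace-delimited log; the filename and the time/player/event fields are hypothetical:

    # Parse the logfile iteratively and append each chunk to SQLite;
    # summaries then become SQL queries instead of in-RAM operations.
    library(DBI)
    library(RSQLite)
    db  <- dbConnect(SQLite(), "logs.sqlite")
    con <- file("game.log", open = "r")                # hypothetical file
    repeat {
      lines <- readLines(con, n = 50000)
      if (length(lines) == 0) break
      chunk <- read.table(text = lines,
                          col.names = c("time", "player", "event"))
      dbWriteTable(db, "events", chunk, append = TRUE)
    }
    close(con)
    per_user <- dbGetQuery(db,
      "SELECT player, COUNT(*) AS n FROM events GROUP BY player")
    dbDisconnect(db)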

A lot of the stuff you can do in R, you can also do in Python or Matlab, or even C++ or Fortran. But I would only see a distinct advantage of such a tool over R if it has out-of-the-box support for what you want. For processing large data, see the HPC Task View. See also an earlier answer of mine about reading a very large text file in chunks. Other related links that might be interesting for you:

- Quickly reading very large tables as dataframes in R
- https://stackoverflow.com/questions/1257021/suitable-functional-language-for-scientific-statistical-computing (the discussion includes what to use for large data processing)
- Trimming a huge (3.5 GB) csv file to read into R
- A blog post of mine showing how to estimate the RAM usage of a dataset. Note that this assumes the data will be stored in a matrix or array, and is just one datatype.
- Log file processing with R
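
And to make the out-of-the-box route from the top of this answer concrete, a minimal read.csv.ffdf sketch (the filename and column name are assumptions):

    # read.csv.ffdf parses the file into ff's on-disk binary format,
    # so only small sections are mapped into RAM at any one time.
    library(ff)
    logs <- read.csv.ffdf(file = "access.log.csv", header = TRUE)
    nrow(logs)             # dimensions known without loading the data
    table(logs$status[])   # '[]' materializes a single column into RAM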

In regard to choosing R or some other tool, I'd say if it's good enough for Google, it's good enough for me ;).
