
Efficient way to handle big data in R

I have a huge CSV file, 1.37 GB, and when I run my glm in R it crashes because there is not enough memory allocated. You know, the usual error.

Is there no alternative to the ff and bigmemory packages? They do not seem to work well for me, because my columns are a mix of integers and characters, and with both packages I apparently have to specify each column's type as either character or integer.

It is almost 2018 and we are about to put people on Mars; is there no simple "read.csv.xxl" function we can use?

I would first point out that just because your data takes up 1.37 GB does not mean that 1.37 GB of memory is enough to do all of your calculations with glm. Most likely, some intermediate calculation will spike to at least a multiple of 1.37 GB.
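
To get a feel for why that happens, here is a minimal illustrative sketch (the data here is made up): glm() internally builds a numeric design matrix via model.matrix(), and character/factor columns get expanded into dummy-variable columns, so the working objects can be much larger than the raw data frame.

```r
set.seed(1)
df <- data.frame(
  y  = rbinom(1e5, 1, 0.5),
  x1 = rnorm(1e5),
  x2 = sample(letters, 1e5, replace = TRUE)   # one character column
)
print(object.size(df), units = "MB")          # size of the raw data frame
mm <- model.matrix(y ~ x1 + x2, data = df)    # roughly what glm() creates internally
print(object.size(mm), units = "MB")          # noticeably larger
```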

For the second part, a practical workaround would be to take a reasonable subsample of your 1.37 GB data set. Do you really need all the data points in the original data set to build your model, or would, say, a 10% subsample also give you a statistically significant model? If you reduce the size of the data set, you solve the memory problem in R. A quick sketch of this approach follows.
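
A minimal sketch of the subsampling approach, assuming a file "data.csv" with a header and a binary response column named "y" (both are placeholders to adjust to your data); data.table::fread is just one fast way to read a large CSV while letting the mixed column types be detected automatically:

```r
library(data.table)                     # fread() reads large CSVs quickly and
                                        # detects integer/character columns itself
dt <- fread("data.csv")                 # "data.csv" is a placeholder path

set.seed(42)                            # make the subsample reproducible
idx <- sample(nrow(dt), size = round(0.10 * nrow(dt)))   # ~10% of the rows
sub <- dt[idx]

# Fit on the subsample; "y" and binomial() are placeholders for your model
fit <- glm(y ~ ., data = sub, family = binomial())
summary(fit)
```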

Keep in mind that R works entirely in memory, so once you exceed the available memory you may be out of luck.
