
R Memory Management For Large Datasets

I am quite new to R and I am currently working with a 2 GB dataset. I have stored this dataset in a workspace, and whenever I load it into R it consumes more than 90% of main memory, so operations like filtering, processing, and analyzing the data become difficult and very time-consuming.

I am mainly using the dplyr package to filter the main dataset and form subsets from it, as per dynamic user inputs, but fetching the data takes a lot of time. I have also tried the bigmemory package. While it solves the memory-consumption issue, dplyr functions cannot be used on big.matrix objects.

So can anyone please let me know how I can filter large datasets quickly with optimal memory consumption?

Thanks!
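For context, the kind of dynamic filtering I mean looks roughly like this minimal dplyr sketch (the column names and the user input are illustrative, not from my real dataset):

```r
library(dplyr)

# Toy stand-in for the large dataset
df <- data.frame(region = c("east", "west", "east"),
                 value  = c(10, 20, 30))

user_region <- "east"   # stand-in for a dynamic user input

# Filter the data frame according to the user's selection
subset_df <- df %>% filter(region == user_region, value > 15)
```

On the real 2 GB dataset, each such filter call is what becomes slow.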

One approach might be to truncate the table to only the columns you need while reading it in.

Example

sample1 <- read.csv("https://www.sample-videos.com/csv/Sample-Spreadsheet-500000-rows.csv",
                    header=TRUE)

sample2 <- read.csv("https://www.sample-videos.com/csv/Sample-Spreadsheet-500000-rows.csv",
                    header=TRUE)[, c(1, 3, 5)]
> object.size(sample1)
3272064 bytes
> object.size(sample2)
1073240 bytes

To know which columns to choose, consult the column names using

var.names <- names(read.csv("https://www.sample-videos.com/csv/Sample-Spreadsheet-500000-rows.csv",
                    header=TRUE))
> var.names
[1] "Eldon.Base.for.stackable.storage.shelf..platinum"
[2] "Muhammed.MacIntyre"                              
[3] "X3"                                              
[4] "X.213.25"                                        
[5] "X38.94"                                          
[6] "X35"                                             
[7] "Nunavut"                                         
[8] "Storage...Organization"                          
[9] "X0.8" 
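A variation on the same idea, sketched below on a small local CSV: passing `colClasses = "NULL"` for a column makes `read.csv` drop it at parse time, so the unwanted columns are never allocated at all (whereas `[, c(1, 3, 5)]` first reads everything and then discards). The file and column positions here are illustrative:

```r
# Write a small example CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c,d,e", "1,x,2,y,3", "4,z,5,w,6"), tmp)

n_cols  <- length(names(read.csv(tmp, nrows = 1)))  # peek at header only
classes <- rep("NULL", n_cols)                      # drop everything...
classes[c(1, 3, 5)] <- NA                           # ...except columns 1, 3, 5
                                                    # (NA = guess type as usual)
slim <- read.csv(tmp, colClasses = classes)

names(slim)  # only the three kept columns remain
```

On a multi-gigabyte file this avoids ever holding the full table in memory.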


I have used the data.table and DT packages. Using data.table, I created a data.table object that can be accessed and analysed quickly. And using the renderDataTable function from the DT package, I was able to render the table on the Shiny dashboard quickly. Thanks all for your help!!
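For anyone landing here later, a minimal sketch of the data.table side of this (the file and column names are illustrative; it assumes the data.table package is installed):

```r
library(data.table)

# Small example CSV standing in for the large file
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,x,10", "2,y,20", "3,z,30"), tmp)

# fread is much faster than read.csv on large files, and select=
# loads only the named columns into memory
dt <- fread(tmp, select = c("a", "c"))

# data.table's bracket syntax filters without copying the whole table
subset_dt <- dt[a > 1 & c < 30]
```

Since a data.table is also a data.frame, dplyr verbs keep working on it, so the existing filtering code did not need to change much.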
