
Predict memory usage in R

I have downloaded a huge file (~300 MB) from the UCI Machine Learning Dataset library.

Is there a way to predict the memory required to load the dataset, before loading it into R memory?

I have Googled a lot, but everywhere all I could find was how to calculate memory usage with the R profiler and several other packages, and only after loading the objects into R.

Based on the "R Programming" Coursera course, you can calculate the approximate memory usage using the number of rows and columns in the data. You can get that info from the codebook/meta file:

memory required = no. of columns * no. of rows * 8 bytes/numeric

So, for example, if you have 1,500,000 rows and 120 columns, you will need more than 1.34 GB of spare memory.
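As a quick sanity check of that arithmetic in an R session (counting 2^30 bytes per GB):

# 1,500,000 rows x 120 numeric columns, 8 bytes each
1500000 * 120 * 8 / 2^30  # ~1.34 GB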

You can also apply the same approach to other types of data, paying attention to the number of bytes used to store the different data types.
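As a rough illustration (a minimal sketch; object.size includes a small fixed overhead on top of the raw element storage), the per-element sizes of the common atomic types can be checked directly:

# Approximate per-element storage for common atomic vector types
object.size(numeric(1e6))  # ~8 MB: 8 bytes per double
object.size(integer(1e6))  # ~4 MB: 4 bytes per integer
object.size(logical(1e6))  # ~4 MB: logicals are stored as 4-byte integers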

If your data is stored in a CSV file, you could first read in a subset of the file and calculate its memory usage in bytes with the object.size function. Then, you could compute the total number of lines in the file with the wc command-line utility and use the line count to scale the memory usage of your subset into an estimate of the total usage:

top.size <- object.size(read.csv("simulations.csv", nrows = 1000))  # size of the first 1,000 rows
lines <- as.numeric(gsub("[^0-9]", "", system("wc -l simulations.csv", intern = TRUE)))  # total line count
size.estimate <- lines / 1000 * top.size  # scale up to the full file

Presumably there's some object overhead, so I would expect size.estimate to overestimate the total memory usage when you load the whole CSV file; this effect will be diminished if you use more lines to compute top.size. Of course, this approach could be inaccurate if the first 1000 lines of your file are not representative of the overall file contents.
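One way to gauge that overhead (a sketch, reusing the same simulations.csv file) is to compare per-line estimates computed from subsets of different sizes; the larger subset dilutes the fixed per-object overhead:

bytes_per_line <- function(n) {
  as.numeric(object.size(read.csv("simulations.csv", nrows = n))) / n
}
bytes_per_line(1000)   # per-line estimate from the first 1,000 lines
bytes_per_line(10000)  # estimate from 10,000 lines; closer to the true average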

R has the function object.size(), which provides an estimate of the memory being used to store an R object. You can use it like this:

predict_data_size <- function(numeric_size, number_type = "numeric") {
  if (number_type == "integer") {
    byte_per_number <- 4
  } else if (number_type == "numeric") {
    byte_per_number <- 8  # 8 bytes per double-precision number
  } else {
    stop(sprintf("Unknown number_type: %s", number_type))
  }
  estimate_size_in_bytes <- numeric_size * byte_per_number
  class(estimate_size_in_bytes) <- "object_size"  # reuse the object_size print method
  print(estimate_size_in_bytes, units = "auto")
}
# Example
# Matrix (rows=2000000, cols=100)
predict_data_size(2000000*100, "numeric") # 1.5 Gb
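The same function covers the integer case; for the matrix above stored as integers (4 bytes each), it would report roughly half the size:

predict_data_size(2000000*100, "integer") # ~762.9 Mb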
