Predict memory usage in R
I have downloaded a huge file (~300 MB) from the UCI Machine Learning Dataset library.
Is there a way to predict the memory required to load the dataset, before loading it into R memory?
I have Googled a lot, but all I could find is how to calculate memory usage with Rprofiler and several other packages, and only after loading the objects into R.
Based on the "R Programming" Coursera course, you can calculate the approximate memory usage from the number of rows and columns in the data (you can get that information from the codebook/meta file):

memory required = no. of columns * no. of rows * 8 bytes/numeric

So, for example, if you have 1,500,000 rows and 120 columns, you will need more than 1.34 GB of spare memory.

You can also apply the same approach to other types of data, paying attention to the number of bytes used to store the different data types.
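The arithmetic above can be sketched directly in R (the row and column counts are the example values from this answer):

```r
# Rough memory estimate for a purely numeric table:
# rows * cols * 8 bytes per double, converted to GB.
rows <- 1500000
cols <- 120
bytes <- rows * cols * 8
bytes / 2^30   # ~1.34 GB
```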
If your data is stored in a CSV file, you could first read in a subset of the file and calculate its memory usage in bytes with the object.size function. Then, you could compute the total number of lines in the file with the wc command-line utility and use the line count to scale the memory usage of your subset to get an estimate of the total usage:
top.size <- object.size(read.csv("simulations.csv", nrows = 1000))
lines <- as.numeric(gsub("[^0-9]", "", system("wc -l simulations.csv", intern = TRUE)))
size.estimate <- lines / 1000 * top.size
Presumably there is some object overhead, so I would expect size.estimate to be an overestimate of the total memory usage when you load the whole CSV file; this effect will be diminished if you use more lines to compute top.size. Of course, this approach could be inaccurate if the first 1000 lines of your file are not representative of the overall file contents.
R has the function object.size(), which provides an estimate of the memory being used to store an R object. You can use it like this:
predict_data_size <- function(numeric_size, number_type = "numeric") {
  if (number_type == "integer") {
    byte_per_number <- 4
  } else if (number_type == "numeric") {
    byte_per_number <- 8  # 8 bytes per double-precision number
  } else {
    stop(sprintf("Unknown number_type: %s", number_type))
  }
  estimate_size_in_bytes <- numeric_size * byte_per_number
  class(estimate_size_in_bytes) <- "object_size"
  print(estimate_size_in_bytes, units = "auto")
}
# Example
# Matrix (rows=2000000, cols=100)
predict_data_size(2000000*100, "numeric") # 1.5 Gb
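As a quick sanity check of the 8-bytes-per-numeric rule, object.size() on a real matrix should come out only slightly above what the formula predicts (the sizes here are illustrative, not from the question):

```r
# Compare the formula against object.size() on an actual numeric matrix.
m <- matrix(0, nrow = 1000, ncol = 100)
as.numeric(object.size(m))  # 800000 bytes of data plus a small fixed overhead
1000 * 100 * 8              # 800000 bytes from the formula
```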