
Efficiency of operations on R data structures

I'm wondering if there's any documentation about the efficiency of operations in R, specifically those related to data manipulation.

For example:

  • I imagine it's efficient to add columns to a data frame, because I'm guessing you're just adding an element to a linked list.
  • I imagine adding rows is slower, because vectors are held in arrays at the C level and you have to allocate a new array of length n+1 and copy all the elements over. (A quick timing sketch of both guesses follows below.)
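One way to put these guesses on firmer ground is to time both operations directly; a rough sketch (the sizes are arbitrary and timings will vary):

# growing a data frame by columns vs. by rows (illustrative sizes)
df <- data.frame(x = numeric(1e5))
system.time(for (i in 1:100) df[[paste0("col", i)]] <- 0)    # append 100 columns

df2 <- data.frame(x = numeric(100))
system.time(for (i in 1:1000) df2 <- rbind(df2, df2[1, ]))   # append 1000 rows; reallocates each time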

The developers probably don't want to tie themselves to a particular implementation, but it would be nice to have something more solid than guesses to go on.

Also, I know the main R performance hint is to use vectorized operations whenever possible, as opposed to loops.

  • What about the various flavors of apply?
  • Are those just hidden loops?
  • What about matrices vs. data frames?

Data IO was one of the features I looked into before I committed to learning R. For better or worse, here are my observations and solutions/palliatives on these issues:

1. That R doesn't handle big data (> 2 GB?). To me this is a misnomer. By default, the common data input functions load your data into RAM. Not to be glib, but to me this is a feature, not a bug--anytime my data will fit in my available RAM, that's where I want it. Likewise, one of SQLite's most popular features is the in-memory option--the user has the easy option of loading the entire DB into RAM. If your data won't fit in memory, then R makes it astonishingly easy to persist it: via connections to the common RDBMS systems (RODBC, RSQLite, RMySQL, etc.), via no-frills options like the filehash package, and via packages that implement current technology/practices (for instance, I can recommend ff). In other words, the R developers have chosen a sensible (and probably optimal) default, from which it is very easy to opt out.
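As a minimal sketch of that opt-out route, persistence through RSQLite might look like this (the table and file names are placeholders):

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "bigdata.sqlite")     # on-disk database file
dbWriteTable(con, "mytable", mtcars)             # persist a data frame
part <- dbGetQuery(con, "SELECT * FROM mytable WHERE mpg > 20")   # pull back only what you need
dbDisconnect(con)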

2. The performance of read.table (read.csv, read.delim, et al.), the most common means of getting data into R, can be improved 5x (and often much more, in my experience) just by opting out of a few of read.table's default arguments--the ones with the greatest effect on performance are mentioned in R's help (?read.table). Briefly, the R developers tell us that if you provide values for the parameters 'colClasses', 'nrows', 'sep', and 'comment.char' (in particular, pass in '' for 'comment.char' if you know your file begins with headers or data on line 1), you'll see a significant performance gain. I've found that to be true.

Here are the snippets I use for those parameters:

To get the number of rows in your data file (supply this snippet as the value of the 'nrows' parameter in your call to read.table):

# shell out to wc -l, then strip the non-digits from its output
# (note: this assumes the file name itself contains no digits)
as.numeric(gsub("[^0-9]+", "", system(paste("wc -l ", file_name, sep=""), intern=TRUE)))

To get the classes for each column:

# read only the first few rows, then record the class of each column
function(fname){ sapply(read.table(fname, header=TRUE, nrows=5), class) }

Note: You can't pass this snippet in as an argument; you have to call it first, then pass in the value returned. In other words, call the function, bind the returned value to a variable, and then pass in the variable as the value of the 'colClasses' parameter in your call to read.table.
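Putting the pieces together, a full call might look like this (the file name, separator, and header layout are assumptions):

fname <- "mydata.txt"
n   <- as.numeric(gsub("[^0-9]+", "", system(paste("wc -l ", fname, sep=""), intern=TRUE)))
cls <- sapply(read.table(fname, header=TRUE, nrows=5), class)
df  <- read.table(fname, header=TRUE, sep="\t", nrows=n, colClasses=cls, comment.char="")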

3. Using scan. With only a little more hassle, you can do better than that (optimizing read.table) by using 'scan' instead of 'read.table' ('read.table' is actually just a wrapper around 'scan'). Once again, this is very easy to do. I use 'scan' to input each column individually, then build my data.frame inside R, i.e., df = data.frame(cbind(col1, col2, ...)).
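A minimal sketch of that pattern, assuming a comma-separated file with one header line and known column types:

cols <- scan("mydata.csv", sep=",", skip=1,      # skip the header line
             what=list(id=integer(), x=double(), label=character()))
df <- data.frame(cols)                           # assemble the named columns into a data frame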

4. Use R's containers for persistence in place of ordinary file formats (e.g., 'txt', 'csv'). R's native data file format, '.RData', is a binary format that is a little smaller than a compressed ('.gz') txt data file. You create these files using save(); you load them back into the R namespace with load(). The difference in load times compared with 'read.table' is dramatic. For instance, with a 25 MB file (uncompressed size):

system.time(read.table("tdata01.txt.gz", sep=","))
=>  user  system elapsed 
    6.173   0.245   6.450 

system.time(load("tdata01.RData"))
=>  user  system elapsed 
    0.912   0.006   0.912   
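Creating the '.RData' file in the first place is a one-liner (the object and file names here are just examples):

save(df, file="tdata01.RData")   # serialize df to a binary file
load("tdata01.RData")            # restores an object named df into the workspace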

5. Paying attention to data types can often give you a performance boost and reduce your memory footprint. This point is probably more useful in getting data out of R. The key point to keep in mind here is that, by default, numbers in R expressions are interpreted as double-precision floating point; e.g., typeof(5) returns "double". Compare the object size of a reasonably sized array of each type and you can see the significance (use object.size()). So coerce to integer when you can.
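For instance (sizes are approximate and platform-dependent):

object.size(rep(5,  1e6))   # doubles:  ~8 MB
object.size(rep(5L, 1e6))   # integers: ~4 MB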

Finally, the 'apply' family of functions (among others) are not "hidden loops" or loop wrappers. They are loops implemented in C--a big difference performance-wise. [Edit: AWB has correctly pointed out that while 'sapply', 'tapply', and 'mapply' are implemented in C, 'apply' is simply a wrapper function.]

These things do pop up on the lists, in particular on r-devel. One fairly well-established nugget is that, e.g., matrix operations tend to be faster than data.frame operations. Then there are add-on packages that do well--Matt's data.table package is pretty fast, and Jeff has gotten xts indexing to be quick.

But it "all depends" -- so you are usually best adviced to profile on your particular code . 但它“完全取决于” - 所以你通常最好建议你的特定代码 R has plenty of profiling support, so you should use it. R有很多分析支持,所以你应该使用它。 My Intro to HPC with R tutorials have a number of profiling examples. 我的带有R教程的HPC简介有很多分析示例。

I will try to come back and provide more detail. If you have any question about the efficiency of one operation over another, you would do best to profile your own code (as Dirk suggests). The system.time() function is the easiest way to do this, although there are many more advanced utilities (e.g., Rprof, as documented here).
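A bare-bones Rprof session looks like this (the profiled expression is arbitrary):

Rprof("profile.out")                      # start writing profiling samples
x <- replicate(100, sort(runif(1e4)))     # ...the code you want to measure...
Rprof(NULL)                               # stop profiling
summaryRprof("profile.out")               # tabulate time spent per function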

A quick response for the second part of your question:

What about the various flavors of apply? Are those just hidden loops?

For the most part, yes, the apply functions are just loops and can be slower than for statements. Their chief benefit is clearer code. The main exception that I have found is lapply, which can be faster because it is coded in C directly.
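A quick way to check this on your own machine (a sketch; timings vary by machine and task):

x <- as.list(runif(1e5))
system.time({                             # explicit for loop
  out <- vector("list", length(x))
  for (i in seq_along(x)) out[[i]] <- sqrt(x[[i]])
})
system.time(out2 <- lapply(x, sqrt))      # lapply, with the looping done in C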

And what about matrices vs. data frames?

Matrices are more efficient than data frames because they require less memory for storage. This is because data frames require additional attribute data. From An Introduction to R:

A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes.
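You can see the overhead directly (sizes are approximate):

m <- matrix(runif(1e6), ncol=10)
d <- as.data.frame(m)
object.size(m)   # ~8 MB: the doubles plus a dim attribute
object.size(d)   # slightly larger: adds names, row names, and a class attribute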
