
R: subset from a matrix only those rows with a certain value in a certain column

I have a large matrix "dt" of emergency department visits over 2 months for a set of diagnosis codes. The columns are "age", "sex", "date", "county", "zip", "subjectid", "position", "diag", and "dt"; the dimensions are 872344 by 9.

I want to subset this matrix and make a new matrix containing only those rows (with all columns) for which the "diag" column has a number between 800 and 849.

I have been messing with building a loop and using "which" or "if.else", but I'm running into a mental block. It seems it would be easier if it was just ONE diag code that I wanted to pull out, but the series of 50 codes complicates things... pointing to a loop? Does anyone have ideas for how to subset based on finding certain values?

Here's my start (it didn't work):

dta = dt
b = 800:849
for (i in 1:length(b)) {

}

dta = dt[dt[, 8] >= 800 & dt[, 8] <= 849, ]

ETA: Are you sure this is a matrix and not a data.frame? If it is a data.frame, you can do:

dta = dt[dt$diag >= 800 & dt$diag <= 849, ]

Given your column names, I suspect your dt is a data.frame, not a matrix; you can confirm this by running is.data.frame(dt).
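To make the matrix vs. data.frame distinction concrete, here is a minimal sketch on made-up data. The names df_toy and m_toy and all values are invented for illustration and are not from the question. It also hints at why a table mixing text columns such as "sex" with numeric codes is unlikely to be a numeric matrix: a matrix holds a single type, so converting such a table coerces every column to character.

```r
# Toy data standing in for the real visits table (values are invented).
df_toy <- data.frame(
  sex  = c("F", "M", "F", "M"),
  diag = c(790, 805, 849, 900),
  stringsAsFactors = FALSE
)

is.data.frame(df_toy)   # TRUE
is.matrix(df_toy)       # FALSE

# A data.frame keeps diag numeric, so a range comparison works directly.
hits_df <- df_toy[df_toy$diag >= 800 & df_toy$diag <= 849, ]

# Converting to a matrix coerces every column to character.
m_toy <- as.matrix(df_toy)
typeof(m_toy)           # "character"

# With a character matrix, convert the column back to numeric before comparing.
diag_num <- as.numeric(m_toy[, "diag"])
hits_m <- m_toy[diag_num >= 800 & diag_num <= 849, , drop = FALSE]
```

If typeof() on the real object reports "character", comparing the column against 800 directly would compare strings rather than numbers, so the explicit as.numeric() step matters.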

If that is the case, an easy way to filter your data is to use the subset function as follows:

dta <- subset(dt, diag >= 800 & diag <= 849)
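One practical difference between subset() and bracket indexing is how missing values are handled. The sketch below uses an invented data.frame (df_toy, with one NA in diag) to show it; whether the real data contains NA codes is an assumption.

```r
# Invented data with one missing diagnosis code.
df_toy <- data.frame(id = 1:5, diag = c(790, 805, NA, 849, 900))

# subset() evaluates the condition inside the data.frame and
# drops rows where the condition is NA.
by_subset <- subset(df_toy, diag >= 800 & diag <= 849)

# Bracket indexing with a logical condition returns an all-NA row
# for every NA in the condition.
by_bracket <- df_toy[df_toy$diag >= 800 & df_toy$diag <= 849, ]

# Wrapping the condition in which() makes bracket indexing skip NA rows too.
by_which <- df_toy[which(df_toy$diag >= 800 & df_toy$diag <= 849), ]

nrow(by_subset)   # rows with id 2 and 4
nrow(by_bracket)  # one extra row made of NAs
```

If the diag column can contain NA, subset() or which() avoids carrying phantom NA rows into the result.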

In addition to the excellent answers above, there is also the filter function in the dplyr package:

library(dplyr)
filter(dt, diag >= 800 & diag <= 849)

filter() is similar to subset() except that you can give it any number of filtering conditions, which are joined together with & (not &&, which is easy to type accidentally!). The dplyr package also has other nice data manipulation functions that are worth a look.
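The & versus && remark applies to base R as well. A minimal sketch with an invented vector (codes) shows the difference; the exact behaviour of && on a longer vector depends on the R version, so the call is wrapped in tryCatch() here.

```r
codes <- c(790, 805, 849, 900)   # invented diagnosis codes

# `&` compares element by element and returns one TRUE/FALSE per row,
# which is what row filtering needs.
keep <- codes >= 800 & codes <= 849

# `&&` expects single values. On a longer vector it errors, warns, or
# silently uses only the first element, depending on the R version.
scalar <- tryCatch(
  codes >= 800 && codes <= 849,
  error   = function(e) "error",
  warning = function(w) "warning"
)

length(keep)    # one value per element of codes
length(scalar)  # a single value, never a per-row mask
```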

I would not convert the matrix to a data.frame, as a data.frame is slower and incurs greater memory usage, while matrix operations are generally faster anyway.

In addition to David's answer using column number indexing:

dta = dt[dt[,8] >= 800 & dt[,8] <= 849,]

There is also the form using column name indexing with a matrix:

dta = dt[dt[,'diag'] >= 800 & dt[,'diag'] <= 849,]
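Since the question mentions a series of 50 codes, the same rows can also be selected with %in%. The sketch below uses an invented numeric matrix (m_toy) and assumes the codes are whole numbers; a value such as 800.5 would pass the range test but not the %in% test. It also shows drop = FALSE, which keeps the result a matrix when only one row matches.

```r
# Invented numeric matrix with named columns.
m_toy <- cbind(id = 1:5, diag = c(790, 805, 820, 849, 900))

# Range comparison and set membership select the same rows here.
by_range <- m_toy[m_toy[, "diag"] >= 800 & m_toy[, "diag"] <= 849, ]
by_set   <- m_toy[m_toy[, "diag"] %in% 800:849, ]
identical(by_range, by_set)

# When a single row matches, the result collapses to a plain vector...
one_row <- m_toy[m_toy[, "diag"] %in% 805, ]
is.matrix(one_row)

# ...unless drop = FALSE is supplied, which keeps the 1 x 2 matrix shape.
kept <- m_toy[m_toy[, "diag"] %in% 805, , drop = FALSE]
dim(kept)
```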

Matrix indexing is also faster, as shown by the microbenchmark package for a matrix and a data.frame holding the same data (12 columns and 13,241 rows), run with R compiled with Intel MKL optimization:

microbenchmark::microbenchmark(
     test.matrix     = mt[mt[,3] %in% 5:10 & mt[,5] == 1,],
     test.data.frame = df[df[,3] %in% 5:10 & df[,5] == 1,],
     times = 1000
     )

Unit: microseconds
            expr      min       lq     mean  median        uq        max neval
 test.matrix      885.732  938.386 1154.898  943.74  952.4415 138215.318  1000
 test.data.frame 1176.218 1245.826 1363.379 1258.32 1286.4320 3392.556    1000

When the matrices get very large, this difference becomes tangible. On my machine, matrix indexing also outperforms data.table.
