简体   繁体   中英

R: subset from a matrix only those rows with a certain value in a certain column

I have a large matrix "dt" of emergency department visits over 2 months for a set of diagnosis codes. The columns are "age", "sex", "date", "county", "zip", "subjectid", "position", "diag", and "dt"; the dimensions are 872344 by 9.

I want to subset from this matrix and make a new matrix containing only those rows for which the "diag" column has a number between 800 and 849 (all columns).

I have been messing with building a loop and using "which" or "if.else" but I'm running into a mental block. It seems it would be easier if it was just ONE diag code that I wanted to pull out, but the series of 50 codes complicates things... pointing to a loop? Does anyone have ideas for how to subset based on finding certain values?

Here's my start (it didn't work):

dta = dt
b = 800:849
for (i in 1:length(b)) {

}
dta = dt[dt[, 8] >= 800 & dt[, 8] <= 849, ]

ETA: Are you sure this is a matrix and not a data.frame? If it is a data.frame, you can do:

dta = dt[dt$diag >= 800 & dt$diag <= 849, ]

Given your column names, I suspect your dt is a data.frame, not a matrix; something you can confirm by running is.data.frame(dt) .

If it is the case, an easy way to filter your data is to use the subset function as follows:

dta <- subset(dt, diag >= 800 & diag <= 849)

In addition to excellent answers above, I can add filter function in dpylr package

filter(dt,diag>=800 & diag <= 849)

filter() is similar to subset() except that you can give it any number of filtering conditions, which are joined together with & (not && which is easy to do accidentally!). dpylr package also has other nice data manipulating functions which you can have a look.

I would not convert the matrix() to a data.frame() as it is slower and incurs greater memory usage, while matrix() operations are generally faster anyway.

In addition to David's answer using column number indexing:

dta = dt[dt[,8] >= 800 & dt[,8] <= 849,]

There is also the form using column name indexing with a matrix:

dta = dt[dt[,'metric'] >= 800 & dt[,'metric'] <= 849,]

As shown by the microbenchmark package command for an identical matrix with 12 columns and 13,241 rows, run with R compiled with Intel MKL optimization:

microbenchmark::microbenchmark(
     test.matrix     = mt[mt[,3] %in% 5:10 & mt[,5] == 1,],
     test.data.frame = df[df[,3] %in% 5:10 & df[,5] == 1,],
     times = 1000
     )

Unit: microseconds
            expr      min       lq     mean  median        uq        max neval
 test.matrix      885.732  938.386 1154.898  943.74  952.4415 138215.318  1000
 test.data.frame 1176.218 1245.826 1363.379 1258.32 1286.4320 3392.556    1000

When the matrices get very large, this difference becomes tangible. On my machine, matrix indexing speeds outperform those of data.table as well.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM