
Searching a matrix for only certain records

Let me start by saying I am rather new to R and generally consider myself to be a novice programmer...so don't assume I know what I'm doing :)

I have a large matrix, approximately 300,000 x 14. It's essentially a 20-year dataset of 15-minute data. However, I only need the rows where the column I've named REC.TYPE contains the string "SAO " or "FL-15".

My horribly inefficient solution was to walk the matrix row by row, test the REC.TYPE column, and delete the row if it did not match my criteria. Roughly:

   j <- 1
   for (i in 1:nrow(dataset)) {
      if(dataset$REC.TYPE[j] != "SAO  " && dataset$REC.TYPE[j] != "FL-15") {
        dataset <- dataset[-j,]  }
      else {
        j <- j+1  }
   }

After watching my code get through only about 10% of the matrix in an hour and slowing with every row...I figure there must be a more efficient way of pulling out only the records I need...especially when I need to repeat this for another 8 datasets.

Can anyone point me in the right direction?

You want regular expressions. They are case sensitive (as demonstrated below).

x <- c("ABC", "omgSAOinside", "TRALAsaoLA", "tumtiFL-15", "fl-15", "SAOFL-15")
grepl("SAO|FL-15", x)
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

In your case, I would do

subsao <- grepl("SAO", x = dataset$REC.TYPE)
subfl <- grepl("FL-15", x = dataset$REC.TYPE)
#mysubset <- subsao & subfl # will return TRUE only if SAO & FL-15 occur in the same line
mysubset <- subsao | subfl # will return TRUE if either occurs in the same line
dataset[mysubset, ]
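To make the idea concrete, here is a minimal sketch on a toy data frame. The TEMP column and the sample values are made up for illustration; only REC.TYPE comes from the question. Because grepl matches substrings, padding such as "SAO  " is still caught, but so is any longer code that happens to contain "SAO".

```r
# Toy stand-in for the real data. TEMP and all values are hypothetical.
dataset <- data.frame(
  REC.TYPE = c("SAO  ", "FL-15", "AUTO ", "SAO  ", "FM-15"),
  TEMP     = c(10.1, 11.3, 9.8, 12.0, 8.7),
  stringsAsFactors = FALSE
)

# One vectorised pass over the whole column, no explicit loop.
keep <- grepl("SAO|FL-15", dataset$REC.TYPE)

# Keep only the rows flagged TRUE.
result <- dataset[keep, ]
print(result)
```

If substring matching is too loose for your record types, an exact comparison on the whole value is the safer choice.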

I couldn't tell from the code you posted, but if your data is already in a data.frame, you can do this directly. If not, first run dataset <- data.frame(dataset).

From there:

dataset[dataset$REC.TYPE == "SAO  " | dataset$REC.TYPE == "FL-15",]

should return what you're looking for. Shrinking an object inside a for loop, as your code does, is very slow in R because every dataset[-j,] builds a new copy of the data frame. Once you've read through the R tutorial, The R Inferno will tell you how to avoid some common pitfalls.

The way this particular line works is to filter the data frame by returning only the rows that match the criteria. You can type ?"[" into your R interpreter for more information.
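As a rough illustration of what happens inside the brackets, the comparison first produces a logical vector with one entry per row, and [ then keeps the rows where it is TRUE. The data frame df and its columns here are invented for the example. One thing to watch for: if the column contains NA, the comparison yields NA, and indexing with NA gives a row of NAs; wrapping the mask in which() drops those.

```r
# Hypothetical toy data; id and type are not from the original question.
df <- data.frame(
  id   = 1:5,
  type = c("a", "b", "a", NA, "c"),
  stringsAsFactors = FALSE
)

# The comparison is evaluated for every row at once.
mask <- df$type == "a" | df$type == "c"
print(mask)   # one TRUE/FALSE/NA per row

# Indexing with the raw mask keeps an all-NA row for the NA entry.
with_na <- df[mask, ]

# which() converts the mask to row numbers and silently drops NA.
without_na <- df[which(mask), ]
print(without_na)
```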

As other posters have said, repeating the subset [ operation is slow. Instead, functions that operate over the entire vector are preferable.

I assume that both of your criteria apply to REC.TYPE. My solution uses the %in% operator:

dataset <- dataset[dataset$REC.TYPE %in% c("SAO","FL-15"),]
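Two caveats, sketched below under some assumptions. First, %in% compares whole strings, so if the stored values are padded (the question's code compares against "SAO  " with trailing spaces), they need trimming first; trimws() is one way to do that. Second, since the same filter has to run on several datasets, it can be wrapped in a function and applied with lapply. The helper name filter_rec_type, the list datasets, and the VALUE column are all made up for this example.

```r
# Record types to keep, written without padding.
keep_types <- c("SAO", "FL-15")

# Hypothetical helper: trim whitespace, then keep exact matches only.
filter_rec_type <- function(d, types = keep_types) {
  d[trimws(d$REC.TYPE) %in% types, , drop = FALSE]
}

# Toy stand-ins for the real datasets; all values are invented.
datasets <- list(
  a = data.frame(REC.TYPE = c("SAO  ", "AUTO "), VALUE = c(1, 2),
                 stringsAsFactors = FALSE),
  b = data.frame(REC.TYPE = c("FL-15", "FM-15", "SAO  "), VALUE = c(3, 4, 5),
                 stringsAsFactors = FALSE)
)

# Apply the same filter to every dataset in one call.
filtered <- lapply(datasets, filter_rec_type)
print(filtered)
```

Unlike ==, %in% returns FALSE rather than NA for missing values, so no extra NA handling is needed here.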
