Searching a matrix for only certain records

Question

Let me start by saying I am rather new to R and generally consider myself a novice programmer...so don't assume I know what I'm doing :)

I have a large matrix, approximately 300,000 x 14. It's essentially a 20-year dataset of 15-minute data. However, I only need the rows where the column I've named REC.TYPE contains the string "SAO " or "FL-15".

My horribly inefficient solution was to search the matrix row by row, test the REC.TYPE column, and delete the row if it did not match my criteria. Essentially...

   j <- 1
   for (i in 1:nrow(dataset)) {
      # drop row j unless its record type is one of the two wanted values
      if (dataset$REC.TYPE[j] != "SAO  " && dataset$REC.TYPE[j] != "FL-15") {
         dataset <- dataset[-j, ]
      } else {
         j <- j + 1
      }
   }

After watching my code get through only about 10% of the matrix in an hour, slowing with every row, I figure there must be a more efficient way of pulling out only the records I need...especially since I need to repeat this for another 8 datasets.

Can anyone point me in the right direction?

Answer 1

You want regular expressions. They are case sensitive (as demonstrated below).

x <- c("ABC", "omgSAOinside", "TRALAsaoLA", "tumtiFL-15", "fl-15", "SAOFL-15")
grepl("SAO|FL-15", x)
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

In your case, I would do

subsao <- grepl("SAO", x = dataset$REC.TYPE)
subfl <- grepl("FL-15", x = dataset$REC.TYPE)
# mysubset <- subsao & subfl # TRUE only if SAO and FL-15 both occur in the same row
mysubset <- subsao | subfl # TRUE if either occurs in the row
dataset[mysubset, ]
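The two grepl calls can also be folded into one pattern. Here is a minimal sketch on a made-up data frame (toy and its values are invented for illustration, not taken from the real dataset). Anchoring with ^ limits matches to values that start with one of the codes, which may or may not be what you want:

```r
# Toy data standing in for the real dataset (values are invented)
toy <- data.frame(
  REC.TYPE = c("SAO  ", "FL-15", "XYZ  ", "SAO  ", "FM-12"),
  TEMP     = c(10.1, 11.3, 9.8, 12.0, 8.7),
  stringsAsFactors = FALSE
)

# One vectorised call: TRUE where REC.TYPE starts with SAO or FL-15
keep <- grepl("^(SAO|FL-15)", toy$REC.TYPE)

# Keep only the matching rows, all columns
kept <- toy[keep, ]
```

Dropping the ^ gives the same "contains" behaviour as the two separate calls above.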

Answer 2

I couldn't tell from the code you posted, but if your data is already in a data.frame, you can do this directly. If not, first run dataset <- data.frame(dataset).

From there:

dataset[dataset$REC.TYPE == "SAO  " | dataset$REC.TYPE == "FL-15", ]

should return what you're looking for. For loops that rebuild a data frame one row at a time are horribly inefficient in R. Once you've read through the R tutorial, The R Inferno will tell you how to avoid some common pitfalls.

This particular line filters the data frame, returning only the rows that match the criteria. You can type ?"[" into your R interpreter for more information.
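To make the filtering step concrete, here is a small sketch on invented data (toy, VALUE, and the values are assumptions for illustration only). The comparison produces one TRUE/FALSE per row, and that logical vector selects the rows:

```r
# Toy data standing in for the real dataset (values are invented)
toy <- data.frame(
  REC.TYPE = c("SAO  ", "FL-15", "XYZ  "),
  VALUE    = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# The comparison is evaluated for every row at once, giving a logical vector
mask <- toy$REC.TYPE == "SAO  " | toy$REC.TYPE == "FL-15"

# Rows where mask is TRUE are kept; the empty slot after the comma keeps all columns
result <- toy[mask, ]
```

Note that == matches the whole string exactly, so any trailing spaces in the data have to appear in the comparison value as well.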

Answer 3

As other posters have said, repeating the subset [ operation row by row is slow. Functions that operate on the entire vector at once are preferable.

I assume that both of your criteria apply to REC.TYPE. My solution uses the %in% operator:

dataset <- dataset[dataset$REC.TYPE %in% c("SAO","FL-15"),]
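One caveat: %in% compares whole strings exactly, so if the stored values carry trailing spaces (as the "SAO  " in the question suggests), "SAO" alone would not match. A hedged sketch that trims whitespace first and wraps the filter in a function, so it can be reused on the other datasets. The names keep_rec_types and toy are hypothetical, and the data is invented:

```r
# Hypothetical helper: keep rows whose REC.TYPE, ignoring surrounding
# whitespace, is one of the wanted codes
keep_rec_types <- function(df, wanted = c("SAO", "FL-15")) {
  df[trimws(df$REC.TYPE) %in% wanted, , drop = FALSE]
}

# Toy data standing in for one real dataset (values are invented)
toy <- data.frame(
  REC.TYPE = c("SAO  ", "FL-15", "XYZ  "),
  VALUE    = 1:3,
  stringsAsFactors = FALSE
)

filtered <- keep_rec_types(toy)

# The same function can be mapped over a list of datasets
all_filtered <- lapply(list(a = toy, b = toy), keep_rec_types)
```

trimws() is part of base R from version 3.2.0 onward; on older versions something like gsub("^\\s+|\\s+$", "", x) does the same job.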
