R data frame: how to add more rows as a subset

Question

This question is similar to one that has been asked before, but I can't see how to apply it to my data.

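I have a data frame with 1875 rows. Each row has a field nbc, which is a large string. I then run a function on it (arbitrary, not relevant here) that returns certain substrings of nbc. Sometimes it returns 1 substring, sometimes 20. What I want to do is append this information to my data frame.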

So consider

+----+-------+-------------+
| id |  seq  |     nbc     |
+----+-------+-------------+
|  1 | atcgg | atgccttatac |
|  2 | tatgc | tataggctata |
+----+-------+-------------+

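First, applying the function to nbc gives me the following 2 substrings: atgc, tatac, which are the ones I'm interested in. I now want to add them to the data frame, like this: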

+----+-------+-------------+------------+
| id |  seq  |     nbc     | substrings |
+----+-------+-------------+------------+
|  1 | atcgg | atgccttatac | atgc       |
|  1 | atcgg | atgccttatac | tatac      |
|  2 | tatgc | tataggctata |            |
+----+-------+-------------+------------+

So the row is repeated once for each substring found.

Any ideas on how to do this efficiently? I only need pseudocode, since I will parallelise it with the foreach / parallel packages.

Answer 1

I would proceed along these lines (hard to test, since you didn't provide a reproducible example):

 # apply myfunc to each element of nbc
 substrings <- lapply(df$nbc, myfunc)
 # keep rows with no match (as in the desired output) by substituting ""
 substrings <- lapply(substrings, function(s) if (length(s) > 0) s else "")
 # get the length of each element of substrings
 lengths <- vapply(substrings, length, 1L)
 # repeat each row of the data.frame once per substring returned by myfunc
 df <- df[rep(seq_len(nrow(df)), lengths), ]
 # add the substrings column
 df$substrings <- unlist(substrings)

Of course, this is untested, but it should probably work.
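The steps above can be tried on the sample data from the question. This is only a minimal sketch: myfunc here is a hypothetical stand-in for the asker's function (it extracts atgc or tatac with base R's regmatches / gregexpr), and rows with no match are given an empty string so they are kept, as in the desired output.

```r
# sample data from the question
df <- data.frame(
  id = 1:2,
  seq = c("atcgg", "tatgc"),
  nbc = c("atgccttatac", "tataggctata"),
  stringsAsFactors = FALSE)

# hypothetical stand-in for the asker's function (an assumption, not the real one)
myfunc <- function(x) regmatches(x, gregexpr("atgc|tatac", x))[[1]]

# one character vector of matches per row
substrings <- lapply(df$nbc, myfunc)
# replace empty results with "" so that those rows are not dropped
substrings <- lapply(substrings, function(s) if (length(s) > 0) s else "")

# repeat each row once per substring, then attach the substrings
n <- vapply(substrings, length, 1L)
result <- df[rep(seq_len(nrow(df)), n), ]
result$substrings <- unlist(substrings)
rownames(result) <- NULL

print(result)
```

Without the padding step, any row for which myfunc returns nothing would be repeated zero times and so disappear from the result.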

Answer 2

If I understood your question correctly, and if you are willing to use data.table (at least as an intermediate step), you can do the following:

library(data.table)
library(stringr) 
##
foo <- function(x,y) {
  res <- unlist(str_extract_all(x,y))
  if (length(res)>0) {
    res
  } else {
    ""
  }
}
##
Dt <- data.table(Df)
##
R>  Dt[,list(substrings=foo(
    x=nbc,
    y="atgc|tatac")),
    by="id,seq,nbc"]
   id   seq         nbc substrings
1:  1 atcgg atgccttatac       atgc
2:  1 atcgg atgccttatac      tatac
3:  2 tatgc tataggctata

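This assumes the substrings you want are atgc or tatac (that part wasn't entirely clear to me). It is hard to do any rigorous testing on such a small data.frame / data.table, but this approach seems to work on the example object I created (below), extracting substrings of 3 or more digits from random sequences of digits and letters: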

m <- replicate(
  5,
  paste(
    sample(
      c(letters[1:10],0:9),
      20,
      replace=TRUE),
    collapse=""))
m <- c(m,paste(letters[1:20],collapse=""))
##
R>  m
[1] "7j166a6b1a30hg1e8j05" "d1h6f634386ag41309i9" "egf98f8g5f60be345g3e"
[4] "7140447bjb4gj78f313d" "h1j9bij94b9dj28ed72d" "abcdefghijklmnopqrst"
##
DF <- data.frame(
  id=1:6,
  seq=sample(LETTERS,6),
  nbc=m,
  stringsAsFactors=FALSE)
##
DT <- data.table(DF)
##
R>  DT[,list(sequences=foo(
    x=nbc,y="\\d{3,}")),
    by="id,seq,nbc"]
   id seq                  nbc sequences
1:  1   H 7j166a6b1a30hg1e8j05       166
2:  2   A d1h6f634386ag41309i9    634386
3:  2   A d1h6f634386ag41309i9     41309
4:  3   J egf98f8g5f60be345g3e       345
5:  4   G 7140447bjb4gj78f313d   7140447
6:  4   G 7140447bjb4gj78f313d       313
7:  5   C h1j9bij94b9dj28ed72d          
8:  6   L abcdefghijklmnopqrst

where the seq column is meaningless in the object above.

Data for the first example:

Df <- data.frame(
  id=1:2,
  seq=c("atcgg","tatgc"),
  nbc=c("atgccttatac","tataggctata"),
  stringsAsFactors=FALSE)
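Since the question mentions parallelising with the foreach / parallel packages, the extraction step could be spread over workers while the row expansion stays sequential. The following is a rough sketch under assumptions: extract_substrings is a hypothetical helper standing in for the asker's function, the two-worker cluster size is arbitrary, and base R's parallel::parLapply is used rather than foreach.

```r
library(parallel)

# data for the first example
Df <- data.frame(
  id = 1:2,
  seq = c("atcgg", "tatgc"),
  nbc = c("atgccttatac", "tataggctata"),
  stringsAsFactors = FALSE)

# hypothetical helper: returns the matches, or "" when there are none
extract_substrings <- function(x, pattern = "atgc|tatac") {
  res <- regmatches(x, gregexpr(pattern, x))[[1]]
  if (length(res) > 0) res else ""
}

# run the extraction on a small local cluster (size chosen arbitrarily)
cl <- makeCluster(2)
pieces <- parLapply(cl, Df$nbc, extract_substrings)
stopCluster(cl)

# expand the rows sequentially: one copy of each row per substring
n <- lengths(pieces)
out <- Df[rep(seq_len(nrow(Df)), n), ]
out$substrings <- unlist(pieces)
rownames(out) <- NULL

print(out)
```

For only 1875 rows the cost of starting the cluster may outweigh any gain, so it is worth timing against the plain lapply version first.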

2 answers

Answer 1
0 2015-01-27 20:57:18

Answer 2
0 accepted 2015-01-27 21:37:24
