
R: add a count occurrence column to dataframe by counting the occurrence of a string in a CSV column from another dataframe

I have a dataframe df1:

df1 <- structure(list(Id = c(0, 1, 3, 4), Support = c(17, 15, 10, 18
), Genes = structure(c(3L, 1L, 4L, 2L), .Label = c("BMP2,TGFB1,BMP3,MAPK12,GDF11,MAPK13,CITED1", 
"CBLC,TGFA,MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4", "FOS,BCL2,PIK3CD,NFKBIA,TNFRSF10B", 
"MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD"), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))

and another dataframe df2:

df2 <- structure(list(V1 = structure(c(6L, 7L, 8L, 4L, 3L, 1L, 5L, 2L, 
9L), .Label = c("BCL2", "BMP3", "CBLC", "CDC23", "CITED1", "FOS", 
"MAPK13", "SPRY4", "TGFA"), class = "factor")), class = "data.frame", row.names = c(NA, 
-9L))

How can I create a new column in df1 by counting the occurrences of each string from df2 in the Genes column, to achieve the desired output below?

    Id    |    Support    |     Genes    |    Counts    |
---------------------------------------------------------
    0     |       17      |FOS,BCL2,...  |      2       |
    1     |       15      |BMP2,TGFB1,...|      3       |
    3     |       10      |MAPK12,YWHAE..|      1       |
    4     |       18      |CBLC,TGFA,... |      4       | 

There is probably a more elegant solution, but this does the job.

# For each row, split the comma-separated genes and count how many df2 genes appear
df1$Counts <- unlist(lapply(df1$Genes, function(x){
  xx <- unlist(strsplit(as.character(x),split = ","))
  sum(df2$V1 %in% xx)
}))

Which gives:

  Id Support                                      Genes Counts
1  0      17           FOS,BCL2,PIK3CD,NFKBIA,TNFRSF10B      2
2  1      15 BMP2,TGFB1,BMP3,MAPK12,GDF11,MAPK13,CITED1      3
3  3      10     MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD      2
4  4      18  CBLC,TGFA,MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4      4

(I think in your example above, Counts in the third row should be 2, not 1?)
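As a possible variation on the approach above, here is a minimal base R sketch that splits every row once with the vectorised strsplit() and then counts with vapply(). It assumes the columns are plain character vectors rather than factors, so the data is rebuilt here with stringsAsFactors = FALSE; the name gene_list is only illustrative. Note that it counts the tokens of each row that appear in df2$V1, which matches the code above as long as no gene is repeated within a row or within df2.

```r
# Self-contained copy of the example data, with character columns instead of factors
df1 <- data.frame(
  Id = c(0, 1, 3, 4),
  Support = c(17, 15, 10, 18),
  Genes = c("FOS,BCL2,PIK3CD,NFKBIA,TNFRSF10B",
            "BMP2,TGFB1,BMP3,MAPK12,GDF11,MAPK13,CITED1",
            "MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD",
            "CBLC,TGFA,MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  V1 = c("FOS", "MAPK13", "SPRY4", "CDC23", "CBLC",
         "BCL2", "CITED1", "BMP3", "TGFA"),
  stringsAsFactors = FALSE
)

# strsplit() is vectorised: one call returns a list with one character vector per row
gene_list <- strsplit(df1$Genes, ",", fixed = TRUE)

# For each row, count the tokens that are also present in df2$V1
df1$Counts <- vapply(gene_list, function(genes) sum(genes %in% df2$V1), integer(1))

print(df1)
```

vapply() is used instead of sapply() so that the result is guaranteed to be one integer per row.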

Here is another option using the stringr library. This loops over the Genes column of df1 and uses the V1 column of df2 as the vector of patterns.

#convert factor columns into characters
df1$Genes<-as.character(df1$Genes)
df2$V1<-as.character(df2$V1)

library(stringr)
#for each Genes string, count the matches of every pattern in df2$V1 and sum them
df1$Counts<-sapply(df1$Genes, function(x){
  sum(str_count(x, df2$V1))
})



df1
  Id Support                                      Genes Counts
1  0      17           FOS,BCL2,PIK3CD,NFKBIA,TNFRSF10B      2
2  1      15 BMP2,TGFB1,BMP3,MAPK12,GDF11,MAPK13,CITED1      3
3  3      10     MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD      2
4  4      18  CBLC,TGFA,MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4      4
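One caveat worth keeping in mind with pattern-based counting: a pattern is matched anywhere in the string, so a gene symbol that is a prefix of a longer symbol would also be counted. The base R sketch below illustrates the difference on made-up input (the symbols BMP31 and MAPK130 and the names genes_row and targets are invented for this illustration and do not come from the question's data); grepl() with fixed = TRUE stands in for the substring search.

```r
# Made-up row and target list, chosen so that two targets are prefixes of longer tokens
genes_row <- "BMP31,MAPK130,FOS"
targets <- c("BMP3", "MAPK13", "FOS")

# Substring search: each target is looked for anywhere in the whole string
substring_hits <- sum(vapply(targets,
                             function(p) grepl(p, genes_row, fixed = TRUE),
                             logical(1)))

# Exact matching: split into tokens first, then compare whole tokens
tokens <- strsplit(genes_row, ",", fixed = TRUE)[[1]]
exact_hits <- sum(tokens %in% targets)

print(c(substring = substring_hits, exact = exact_hits))
```

For the data in the question both approaches agree, because none of the symbols in df2 is a substring of a different symbol in df1.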
