R: add a count column to a data frame by counting occurrences of strings from another data frame in a comma-separated column
I have a data frame df1:
df1 <- structure(list(Id = c(0, 1, 3, 4), Support = c(17, 15, 10, 18
), Genes = structure(c(3L, 1L, 4L, 2L), .Label = c("BMP2,TGFB1,BMP3,MAPK12,GDF11,MAPK13,CITED1",
"CBLC,TGFA,MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4", "FOS,BCL2,PIK3CD,NFKBIA,TNFRSF10B",
"MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
and another data frame df2:
df2 <- structure(list(V1 = structure(c(6L, 7L, 8L, 4L, 3L, 1L, 5L, 2L,
9L), .Label = c("BCL2", "BMP3", "CBLC", "CDC23", "CITED1", "FOS",
"MAPK13", "SPRY4", "TGFA"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
How can I create a new column in df1 that counts, for each row, how many of the strings in df2 occur in the Genes column, to get the following output?
Id | Support | Genes          | Counts |
----------------------------------------
0  | 17      | FOS,BCL2,...   | 2      |
1  | 15      | BMP2,TGFB1,... | 3      |
3  | 10      | MAPK12,YWHAE.. | 1      |
4  | 18      | CBLC,TGFA,...  | 4      |
There is probably a more elegant solution, but this does the job.
df1$Counts <- unlist(lapply(df1$Genes, function(x){
  # split the comma-separated gene list into individual names
  xx <- unlist(strsplit(as.character(x), split = ","))
  # count how many names from df2 appear in this row
  sum(df2$V1 %in% xx)
}))
Which gives:
Id Support Genes Counts
1 0 17 FOS,BCL2,PIK3CD,NFKBIA,TNFRSF10B 2
2 1 15 BMP2,TGFB1,BMP3,MAPK12,GDF11,MAPK13,CITED1 3
3 3 10 MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD 2
4 4 18 CBLC,TGFA,MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4 4
(I think in your example above, Counts in the third row should be 2, not 1?)
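The same idea can be sketched as a small self-contained function. This is only an illustration: the helper name count_genes is hypothetical, and the gene lists are typed in as plain character vectors (copied from df1 and df2 above) instead of factors.

```r
# Hypothetical helper: for each element of `genes`, count how many
# entries of `wanted` appear as exact comma-separated names.
count_genes <- function(genes, wanted) {
  tokens <- strsplit(as.character(genes), ",", fixed = TRUE)
  vapply(tokens, function(tk) sum(wanted %in% trimws(tk)), integer(1))
}

# Values copied from df1$Genes and df2$V1 in the question
genes <- c("FOS,BCL2,PIK3CD,NFKBIA,TNFRSF10B",
           "BMP2,TGFB1,BMP3,MAPK12,GDF11,MAPK13,CITED1",
           "MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD",
           "CBLC,TGFA,MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4")
wanted <- c("FOS", "MAPK13", "SPRY4", "CDC23", "CBLC",
            "BCL2", "CITED1", "BMP3", "TGFA")

counts <- count_genes(genes, wanted)
print(counts)
```

vapply is used here instead of unlist(lapply(...)) so that the result is guaranteed to be an integer vector with one value per row.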
Here is another option using the stringr library. It loops over the Genes column of df1 and uses the V1 column of df2 as the patterns.
library(stringr)
# convert factor columns to character
df1$Genes <- as.character(df1$Genes)
df2$V1 <- as.character(df2$V1)
# for each gene string, count the matches of every pattern from df2
df1$Counts <- sapply(df1$Genes, function(x){
  sum(str_count(x, df2$V1))
}, USE.NAMES = FALSE)
df1
Id Support Genes Counts
1 0 17 FOS,BCL2,PIK3CD,NFKBIA,TNFRSF10B 2
2 1 15 BMP2,TGFB1,BMP3,MAPK12,GDF11,MAPK13,CITED1 3
3 3 10 MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD 2
4 4 18 CBLC,TGFA,MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4 4
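One thing to keep in mind: str_count treats each pattern as a regular expression and counts every match, including matches inside a longer name. With the data in the question this makes no difference, but it could if one gene name were a prefix of another. The sketch below uses made-up gene lists (not from the question) and base R only, imitating substring counting with gregexpr so that it runs without stringr; count_sub is a hypothetical helper.

```r
# Made-up example data: "MAPK1" is a prefix of "MAPK12" and "MAPK13"
genes  <- c("MAPK12,YWHAE,MAPK13", "MAPK1,FOS")
wanted <- c("MAPK1", "FOS")

# Hypothetical helper: number of literal substring matches of `w`
# in each element of `g` (roughly what a pattern count would see).
count_sub <- function(g, w) {
  lengths(regmatches(g, gregexpr(w, g, fixed = TRUE)))
}

# Substring counting: "MAPK1" is found inside "MAPK12" and "MAPK13"
substring_counts <- Reduce(`+`, lapply(wanted, function(w) count_sub(genes, w)))

# Exact-name counting: split on commas and compare whole names
exact_counts <- vapply(strsplit(genes, ",", fixed = TRUE),
                       function(tk) sum(wanted %in% tk), integer(1))

print(substring_counts)
print(exact_counts)
```

If exact gene names are what matters, splitting on commas and using %in% (as in the first answer) avoids the issue.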