
Compare each column of one dataframe with another dataframe column and print each resulting overlap to separate files

I want to compare each column of one dataframe with a column of another dataframe and write each resulting overlap to a separate file.

I started with two test datasets:

df1 <- data.frame("x" = c("a_b", "c_d", "e_f/c_f", "g_h"),
                  "y" = c(9,2,1,4),
                  "z" = c(7,5,8,5))
df2 <- data.frame("m" = c("c_f", "x_y"),
                  "n" = c("a_b", "x_y"))

and used a for loop to get the results:

for (i in colnames(df2)){ 
  ccc<-df1[grep(paste(df2[,i], collapse = "|"), df1$x), ]
  write.csv(ccc, file = paste(i, ".csv", sep=""))
}

Everything looks fine.

Now I try the same loop on my full dataset (modified versions of df1 and df2 are shown below):

df1<- structure(list(BGC_Accession = structure(c(1L, 1L, 1L, 2L), .Label = c("BGC0000647", 
"BGC0000984"), class = "factor"), Genbank_ID = structure(c(1L, 
3L, 2L, 4L), .Label = c("GCA_000202835", "GCA_000219295", "GCA_000964345", 
"GCA_003029685"), class = "factor"), BGC_Class = structure(c(2L, 
2L, 2L, 1L), .Label = c("NRP/Polyketide", "Terpene"), class = "factor"), 
    BGC_Start = c(2093957L, 1L, 1L, 2656134L), BGC_End = c(2115021L, 
    4440L, 4186L, 2721658L), Product = structure(c(1L, 1L, 1L, 
    2L), .Label = c("Carotenoid", "Delftibactin"), class = "factor"), 
    Similarity = structure(c(1L, 1L, 1L, 1L), .Label = "100%", class = "factor"), 
    Species_name = structure(c(1L, 4L, 2L, 3L), .Label = c("Acidiphilium_multivorum", 
    "Acidiphilium_sp_PM", "Acidovorax_avenae/Acidovorax_avene", 
    "Acinetobacter_baumannii"), class = "factor"), Kingdom = structure(c(1L, 
    1L, 1L, 1L), .Label = "k__Bacteria", class = "factor"), Phylum = structure(c(1L, 
    1L, 1L, 1L), .Label = "p__Proteobacteria", class = "factor"), 
    Class = structure(c(1L, 1L, 1L, 2L), .Label = c("c__Alphaproteobacteria", 
    "c__Betaproteobacteria"), class = "factor"), Order = structure(c(2L, 
    2L, 2L, 1L), .Label = c("o__Burkholderiales", "o__Rhodospirillales"
    ), class = "factor"), Family = structure(c(1L, 1L, 1L, 2L
    ), .Label = c("f__Acetobacteraceae", "f__Comamonadaceae"), class = "factor"), 
    Genus = structure(c(1L, 1L, 1L, 2L), .Label = c("g__Acidiphilium", 
    "g__Acidovorax"), class = "factor"), Species = structure(c(1L, 
    1L, 2L, 3L), .Label = c("s__Acidiphilium_multivorum", "s__Acidiphilium_sp_PM", 
    "s__Acidovorax_avenae"), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))
df2<- structure(list(Gut_SRS011111 = structure(c(2L, 1L, 1L), .Label = c("", 
"Actinobaculum_unclassified"), class = "factor"), Gut_SRS011269 = structure(c(3L, 
1L, 2L), .Label = c("Acidiphilium_multivorum", "Acinetobacter_baumannii", 
"Clostridium_citroniae"), class = "factor"), Gut_SRS011355 = structure(c(2L, 
3L, 1L), .Label = c("", "Acidovorax_avene", "Streptococcus_gordonii"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-3L))

Using the script above:

for (i in colnames(df2)){ 
  overlap_data<-df1[grep(paste(df2[,i], collapse = "|"), df1$Species_name), ]
  write.csv(overlap_data, file = paste(i, ".csv", sep=""))
}

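Only one of the three columns in df2 seems to give the correct result. The first column of df2 has no overlap with df1, so it should produce an empty result file. The output file for the second column looks fine. In the third file I should get one overlap, not the four that appear in the output file.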

What am I doing wrong?

Thanks for your patience!

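The problem seems to be the empty "" cells, which should be NA. An empty string pasted into the pattern leaves an empty alternative (for example a trailing |), and that alternative matches every row.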

df2[df2 == ""] <- NA
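To see the effect of the empty cells in isolation, here is a minimal sketch. The vectors `species` and `col_with_blanks` are toy stand-ins (not the real data) for `df1$Species_name` and a df2 column that contains a blank cell.

```r
# Toy stand-ins for df1$Species_name and a df2 column with a blank cell
species <- c("Acidiphilium_multivorum", "Acinetobacter_baumannii")
col_with_blanks <- c("Acidovorax_avene", "Streptococcus_gordonii", "")

# Build the alternation the same way the loop does
pattern <- paste(col_with_blanks, collapse = "|")
print(pattern)  # note the trailing "|": the last alternative is empty

# Neither toy species is named in the column, yet both are reported as
# matches, because the empty alternative matches any string
print(grepl(pattern, species))
```

This matches the symptom in the question, where the third file contained all four rows of df1.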

Now grep should work. I use lapply here instead of a for loop:

invisible(lapply(names(df2), function(x) {
  rr <- df1[grep(paste0(df2[,x], collapse= "|"), df1$Species_name), ]
  write.csv(rr, file = paste(x, ".csv", sep=""))
}))

(invisible suppresses the unnecessary console output and can be omitted.)
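One caveat with the NA approach: paste() converts NA to the literal text "NA", so the pattern becomes something like "Acidovorax_avene|Streptococcus_gordonii|NA". That works for this data because no species name contains "NA", but it could give false hits on other data. A possible variant, sketched below, drops blank and missing entries before matching and treats each name as a fixed substring rather than a regular expression. The helper `match_rows` and the toy frames are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: rows of `data` whose `target` column contains any
# non-empty, non-missing entry of `candidates` as a literal substring.
match_rows <- function(data, target, candidates) {
  candidates <- as.character(candidates)
  candidates <- candidates[!is.na(candidates) & nzchar(candidates)]
  if (length(candidates) == 0) {
    return(data[0, , drop = FALSE])  # nothing to look for: empty result
  }
  # One logical vector per candidate, combined with element-wise OR
  hits <- Reduce(`|`, lapply(candidates, function(p) {
    grepl(p, data[[target]], fixed = TRUE)
  }))
  data[hits, , drop = FALSE]
}

# Toy data (not the real dataset) with both NA and "" cells
toy_df1 <- data.frame(
  Species_name = c("Acidiphilium_multivorum",
                   "Acidovorax_avenae/Acidovorax_avene"),
  stringsAsFactors = FALSE
)
toy_df2 <- data.frame(
  sample_a = c("Actinobaculum_unclassified", NA),
  sample_b = c("Acidovorax_avene", ""),
  stringsAsFactors = FALSE
)

# One result per df2 column; each could then be passed to write.csv()
results <- lapply(toy_df2, function(col) {
  match_rows(toy_df1, "Species_name", col)
})
print(results)
```

With fixed = TRUE, characters such as "." or "(" in a species name are matched literally, which is probably what is wanted for name lists like these.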

