
Identify elements from df1 in df2, then add a column in df2 for the matching rows, using R

I have a data frame with two columns (Genome) and a data frame with one column (list_SSNP).

What I am trying to do is add a third and a fourth column to my Genome data frame and put the value "1" in the rows of Genome whose gene appears in list_SSNP and, separately, in list_SCPG.

I am trying to get an output data frame that looks like this:

Gene_Symbol       CHR        SNP     
A1BG             19q13.43             
PDE1C            12p13.31     1        

This is part of the content of Genome, given as a reproducible example:

Genome <- data.frame(
  Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
  CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
  stringsAsFactors = FALSE
)
  Gene_Symbol      CHR
1        A1BG 19q13.43
2    A1BG-AS1 19q13.43
3        A1CF 10q11.23
4         A2M 12p13.31
5       PDE1C 12p13.31

And this is part of the content of list_SSNP:

list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")
    Gene_Symbol
1   PDE1C
2   IMMP2L
3   ZCCHC14
4   NOS1AP
5   HARBI1

I am starting with only one of the lists (list_SSNP). What I have tried is to loop through the Genome data frame and, for each row i, check whether element i of list_SSNP matches element i of Genome; if it does, add the number 1 to a third column. When I execute this code, nothing happens.

Full_genome <- read.table("FULL_GENOME.txt", header=TRUE, sep = "\t", dec = ',', na.strings=c("","NA"), fill=TRUE)
Genome <- Full_genome[,c(2,3)]
names(Genome) <- c("Gene_Symbol", "CHR")

list_SSNP <- as.data.frame(Gene_SSNP$Gene_Symbol)

for (i in 1: dim (Genome) [1]) {
  if(list_SSNP[i] %in% Genome[i,1]){
    Genome[i,3] <- 1 
  }
}

Just to clarify further, I have checked that all the elements of list_SSNP appear in Genome, so the problem is definitely not that there are no matches to find.

EDIT:

I have come to realize that my example does not make clear that the entries in list_SSNP and Genome are unique, with no duplicates, and that Genome has about 30k rows while list_SSNP has 49. I just want to add a column to Genome with the number 1 in those rows where the entry exists in both Genome and list_SSNP.

I believe this could help. You can try this code:

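# Data
Genome <- data.frame(Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
                     CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
                     stringsAsFactors = FALSE)
list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")
# Collapse the symbols into one regular expression; the ^( )$ anchors
# keep a symbol such as "A1BG" from also matching "A1BG-AS1"
vecc <- paste0("^(", paste0(list_SSNP, collapse = "|"), ")$")
# Flag matching rows: 1 if the gene symbol is in list_SSNP, 0 otherwise
Genome$SNP <- as.numeric(grepl(pattern = vecc, x = Genome$Gene_Symbol))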

Output:

  Gene_Symbol      CHR SNP
1        A1BG 19q13.43   0
2    A1BG-AS1 19q13.43   0
3        A1CF 10q11.23   0
4         A2M 12p13.31   0
5       PDE1C 12p13.31   1
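
If exact matches are all that is needed, a simpler option that avoids regular expressions is `%in%`. The following is only a minimal sketch under stated assumptions: the question does not show the contents of list_SCPG, so the values used for it here are hypothetical placeholders, and the column name CPG is likewise my own choice.

```r
# Example data from the question
Genome <- data.frame(
  Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
  CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
  stringsAsFactors = FALSE
)
list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")

# ASSUMPTION: list_SCPG is not shown in the question; these are placeholder values
list_SCPG <- c("A2M", "PDE1C")

# %in% is vectorised: each gene symbol is tested against the whole list,
# so no loop is needed and only exact matches count
Genome$SNP <- as.integer(Genome$Gene_Symbol %in% list_SSNP)
Genome$CPG <- as.integer(Genome$Gene_Symbol %in% list_SCPG)

print(Genome)
```

As far as I can tell from the description, the loop in the question compares entry i of one table with entry i of the other, whereas `%in%` checks every row of Genome against the complete list, which seems to be what is wanted here. With roughly 30k rows and 49 symbols this should run almost instantly.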

I may be missing something important here, because the problem is formulated quite specifically for its domain. So, when I abstracted it, I may have overlooked an issue with my proposed solution.

However, I understand that list_SSNP can contain the same SNP entry multiple times. So, first of all, you could create a list of unique SNPs with the count of their occurrences:

library(dplyr)

list_SSNP = data.frame(SNP = c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1"))
unique_SSNP = list_SSNP %>% 
    group_by(SNP) %>% 
    # the summarize() could be replaced by count I guess, but I usually use this for more control
    summarize(count = n()) 

And now you use a left_join:

Genome = data.frame(Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
                     CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
                     stringsAsFactors = F)

Genome_extended = Genome %>% 
    left_join(unique_SSNP, by = c("Gene_Symbol" = "SNP"))

The count column in the extended data frame will be NA for genes that have no match in list_SSNP, and you can fill those NAs with a variety of commands from dplyr, tidyr, or even base R.
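
One possible way to fill the NAs and get a 0/1 indicator column is sketched below. It repeats the steps above so that it runs on its own, and it assumes dplyr is installed. The names Genome_flagged and SNP for the result and the indicator column are my own choices, not something given in the question.

```r
library(dplyr)

Genome <- data.frame(
  Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
  CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
  stringsAsFactors = FALSE
)
list_SSNP <- data.frame(
  SNP = c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1"),
  stringsAsFactors = FALSE
)

# One row per distinct SNP with its number of occurrences
unique_SSNP <- list_SSNP %>%
  group_by(SNP) %>%
  summarize(count = n())

# Genome_flagged is a hypothetical name for the result
Genome_flagged <- Genome %>%
  left_join(unique_SSNP, by = c("Gene_Symbol" = "SNP")) %>%
  mutate(
    count = coalesce(count, 0L),   # replace NA (no match) with 0
    SNP = as.integer(count > 0)    # 1 if the gene is in list_SSNP, else 0
  )

print(Genome_flagged)
```

A base R equivalent of the fill step would be to assign 0 to the entries where `is.na()` is TRUE on the count column; `tidyr::replace_na()` is another option.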


 