Identify elements from df1 in df2, then add a column in df2 for the matching rows using R
I have a data frame with two columns (Genome) and a data frame with one column (list_SSNP).
What I am trying to do is add a third and a fourth column to my Genome data frame, set to "1" for the positions in Genome that appear in list_SSNP and, separately, in list_SCPG.
I am trying to get an output data frame that looks like this:
Gene_Symbol CHR SNP
A1BG 19q13.43
PDE1C 12p13.31 1
This is part of the content of Genome, given as a reproducible example:
Genome <- c()
Genome$Gene_Symbol <- c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C")
Genome$CHR <- c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31")
Gene_Symbol CHR
1 A1BG 19q13.43
2 A1BG-AS1 19q13.43
3 A1CF 10q11.23
4 A2M 12p13.31
5 PDE1C 12p13.31
And this is part of the content of list_SSNP:
list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")
Gene_Symbol
1 PDE1C
2 IMMP2L
3 ZCCHC14
4 NOS1AP
5 HARBI1
I am starting with just one of the lists (list_SSNP). I tried looping over the Genome data frame: for each row i, if element i of list_SSNP matches element i of Genome, write the number 1 into a third column. But when I run this code, nothing happens.
Full_genome <- read.table("FULL_GENOME.txt", header=TRUE, sep = "\t", dec = ',', na.strings=c("","NA"), fill=TRUE)
Genome <- Full_genome[,c(2,3)]
names(Genome) <- c("Gene_Symbol", "CHR")
list_SSNP <- as.data.frame(Gene_SSNP$Gene_Symbol)
for (i in 1:dim(Genome)[1]) {
  if (list_SSNP[i] %in% Genome[i, 1]) {
    Genome[i, 3] <- 1
  }
}
To clarify further, I have checked that all the elements of list_SSNP appear in Genome, so the problem is not that there are no matches.
EDIT:
I have realized that my example does not make clear that the entries in list_SSNP and Genome are unique, with no duplicates, and that Genome has about 30k rows while list_SSNP has 49. I just want to add a column to Genome with a 1 in the rows whose entry exists in both Genome and list_SSNP.
I believe this could help. You can try this code:
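# Data
Genome <- data.frame(Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
                     CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
                     stringsAsFactors = FALSE)
list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")
# Collapse the symbols into one regex; ^( ... )$ anchors it so only whole symbols match
# (unanchored, a symbol such as "A1BG" would also match "A1BG-AS1")
vecc <- paste0("^(", paste0(list_SSNP, collapse = "|"), ")$")
# Flag matching rows: 1 if the gene symbol is in list_SSNP, 0 otherwise
Genome$SNP <- as.numeric(grepl(pattern = vecc, x = Genome$Gene_Symbol))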
Output:
Gene_Symbol CHR SNP
1 A1BG 19q13.43 0
2 A1BG-AS1 19q13.43 0
3 A1CF 10q11.23 0
4 A2M 12p13.31 0
5 PDE1C 12p13.31 1
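As a possible alternative, here is a minimal sketch that uses exact matching with %in% instead of a regular expression. It also hints at why the loop in the question found nothing: the loop compares position i of one object with position i of the other, whereas %in% checks each gene symbol against the whole list. The contents of list_SCPG are not shown in the question, so the values used for it here are made up for illustration. If list_SSNP is a data frame rather than a character vector, its column (for example list_SSNP$Gene_Symbol) would be used instead.

```r
# Same toy data as in the question
Genome <- data.frame(
  Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
  CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
  stringsAsFactors = FALSE
)
list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")

# Hypothetical second list: the question does not show what list_SCPG contains
list_SCPG <- c("A2M", "IMMP2L")

# %in% tests every gene symbol against the whole vector (exact matches only)
Genome$SNP <- as.integer(Genome$Gene_Symbol %in% list_SSNP)
Genome$CPG <- as.integer(Genome$Gene_Symbol %in% list_SCPG)

print(Genome)
```

Because the comparison is vectorised, no explicit loop is needed, which should also scale comfortably to the roughly 30k rows mentioned in the question.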
I may be missing something important here, as the problem is formulated quite specifically for its domain. So, when I abstracted it, I may have overlooked an issue with my proposed solution.
However, I understand that list_SSNP can contain a SNP entry multiple times. So, first of all, you could create a list of unique SNPs with a count of their occurrences:
library(dplyr)
list_SSNP = data.frame(SNP = c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1"))
unique_SSNP = list_SSNP %>%
group_by(SNP) %>%
# the summarize() could be replaced by count I guess, but I usually use this for more control
summarize(count = n())
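The count() shortcut mentioned in the comment might look like the sketch below. It assumes the same list_SSNP data frame as above and uses the name argument of dplyr's count() so that the tally column is called count rather than the default n.

```r
library(dplyr)

list_SSNP <- data.frame(SNP = c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1"))

# count() groups by SNP and tallies the rows in each group;
# name = "count" keeps the same column name as the summarize() version
unique_SSNP <- count(list_SSNP, SNP, name = "count")
```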
And now you use a left_join:
Genome = data.frame(Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
stringsAsFactors = F)
Genome_extended = Genome %>%
left_join(unique_SSNP, by = c("Gene_Symbol" = "SNP"))
The count column in the extended data frame will be NA for genes that have no entry in the SNP list, and you can fill the NAs with a variety of commands from dplyr, tidyr or even base R.
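One way to fill those NAs, sketched here with plain base R indexing after the join. The unique_SSNP table is rebuilt inline with a count of 1 per SNP purely so the example is self-contained, and the final SNP column is an assumed convenience that turns the counts into the 0/1 flag asked for in the question.

```r
library(dplyr)

Genome <- data.frame(
  Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
  CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
  stringsAsFactors = FALSE
)
# Stand-in for the counted SNP table built earlier (each SNP occurs once here)
unique_SSNP <- data.frame(
  SNP = c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1"),
  count = 1L,
  stringsAsFactors = FALSE
)

Genome_extended <- Genome %>%
  left_join(unique_SSNP, by = c("Gene_Symbol" = "SNP"))

# Rows without a match come back as NA; replace them with 0
Genome_extended$count[is.na(Genome_extended$count)] <- 0L

# Optional 0/1 flag derived from the counts
Genome_extended$SNP <- as.integer(Genome_extended$count > 0)
```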