How to remove duplicates in a loop in R
I have a loop that goes through a large number of .tsv files and runs a function that writes the results to one file. The loop works; however, some of the .tsv files have duplicate values in one of the columns, which prevents the loop from working. I need to remove the rows with duplicate values in column V5. I have tried commands from previous answers on this site, but they are not working for some reason.
My input .tsv files look like this (other_trait):
V1 V2 V3 V4 V5
10 201874235 G T rs389130213
10 201876195 G C rs121467298
10 201876295 T A rs121467298
My code starts as below, formatting the files before running them through the function.
files <- list.files(path = ".", pattern = ".tsv")
files
datalist = list()
for (i in 1:length(files)) {
  other_trait <- read.table(files[i])
  colnames(other_trait)[which(names(other_trait) == "V2")] <- "BP"
  other_trait <- merge(other_trait, subset_1[, c("BP", "MAF")], by = "BP")
  other_trait <- unique(other_trait$V5)
I have tried using unique as above, and also other_trait <- other_trait[,(duplicated(other_trait$V5)), ]. unique deletes the other values in the dataframe and just retains the unique values of V5, and !(duplicated) doesn't seem to do anything!
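For reference, the standard base-R idiom keeps every column while dropping rows whose V5 value has already been seen: the logical vector must go in the row position (before the comma, not after it) and needs the `!`. A minimal sketch on the sample data from the question:

```r
# Sample data matching the input shown above (no header row in the real files,
# so read.table assigns the names V1..V5 automatically)
other_trait <- read.table(text = "10 201874235 G T rs389130213
10 201876195 G C rs121467298
10 201876295 T A rs121467298")

# duplicated() flags the SECOND and later occurrences of each V5 value;
# negating it keeps the first occurrence of every value, with all columns intact
deduped <- other_trait[!duplicated(other_trait$V5), ]
deduped
#>   V1        V2 V3 V4          V5
#> 1 10 201874235  G  T rs389130213
#> 2 10 201876195  G  C rs121467298
```

Assigning `unique(other_trait$V5)` instead replaces the whole data frame with a character vector of V5 values, which is why the other columns disappear; and putting the `duplicated()` condition after the comma indexes columns rather than rows.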
df <- read.table(text = "V1 V2 V3 V4 V5
10 201874235 G T rs389130213
10 201876195 G C rs121467298
10 201876295 T A rs121467298", h = T)
library(dplyr)
df %>%
  rename(BP = V2) %>%
  left_join(subset_1[, c("BP", "MAF")], by = "BP") %>%
  distinct(V5, .keep_all = T)
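Putting this together inside the original loop, each cleaned table can be stored in `datalist` and bound at the end. The sketch below is self-contained for illustration: the two temporary .tsv files, the toy `subset_1`, and the final `bind_rows` step are assumptions standing in for the real files and the real combining step.

```r
library(dplyr)

# Toy stand-ins (assumptions): two .tsv files in a temp directory and a small
# subset_1, so the sketch runs end-to-end; in the real code these already exist
dir <- tempfile()
dir.create(dir)
rows <- "10 201874235 G T rs389130213
10 201876195 G C rs121467298
10 201876295 T A rs121467298"
writeLines(rows, file.path(dir, "a.tsv"))
writeLines(rows, file.path(dir, "b.tsv"))
subset_1 <- data.frame(BP  = c(201874235, 201876195, 201876295),
                       MAF = c(0.01, 0.02, 0.03))

files <- list.files(path = dir, pattern = "\\.tsv$", full.names = TRUE)
datalist <- list()

for (i in seq_along(files)) {
  datalist[[i]] <- read.table(files[i]) %>%
    rename(BP = V2) %>%
    left_join(subset_1[, c("BP", "MAF")], by = "BP") %>%
    distinct(V5, .keep_all = TRUE)   # keep first row per V5, drop later duplicates
}

all_traits <- bind_rows(datalist)
nrow(all_traits)  # 2 unique V5 rows per file, 2 files -> 4 rows
```

`distinct(V5, .keep_all = TRUE)` keeps the first row for each V5 value while retaining all other columns, which is exactly the behavior the base-R `!duplicated()` indexing gives.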