How can I delete “a lot” of rows from a dataframe in R
I tried all the similar posts but none of the answers seemed to work for me. I want to delete 8,500+ rows (by row name only) from a dataframe with 27,000+ rows. The other columns are completely different, but the smaller dataset was derived from the larger one, and spot-checking names shows that whatever I look up from the smaller df is present in the larger df. I could of course do this manually (busy work for sure!), but it seems like there should be a simple computational answer.
I have tried:
fordel<-df2[1,]
df3<-df1[!rownames(df1) %in% fordel
l1<- as.vector(df2[1,])
df3<- df1[1-c(l1),]
and lots of other crazy ideas!
Here is a smallish example. df1:
Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
ENSMUSG00000000001.4 10634 6954 6835 6510
ENSMUSG00000000003.15 0 0 0 0
ENSMUSG00000000028.14 559 1570 807 1171
ENSMUSG00000000031.15 5748 174 4103 146
ENSMUSG00000000037.16 37 194 49 96
ENSMUSG00000000049.11 0 3 1 0
ENSMUSG00000000056.7 1157 1125 806 947
ENSMUSG00000000058.6 75 304 123 169
ENSMUSG00000000078.6 4012 4391 5637 3854
ENSMUSG00000000085.16 381 560 482 368
ENSMUSG00000000088.6 2667 4777 3483 3450
ENSMUSG00000000093.6 3 48 41 22
ENSMUSG00000000094.12 23 201 102 192
df2:
structure(list(base_mean = c(7962.408875, 947.1240794, 43.76698418 ), log2foldchange = c(-0.363434063, -0.137403759, -0.236463207 ), lfcSE = c(0.096816743, 0.059823215, 0.404929452), stat = c(-3.753834854, -2.296830066, -0.583961493)), row.names = c("ENSMUSG00000000001.4", "ENSMUSG00000000056.7", "ENSMUSG00000000093.6"), class = "data.frame")
I want to delete from df1 the rows corresponding to the row names in df2. I tried to format it, but it seems to be no longer formatted... oh well.
Suggestions really appreciated!
You mentioned row names but your data does not include them, so I'll assume that they really don't matter (or exist). Also, your df2 has more column headers than columns; I'm not sure what's going on there, so I'll ignore it.
df1 <- structure(list(Ent_gene_id = c("ENSMUSG00000000001.4", "ENSMUSG00000000003.15",
"ENSMUSG00000000028.14", "ENSMUSG00000000031.15", "ENSMUSG00000000037.16",
"ENSMUSG00000000049.11", "ENSMUSG00000000056.7", "ENSMUSG00000000058.6",
"ENSMUSG00000000078.6", "ENSMUSG00000000085.16", "ENSMUSG00000000088.6",
"ENSMUSG00000000093.6", "ENSMUSG00000000094.12"), clone57_RNA = c(10634L,
0L, 559L, 5748L, 37L, 0L, 1157L, 75L, 4012L, 381L, 2667L, 3L,
23L), clone43_RNA_2 = c(6954L, 0L, 1570L, 174L, 194L, 3L, 1125L,
304L, 4391L, 560L, 4777L, 48L, 201L), clone67_RNA = c(6835L,
0L, 807L, 4103L, 49L, 1L, 806L, 123L, 5637L, 482L, 3483L, 41L,
102L), clone55_RNA = c(6510L, 0L, 1171L, 146L, 96L, 0L, 947L,
169L, 3854L, 368L, 3450L, 22L, 192L)), class = "data.frame", row.names = c(NA,
-13L))
df2 <- structure(list(Ent_gene_id = c("ENSMUSG00000000001.4", "ENSMUSG00000000056.7",
"ENSMUSG00000000093.6"), base_mean = c(7962.408875, 947.1240794,
43.76698418), log2foldchange = c(-0.36343406, -0.137403759, -0.236463207
), pvalue = c(0.00017415, 0.021628466, 0.55924622)), class = "data.frame", row.names = c(NA,
-3L))
# base R: keep the rows of df1 whose Ent_gene_id does not appear in df2
df1[!df1$Ent_gene_id %in% df2$Ent_gene_id,]
# Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# 2 ENSMUSG00000000003.15 0 0 0 0
# 3 ENSMUSG00000000028.14 559 1570 807 1171
# 4 ENSMUSG00000000031.15 5748 174 4103 146
# 5 ENSMUSG00000000037.16 37 194 49 96
# 6 ENSMUSG00000000049.11 0 3 1 0
# 8 ENSMUSG00000000058.6 75 304 123 169
# 9 ENSMUSG00000000078.6 4012 4391 5637 3854
# 10 ENSMUSG00000000085.16 381 560 482 368
# 11 ENSMUSG00000000088.6 2667 4777 3483 3450
# 13 ENSMUSG00000000094.12 23 201 102 192
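To see why the base R version works, it can help to look at the intermediate logical vector on its own. This is only a minimal sketch on toy data; the names `big`, `small`, `keep`, and `result` are made up for illustration and are not part of your data.

```r
# toy stand-ins for the large and small data frames (hypothetical names)
big <- data.frame(id = c("a", "b", "c", "d"), x = 1:4, stringsAsFactors = FALSE)
small <- data.frame(id = c("b", "d"), y = c(0.1, 0.2), stringsAsFactors = FALSE)

# %in% binds tighter than !, so this is !(big$id %in% small$id):
# TRUE where the id from big is NOT found in small
keep <- !big$id %in% small$id
keep
# [1]  TRUE FALSE  TRUE FALSE

# rows are kept wherever the mask is TRUE
result <- big[keep, ]
```

Because the mask is computed in one vectorised step, the same line should scale to tens of thousands of rows without any looping.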
# dplyr alternative: anti_join() keeps the rows of df1 that have no match in df2
dplyr::anti_join(df1, df2, by = "Ent_gene_id")
# Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# 1 ENSMUSG00000000003.15 0 0 0 0
# 2 ENSMUSG00000000028.14 559 1570 807 1171
# 3 ENSMUSG00000000031.15 5748 174 4103 146
# 4 ENSMUSG00000000037.16 37 194 49 96
# 5 ENSMUSG00000000049.11 0 3 1 0
# 6 ENSMUSG00000000058.6 75 304 123 169
# 7 ENSMUSG00000000078.6 4012 4391 5637 3854
# 8 ENSMUSG00000000085.16 381 560 482 368
# 9 ENSMUSG00000000088.6 2667 4777 3483 3450
# 10 ENSMUSG00000000094.12 23 201 102 192
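Note that the two outputs above differ only in their row labels: the base R subset keeps the original row numbers (2, 3, 4, ...), while the anti_join result is numbered from 1. If you want consecutive numbering from the base R version, resetting the row names should do it. A small sketch on toy data (the names `toy` and `filtered` are hypothetical):

```r
toy <- data.frame(id = c("a", "b", "c", "d"), x = 1:4, stringsAsFactors = FALSE)

# base R subsetting keeps the original row labels of the surviving rows
filtered <- toy[!toy$id %in% c("a", "c"), ]
rownames(filtered)
# [1] "2" "4"

# assigning NULL resets the labels to 1, 2, ...
rownames(filtered) <- NULL
rownames(filtered)
# [1] "1" "2"
```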
Edit: same thing but with row names:
# update my df1 to change Ent_gene_id from a column to rownames
rownames(df1) <- df1$Ent_gene_id
df1$Ent_gene_id <- NULL
# use your updated df2 (from dput)
# df2 <- structure(...)
df1[ !rownames(df1) %in% rownames(df2), ]
# clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# ENSMUSG00000000003.15 0 0 0 0
# ENSMUSG00000000028.14 559 1570 807 1171
# ENSMUSG00000000031.15 5748 174 4103 146
# ENSMUSG00000000037.16 37 194 49 96
# ENSMUSG00000000049.11 0 3 1 0
# ENSMUSG00000000058.6 75 304 123 169
# ENSMUSG00000000078.6 4012 4391 5637 3854
# ENSMUSG00000000085.16 381 560 482 368
# ENSMUSG00000000088.6 2667 4777 3483 3450
# ENSMUSG00000000094.12 23 201 102 192
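With 27,000+ rows you probably cannot eyeball the result, so it may be worth checking that the filter removed exactly what you expected. The following is a sketch on toy data, not your real objects: `big`, `small`, `kept`, and `same` are hypothetical names, and you would substitute df1 and df2.

```r
# toy data frames that carry their identifiers as row names
big <- data.frame(a = 1:5, b = 6:10,
                  row.names = c("g1", "g2", "g3", "g4", "g5"))
small <- data.frame(z = c(0.5, 0.7), row.names = c("g2", "g4"))

# every row name to be removed should exist in the larger data frame
stopifnot(all(rownames(small) %in% rownames(big)))

# drop = FALSE keeps the result a data frame even if only one column remains
kept <- big[!rownames(big) %in% rownames(small), , drop = FALSE]

# the number of rows dropped should equal the number of row names supplied
stopifnot(nrow(kept) == nrow(big) - nrow(small))

# an equivalent selection via set difference on the row names
same <- big[setdiff(rownames(big), rownames(small)), , drop = FALSE]
```

I would stay with the logical mask (or setdiff) rather than negative indexing such as big[-which(...), ]: if nothing matches, which() returns an empty vector and the negative index then selects zero rows instead of all of them.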