
How to remove duplicated rows based on 3 columns for only one factor level?


I have a list of 130 data frames, each with 27 columns and a factor with 2 levels. I want to remove the duplicated rows in each data frame based on 3 columns, but for one factor level only, keeping all rows of the other factor level, duplicates included.

I have sorted all the data frames by factor level and then tried to remove the duplicated rows only for the first factor level.

The list is called x, and I index the data frames in it with x[[i]], with i running from 1 to 130.


In every data frame, the column called temp is a factor with 2 levels, 0 and 1. All 130 data frames have been ordered so that the rows with level = 0 come first, followed by the rows with level = 1.

for (i in 1:130) {
  # Convert `temp` (column index 24) to a factor with the 2 levels `0` and `1`
  x[[i]]$temp <- factor(x[[i]]$temp, levels = c(0, 1))

  # Order each data frame by level: level = 0 first, then level = 1
  x[[i]] <- x[[i]][order(x[[i]]$temp), ]

  # This removes duplicates based on columns 2, 27 and 25 across all rows,
  # but I want to perform this only for the first factor level (= 0)
  x[[i]] <- x[[i]][!duplicated(x[[i]][c(2, 27, 25)]), ]
}

For a single data frame, say df, you can do the following:

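library(dplyr)
# across() selects columns 2, 27 and 25 by position (dplyr >= 1.0.0);
# bare numbers inside distinct() would be read as constants, not columns
df %>% distinct(temp, across(c(2, 27, 25)), .keep_all = TRUE)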

Note that you don't have to group by your factor: because temp is part of the key, rows from the two factor levels that share the same values in columns 2, 27 and 25 still count as two distinct rows.

The key here is the argument .keep_all, which keeps the remaining columns. Note, however, that if the remaining columns differ in some way, you only get 1 row back for each distinct combination of temp and columns 2, 27 and 25; distinct() keeps the first such row in the data frame's current order and drops the rest.
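To make the effect of .keep_all concrete, here is a small sketch on a toy data frame. The column names a and extra are hypothetical stand-ins and do not come from the original data; the point is only to show which rows survive when the non-key columns differ.

```r
library(dplyr)

# Toy data (hypothetical): `a` plays the role of a key column,
# `extra` is a non-key column whose values differ between duplicates
df <- data.frame(
  temp  = factor(c(0, 0, 1, 1), levels = c(0, 1)),
  a     = c("x", "x", "x", "x"),
  extra = c("first", "second", "third", "fourth"),
  stringsAsFactors = FALSE
)

# One row per distinct (temp, a) combination; all other columns are kept
out <- df %>% distinct(temp, a, .keep_all = TRUE)
print(out)
```

Because temp is part of the key, one row is kept per factor level, and within each level the first row in the current order is the one that remains.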

To extend this to a list of data frames, you can use lapply:

lapply(x, function(df) {
  # Same call as for a single data frame; across() selects columns by position
  df %>% distinct(temp, across(c(2, 27, 25)), .keep_all = TRUE)
}) %>% bind_rows(.id = 'date')

where the final call to bind_rows simply combines everything into a single data frame. The .id argument adds a column named date whose values are the entry names of your input list (or the list positions if the list is unnamed).
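If the goal is to drop duplicates only among the rows with temp equal to 0, while leaving every row with temp equal to 1 untouched, one possible base R sketch is shown here. The helper name dedupe_one_level and the toy columns id, a and b are assumptions made for illustration; for the real data, key_cols would be c(2, 27, 25).

```r
# Hypothetical helper: remove duplicated rows (judged on `key_cols`)
# only among rows whose `temp` equals `level`; all other rows are kept
dedupe_one_level <- function(df, key_cols, level = "0") {
  # Rows belonging to the chosen factor level (NA in temp counts as "not in level")
  in_level <- !is.na(df$temp) & as.character(df$temp) == level
  # Start with "drop nothing", then flag duplicates within the chosen level only
  drop <- logical(nrow(df))
  drop[in_level] <- duplicated(df[in_level, key_cols, drop = FALSE])
  df[!drop, , drop = FALSE]
}

# Toy data (hypothetical column names)
df <- data.frame(
  id   = 1:6,
  temp = factor(c(0, 0, 0, 1, 1, 1), levels = c(0, 1)),
  a    = c("x", "x", "y", "x", "x", "x"),
  b    = c(1, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

res <- dedupe_one_level(df, key_cols = c("a", "b"))

# The same helper applied to every data frame in a list
lst <- list(first = df, second = df)
cleaned <- lapply(lst, dedupe_one_level, key_cols = c("a", "b"))
```

In the toy data, only the second row (a repeat within level 0) is removed; the repeated rows in level 1 all stay. For the list from the question, the analogous call would be lapply(x, dedupe_one_level, key_cols = c(2, 27, 25)). Unlike the distinct() approach, this leaves duplicates within level 1 in place.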
