How to remove duplicated rows based on 3 columns for only one factor level?

I have a list of 130 dataframes each with 27 columns and 2 factor levels per dataframe. I want to remove the duplicated rows in each dataframe based on 3 columns for one factor level only, keeping all rows in the other factor level and their duplicates.

I have sorted all the dataframes according to the factor levels and then I tried to remove the duplicated rows only for the first factor level.

The list is called x, and I index the dataframes in the list with x[[i]], with i running from 1 to 130.

The column called temp in every dataframe contains 2 factor levels, either 0 or 1. The 130 dataframes have been ordered so that the level = 0 rows come first, followed by the level = 1 rows.

for (i in 1:130) {
  # Convert the `temp` column (index position 24) to a factor with levels `0` and `1`
  x[[i]]$temp <- factor(x[[i]]$temp, levels = c(0, 1))

  # Order each dataframe by level: level = 0 first, then level = 1
  x[[i]] <- x[[i]][order(x[[i]]$temp), ]

  # Remove duplicates based on columns 2, 27 and 25.
  # This removes them for both levels, but I want to do it only for level = 0.
  x[[i]] <- x[[i]][!duplicated(x[[i]][c(2, 27, 25)]), ]
}

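For a single data frame, say df, you can do the following. The columns are selected by position with across() and all_of(), because bare numbers passed to distinct() are treated as constants rather than as column positions:

library(dplyr)
df %>% distinct(temp, across(all_of(c(2, 27, 25))), .keep_all = TRUE)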

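Note that you don't have to group on your factor: because temp is included in the call to distinct(), rows from different factor levels that share the same values in columns 2, 27 and 25 still count as distinct rows. Be aware, though, that this removes duplicates within both levels, not only within level 0.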

The key here is the argument .keep_all, which keeps the remaining columns. Note, however, that if the remaining columns differ in some way, you only get 1 row back for each distinct combination of temp and columns 2, 27 and 25: distinct() keeps the first row of each combination and drops the rest.
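If duplicates should be dropped only where temp is 0, while every row where temp is 1 is kept, one option is to flag duplicates among the level-0 rows alone. The following is a minimal base R sketch, not a definitive solution. The toy data frame and the column names id, a and b are made up for illustration and stand in for columns 2, 27 and 25.

```r
# Toy data (assumption): id, a, b stand in for the real columns 2, 27 and 25.
df <- data.frame(
  temp = factor(c(0, 0, 0, 1, 1, 1), levels = c(0, 1)),
  id   = c("x", "x", "y", "x", "x", "z"),
  a    = c(1, 1, 2, 1, 1, 3),
  b    = c(5, 5, 6, 5, 5, 7)
)

key_cols <- c("id", "a", "b")

# Rows that belong to level 0 (NA values in temp are treated as "not level 0").
is_zero <- !is.na(df$temp) & df$temp == "0"

# Flag duplicates among the level-0 rows only; all other rows stay FALSE.
dup_in_zero <- logical(nrow(df))
dup_in_zero[is_zero] <- duplicated(df[is_zero, key_cols, drop = FALSE])

# Drop only the flagged rows; level-1 rows are never dropped.
result <- df[!dup_in_zero, ]
```

Because duplicated() only ever sees the level-0 rows, a level-1 row that repeats a level-0 row is also kept. That differs from calling duplicated() on the whole sorted data frame.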

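To extend this to a list of data frames, you can use lapply. As before, the columns are selected by position with across() and all_of(), because bare numbers passed to distinct() are treated as constants rather than as column positions:

lapply(x, function(df) {
  df %>% distinct(temp, across(all_of(c(2, 27, 25))), .keep_all = TRUE)
}) %>% bind_rows(.id = 'date')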

where the last call to bind_rows simply combines everything into a single data frame. The .id argument adds a column named date whose values are the element names of your input list (or a numeric sequence if the list is unnamed).
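If each data frame should stay a separate list element, and duplicates should be removed for one level only, the same lapply pattern can wrap a small helper. This is a sketch under assumptions: dedupe_one_level is a hypothetical helper name, and the two-element list below is toy data standing in for x.

```r
# Hypothetical helper: drop rows duplicated on key_cols, but only within
# the rows where temp equals `level`; all other rows are returned untouched.
dedupe_one_level <- function(df, key_cols, level = "0") {
  in_level <- !is.na(df$temp) & df$temp == level
  dup <- logical(nrow(df))
  dup[in_level] <- duplicated(df[in_level, key_cols, drop = FALSE])
  df[!dup, , drop = FALSE]
}

# Toy list (assumption) standing in for the real list x of 130 data frames.
x <- list(
  day1 = data.frame(temp = factor(c(0, 0, 1, 1), levels = c(0, 1)),
                    id   = c("a", "a", "b", "b")),
  day2 = data.frame(temp = factor(c(0, 1, 1, 0), levels = c(0, 1)),
                    id   = c("c", "c", "c", "c"))
)

# lapply keeps the list names, so each result can still be matched to its input.
cleaned <- lapply(x, dedupe_one_level, key_cols = "id")
```

key_cols can also be a vector of column positions such as c(2, 27, 25), since it is passed directly to base R's [ indexing.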
