
R: Remove duplicates from a dataframe based on categories in a column

Here is my example data set:

     Name Course Category
 1: Jason     ML       PT
 2: Jason     ML       DI
 3: Jason     ML       GT
 4: Jason     ML       SY
 5: Jason     DS       SY
 6: Jason     DS       DI
 7: Nancy     ML       PT
 8: Nancy     ML       SY
 9: Nancy     DS       DI
10: Nancy     DS       GT
11: James     ML       SY
12:  John     DS       GT

I want to delete the duplicate rows so that each combination of Name and Course appears only once in the dataframe. Which row is kept depends on the value in the Category column, with values preferred in this order: {'PT','DI','GT','SY'}.

My output dataframe should look like this:

    Name Course Category
1: Jason     ML       PT
2: Jason     DS       DI
3: Nancy     ML       PT
4: Nancy     DS       DI
5: James     ML       SY
6:  John     DS       GT

Currently, I am using a combination of a for loop and if conditions. Since the input dataframe is massive (10 million rows), it takes forever. Is there a better, more efficient way to do this?

Here is a snippet that does what you asked:

df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))

df <- df[order(df$Category),]

df[!duplicated(df[,c('Name', 'Course')]),]

Output:

Name Course Category
Jason     ML       PT
Nancy     ML       PT
Jason     DS       DI
Nancy     DS       DI
John      DS       GT
James     ML       SY

The idea is to sort the rows by the priority order first. duplicated() then flags every row after the first one for each Name/Course pair, so subsetting with !duplicated(...) keeps only the highest-priority row per pair, which is what we want.
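To make the sort-then-deduplicate logic easier to follow, here is a minimal base R sketch on the sample data. It swaps the factor conversion for match() against a priority vector, which is an illustrative variation and not the code from the answer above; the names priority, rank_idx, sorted and result are made up for this example.

```r
# Sample data from the question (column spelled "Category" here)
df <- data.frame(
  Name = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
           "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course = c("ML", "ML", "ML", "ML", "DS", "DS",
             "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE
)

# Preference order: earlier entries win (assumed vector name)
priority <- c("PT", "DI", "GT", "SY")

# Position of each row's category in the priority vector (1 = most preferred)
rank_idx <- match(df$Category, priority)

# order() leaves ties in their original order, so the sort is stable
sorted <- df[order(rank_idx), ]

# duplicated() is FALSE only for the first row of each Name/Course pair
result <- sorted[!duplicated(sorted[, c("Name", "Course")]), ]
print(result)
```

Because the rows are sorted by priority before deduplication, the first row seen for each pair is the preferred one.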

Since you mentioned you have 10 million rows, here is a data.table solution:

library(data.table)

setDT(df)[, .SD[which.min(factor(Category, levels = c("PT","DI","GT","SY")))], by=.(Name, Course)]

Result:

    Name Course Category
1: Jason     ML       PT
2: Jason     DS       DI
3: Nancy     ML       PT
4: Nancy     DS       DI
5: James     ML       SY
6:  John     DS       GT

Benchmarking:

# Random resampling of `df` to generate 10 million rows
set.seed(123)
df_large = data.frame(lapply(df, sample, 1e7, replace = TRUE))

# Data prep Base R  
df1 <- df_large

df1$Category <- factor(df1$Category, levels = c("PT", "DI", "GT", "SY"))

df1 <- df1[order(df1$Category), ]

# Data prep data.table
df2 <- df_large

df2$Category <- factor(df2$Category, levels = c("PT", "DI", "GT", "SY"))

setDT(df2)

Results:

library(microbenchmark)
microbenchmark(df1[!duplicated(df1[,c('Name', 'Course')]), ], 
               df2[, .SD[which.min(df2$Category)], by=.(Name, Course)])

Unit: milliseconds
                                                      expr       min        lq      mean    median        uq       max neval
            df1[!duplicated(df1[, c("Name", "Course")]), ] 1696.7585 1719.4932 1788.5821 1774.3131 1803.7565 2085.9722   100
 df2[, .SD[which.min(df2$Category)], by = .(Name, Course)]  387.8435  409.9365  436.4381  427.6739  451.1776  558.2749   100
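One thing to watch in the benchmarked data.table expression: df2$Category refers to the entire column, not to the rows of the current group, so it may not measure the same per-group selection as the earlier answer code. A minimal sketch of the grouped form, on a few made-up toy rows (the names dt and res are assumptions for illustration), could look like this:

```r
library(data.table)

# Toy rows for illustration only, not the 10-million-row benchmark data
dt <- data.table(
  Name = c("Jason", "Jason", "Jason", "Nancy"),
  Course = c("ML", "ML", "DS", "ML"),
  Category = factor(c("DI", "PT", "SY", "SY"),
                    levels = c("PT", "DI", "GT", "SY"))
)

# Bare `Category` (no `dt$` prefix) is evaluated within each group in `j`,
# so which.min() picks the most preferred level per Name/Course pair
res <- dt[, .SD[which.min(as.integer(Category))], by = .(Name, Course)]
print(res)
```

Timings for this grouped form would need to be measured separately; the numbers above apply only to the expressions shown in the benchmark output.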

Data:

df = structure(list(Name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 4L, 
4L, 4L, 4L, 1L, 3L), .Label = c("James", "Jason", "John", "Nancy"
), class = "factor"), Course = structure(c(2L, 2L, 2L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 1L), .Label = c("DS", "ML"), class = "factor"), 
    Category = structure(c(3L, 1L, 2L, 4L, 4L, 1L, 3L, 4L, 1L, 
    2L, 4L, 2L), .Label = c("DI", "GT", "PT", "SY"), class = "factor")), .Names = c("Name", 
"Course", "Category"), class = "data.frame", row.names = c("1:", 
"2:", "3:", "4:", "5:", "6:", "7:", "8:", "9:", "10:", "11:", 
"12:"))

You're not removing based on category; you're really trying to remove full duplicate rows from the dataframe.

You can remove full duplicate rows by subsetting the dataframe:

base R:
df_without_dupes <- df[!duplicated(df),]
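As a quick check, the sketch below compares full-row deduplication with deduplication on Name and Course only, using the sample data from the question. The variable names full_dedup and pair_dedup are made up for this example.

```r
# Sample data from the question (column spelled "Category" here)
df <- data.frame(
  Name = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
           "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course = c("ML", "ML", "ML", "ML", "DS", "DS",
             "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE
)

# Full-row deduplication: a row is dropped only if all three columns repeat
full_dedup <- df[!duplicated(df), ]

# Deduplication on the key columns only: one row per Name/Course pair
pair_dedup <- df[!duplicated(df[, c("Name", "Course")]), ]

print(nrow(full_dedup))
print(nrow(pair_dedup))
```

On these 12 rows every full row is already distinct, so the full-row version leaves the data unchanged, while the key-column version reduces it to six rows (without applying any category preference).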

I would suggest using the dplyr package for this.

See below:

require(dplyr)

data %>% 
  mutate(
    Category_factored=as.numeric(factor(Category,levels=c('PT','DI','GT','SY'),labels=1:4))
  ) %>% 
  group_by(Name,Course) %>% 
  filter(
    Category_factored == min(Category_factored)
  )

In case you are new to R, install dplyr using install.packages('dplyr').

You'll need to create an index to represent the order of the categories. Then sort by that priority and deduplicate by Name and Course.

library(tidyverse)

# create index to sort by
index.df <- data.frame("Category" = c('PT',"DI","GT","SY"), "Index" = c(1,2,3,4))

# join to orig dataset
data <- left_join(data, index.df, by = "Category")

# sort by index, dedup with Name and Course
data %>% arrange(Index) %>% group_by(Name,Course) %>% 
distinct(Name,Course, .keep_all = TRUE) %>% select(-Index)

Quick benchmark for the given solutions:

library(microbenchmark)
library(tidyverse)
library(data.table)

# 1. Data set
df_raw <- data.frame(
  name = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason", "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  course = c("ML", "ML", "ML", "ML", "DS", "DS", "ML", "ML", "DS", "DS", "ML", "DS"),
  category = c("PT", "DI", "GT", "SY", "SY", "DI", "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE)

# 3. Solution 'basic R'
f1 <- function(){

  # 1. Create data set
  df <- df_raw

  # 2. Convert 'category' as factor
  df$category <- factor(df$category, levels = c("PT", "DI", "GT", "SY"))

  # 3. Sort by 'category'
  df <- df[order(df$category), ]

  # 4. Select rows without duplicates by 'name' and 'course'
  df[!duplicated(df[,c('name', 'course')]), ]

}

# 4. Solution 'dplyr'
f2 <- function(){
  # 1. Create data set
  df <- df_raw

  # 2. Solution
  df_raw %>% 
    mutate(category_factored = as.numeric(factor(category, levels = c('PT','DI','GT','SY'), labels = 1:4))) %>% 
    group_by(name, course) %>% 
    filter(category_factored == min(category_factored))
}

# 5. Solution 'data.table'
f3 <- function(){
  # 1. Create data set
  df <- df_raw

  # 2. Solution
  setDT(df)[, .SD[which.min(factor(category, levels = c("PT","DI","GT","SY")))], by=.(name, course)]
}

# 6. Solution 'dplyr' with an index join
f4 <- function(){

  # 1. Create data set
  df <- df_raw

  # 2. Create 'index' to sort by
  df_index <- data.frame("category" = c('PT',"DI","GT","SY"), "index" = c(1, 2, 3, 4))

  # 3. Join to original dataset
  df <- left_join(df, df_index, by = "category")

  # 4. Sort by 'index', dedup with 'name' and 'course'
  df %>% 
    arrange(index) %>% 
    group_by(name, course) %>% 
    distinct(name, course, .keep_all = TRUE) %>% 
    select(-index)
}

# Test for solutions
microbenchmark(f1(), f2(), f3(), f4())

Unit: milliseconds
expr       min        lq      mean    median        uq       max neval  cld
f1()  1.350875  1.468044  1.682641  1.603816  1.687203  5.006231   100 a   
f2() 12.547863 12.864521 13.766343 13.543806 14.227795 18.350335   100   c 
f3()  2.517014  2.634612  2.944483  2.792619  2.873013  9.355626   100  b  
f4() 21.073892 21.608212 23.246332 22.338600 23.934932 41.883938   100    d

As the timings show, f1() and f3() are the fastest solutions on this 12-row data set.

I may be late, but I believe this is the simplest solution. Since you mentioned 10 million rows, I propose a data.table implementation using the easy-to-understand unique function.

require("data.table")
df <- data.table("Name" = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason", "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"), "Course" = c("ML", "ML", "ML", "ML", "DS", "DS", "ML", "ML", "DS", "DS", "ML", "DS"), "category" = c("PT", "DI", "GT", "SY", "SY", "DI", "PT", "SY", "DI", "GT", "SY", "GT"))

unique(df[, category := factor(category, levels = c("PT","DI","GT","SY"))][order(df$"category")], by = c("Name", "Course"))

    Name Course category
1: Jason     ML       PT
2: Nancy     ML       PT
3: Jason     DS       DI
4: Nancy     DS       DI
5:  John     DS       GT
6: James     ML       SY

