R: Remove duplicates from a dataframe based on categories in a column
Here is my example data set:
Name Course Category
1: Jason ML PT
2: Jason ML DI
3: Jason ML GT
4: Jason ML SY
5: Jason DS SY
6: Jason DS DI
7: Nancy ML PT
8: Nancy ML SY
9: Nancy DS DI
10: Nancy DS GT
11: James ML SY
12: John DS GT
I want to delete the duplicate rows so that the rows are unique across the dataframe. Which duplicates are deleted depends on the values in the Category column. The preference for values in the Category column is given in this order: {'PT','DI','GT','SY'}.
My output dataframe should look like this:
Name Course Category
1: Jason ML PT
2: Jason DS DI
3: Nancy ML PT
4: Nancy DS DI
5: James ML SY
6: John DS GT
Currently, I am using a combination of a for loop and if conditions. Since the input dataframe is massive (10 million rows), it takes forever. Is there a better and more efficient way to do the same?
Here is a snippet that does what you asked:
df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))
df <- df[order(df$Category),]
df[!duplicated(df[,c('Name', 'Course')]),]
Output:
Name Course Category
Jason ML PT
Nancy ML PT
Jason DS DI
Nancy DS DI
John DS GT
James ML SY
The idea is that we sort by the priority order first. Then we apply the de-duplication, which keeps the first match for each Name/Course pair. Because the rows are already sorted by preference, that first match is the row we want.
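The sort-then-dedup idea can also be sketched without converting the column to a factor. The following is a minimal illustration, assuming a hypothetical copy of the question's data called `df_demo` and a preference vector `pref`; neither name comes from the original answer. It relies on `order()` keeping tied rows in their original order and on `duplicated()` flagging every occurrence after the first.

```r
# Hypothetical copy of the example data from the question
df_demo <- data.frame(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE
)

pref <- c("PT", "DI", "GT", "SY")  # most preferred first

# match() gives each row's position in the preference vector (PT = 1, ..., SY = 4)
pref_rank <- match(df_demo$Category, pref)

# Sort by preference, then keep the first row seen for each Name/Course pair
sorted <- df_demo[order(pref_rank), ]
result <- sorted[!duplicated(sorted[, c("Name", "Course")]), ]
print(result)
```

Using `match()` leaves the Category column as plain character data, which may be convenient if the column is needed unchanged later on.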
Since you mentioned you have 10 million rows, here is a data.table solution:
library(data.table)
setDT(df)[, .SD[which.min(factor(Category, levels = c("PT","DI","GT","SY")))], by=.(Name, Course)]
Result:
Name Course Category
1: Jason ML PT
2: Jason DS DI
3: Nancy ML PT
4: Nancy DS DI
5: James ML SY
6: John DS GT
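A variation on the same per-group selection uses `.I` (row numbers in the full table) instead of subsetting `.SD`. This is only a sketch under assumptions: `dt_demo` and `pref` are hypothetical names, and whether this form is faster on 10 million rows was not measured here.

```r
library(data.table)

# Hypothetical copy of the example data from the question
dt_demo <- data.table(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT")
)

pref <- c("PT", "DI", "GT", "SY")  # most preferred first

# Within each Name/Course group, .I holds that group's row numbers in dt_demo;
# which.min() picks the position of the most preferred category in the group.
idx <- dt_demo[, .I[which.min(match(Category, pref))], by = .(Name, Course)]$V1

# Subset the original table once, using the collected row numbers
result_dt <- dt_demo[idx]
print(result_dt)
```

The groups come back in order of first appearance, so the row order matches the result shown above.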
Benchmarking:
# Random resampling of `df` to generate 10 million rows
set.seed(123)
df_large = data.frame(lapply(df, sample, 1e7, replace = TRUE))
# Data prep Base R
df1 <- df_large
df1$Category <- factor(df1$Category, levels = c("PT", "DI", "GT", "SY"))
df1 <- df1[order(df1$Category), ]
# Data prep data.table
df2 <- df_large
df2$Category <- factor(df2$Category, levels = c("PT", "DI", "GT", "SY"))
setDT(df2)
Results:
library(microbenchmark)
# Note: inside j, df2$Category is the whole column, not the current group's values
microbenchmark(df1[!duplicated(df1[,c('Name', 'Course')]), ],
               df2[, .SD[which.min(df2$Category)], by=.(Name, Course)])
Unit: milliseconds
                                                      expr       min        lq      mean    median        uq       max neval
            df1[!duplicated(df1[, c("Name", "Course")]), ] 1696.7585 1719.4932 1788.5821 1774.3131 1803.7565 2085.9722   100
 df2[, .SD[which.min(df2$Category)], by = .(Name, Course)]  387.8435  409.9365  436.4381  427.6739  451.1776  558.2749   100
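One detail of the benchmarked data.table expression is worth checking: inside `j`, a bare column name such as `Category` refers to the current group's values, while `df2$Category` refers to the whole column. The sketch below, on a hypothetical four-row table `dt_demo`, shows that the two forms do not select the same rows; it says nothing about how the timings above would change.

```r
library(data.table)

# Hypothetical four-row sample with Category as a preference-ordered factor
dt_demo <- data.table(
  Name     = c("Jason", "Jason", "Nancy", "Nancy"),
  Course   = c("DS", "DS", "ML", "ML"),
  Category = factor(c("SY", "DI", "SY", "PT"),
                    levels = c("PT", "DI", "GT", "SY"))
)

# Bare column name: which.min() sees only the current group's categories
per_group <- dt_demo[, .SD[which.min(as.integer(Category))],
                     by = .(Name, Course)]

# dt_demo$Category: which.min() sees all rows, so the same global index
# is reused for every group, whether or not it fits inside that group
whole_col <- dt_demo[, .SD[which.min(as.integer(dt_demo$Category))],
                     by = .(Name, Course)]

print(per_group)
print(whole_col)
```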
Data:
df = structure(list(Name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 4L,
4L, 4L, 4L, 1L, 3L), .Label = c("James", "Jason", "John", "Nancy"
), class = "factor"), Course = structure(c(2L, 2L, 2L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 1L), .Label = c("DS", "ML"), class = "factor"),
Category = structure(c(3L, 1L, 2L, 4L, 4L, 1L, 3L, 4L, 1L,
2L, 4L, 2L), .Label = c("DI", "GT", "PT", "SY"), class = "factor")), .Names = c("Name",
"Course", "Category"), class = "data.frame", row.names = c("1:",
"2:", "3:", "4:", "5:", "6:", "7:", "8:", "9:", "10:", "11:",
"12:"))
You're not removing based on category; you're really trying to remove full duplicate rows from the dataframe.
You can remove full duplicate rows by subsetting the dataframe:
base R:
df_without_dupes <- df[!duplicated(df),]
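For comparison, here is a minimal sketch of what `duplicated()` flags when it is given whole rows versus only the Name and Course columns. The three-row `df_demo` is a hypothetical sample, not the asker's data. Rows that differ only in Category count as duplicates only in the second case.

```r
# Hypothetical three-row sample: rows 1 and 2 share Name and Course
df_demo <- data.frame(
  Name     = c("Jason", "Jason", "Nancy"),
  Course   = c("ML", "ML", "ML"),
  Category = c("PT", "DI", "PT"),
  stringsAsFactors = FALSE
)

# Compares entire rows: no row repeats in every column
full_dupes <- duplicated(df_demo)

# Compares only the key columns: row 2 repeats the Jason/ML pair
key_dupes <- duplicated(df_demo[, c("Name", "Course")])

print(full_dupes)
print(key_dupes)
```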
I would suggest using the dplyr package for this. See below:
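library(dplyr)
data %>%
  mutate(
    # Convert Category to its position in the preference order (PT = 1, ..., SY = 4)
    Category_factored = as.numeric(factor(Category, levels = c('PT','DI','GT','SY')))
  ) %>%
  group_by(Name, Course) %>%
  # Keep the row with the most preferred category in each Name/Course group
  filter(
    Category_factored == min(Category_factored)
  ) %>%
  ungroup()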
In case you are new to R, install dplyr using install.packages('dplyr').
You'll need to create an index to represent the order of the categories. Then sort by the priority of your categories and dedup by Name and Course.
library(tidyverse)
# Create an index to sort by (1 = most preferred)
index.df <- data.frame(Category = c("PT", "DI", "GT", "SY"), Index = c(1, 2, 3, 4),
                       stringsAsFactors = FALSE)
# Join the index to the original dataset
data <- left_join(data, index.df, by = "Category")
# Sort by index, then keep the first row for each Name/Course pair
data %>% arrange(Index) %>%
  distinct(Name, Course, .keep_all = TRUE) %>% select(-Index)
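The same selection can be sketched without a separate index table by ranking inside each group. This assumes dplyr 1.0.0 or later for `slice_min()`; `df_demo` and `pref` are hypothetical names introduced only for this illustration.

```r
library(dplyr)

# Hypothetical copy of the example data from the question
df_demo <- data.frame(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE
)

pref <- c("PT", "DI", "GT", "SY")  # most preferred first

result_tbl <- df_demo %>%
  group_by(Name, Course) %>%
  # Rank each row by its position in pref and keep exactly one row per group
  slice_min(match(Category, pref), n = 1, with_ties = FALSE) %>%
  ungroup()

print(result_tbl)
```

Setting `with_ties = FALSE` guarantees a single row per group even if a Name/Course pair lists the same category twice.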
Quick benchmark of the given solutions:
library(microbenchmark)
library(tidyverse)
library(data.table)
# 1. Data set
df_raw <- data.frame(
name = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason", "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
course = c("ML", "ML", "ML", "ML", "DS", "DS", "ML", "ML", "DS", "DS", "ML", "DS"),
category = c("PT", "DI", "GT", "SY", "SY", "DI", "PT", "SY", "DI", "GT", "SY", "GT"),
stringsAsFactors = FALSE)
# 3. Solution 'basic R'
f1 <- function(){
# 1. Create data set
df <- df_raw
# 2. Convert 'category' as factor
df$category <- factor(df$category, levels = c("PT", "DI", "GT", "SY"))
# 3. Sort by 'category'
df <- df[order(df$category), ]
# 4. Select rows without duplicates by 'name' and 'course'
df[!duplicated(df[,c('name', 'course')]), ]
}
# 4. Solution 'dplyr'
f2 <- function(){
# 1. Create data set
df <- df_raw
# 2. Solution
df %>%
mutate(category_factored = as.numeric(factor(category, levels = c('PT','DI','GT','SY'), labels = 1:4))) %>%
group_by(name, course) %>%
filter(category_factored == min(category_factored))
}
# 5. Solution 'data.table'
f3 <- function(){
# 1. Create data set
df <- df_raw
# 2. Solution
setDT(df)[, .SD[which.min(factor(category, levels = c("PT","DI","GT","SY")))], by=.(name, course)]
}
# 6. Solution 'dplyr' with an index join
f4 <- function(){
# 1. Create data set
df <- df_raw
# 2. Create 'index' to sort by
df_index <- data.frame("category" = c('PT',"DI","GT","SY"), "index" = c(1, 2, 3, 4))
# 3. Join to original dataset
df <- left_join(df, df_index, by = "category")
# 4. Sort by 'index', dedup with 'name' and 'course'
df %>%
arrange(index) %>%
group_by(name, course) %>%
distinct(name, course, .keep_all = TRUE) %>%
select(-index)
}
# Test for solutions
microbenchmark(f1(), f2(), f3(), f4())
Unit: milliseconds
expr min lq mean median uq max neval cld
f1() 1.350875 1.468044 1.682641 1.603816 1.687203 5.006231 100 a
f2() 12.547863 12.864521 13.766343 13.543806 14.227795 18.350335 100 c
f3() 2.517014 2.634612 2.944483 2.792619 2.873013 9.355626 100 b
f4() 21.073892 21.608212 23.246332 22.338600 23.934932 41.883938 100 d
As you can see, f1() and f3() are the fastest solutions on this 12-row data set.
I may be late, but I believe this is the simplest solution. Since you mentioned 10 million rows, I propose a data.table implementation using the easy-to-understand unique function:
require("data.table")
df <- data.table("Name" = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason", "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"), "Course" = c("ML", "ML", "ML", "ML", "DS", "DS", "ML", "ML", "DS", "DS", "ML", "DS"), "category" = c("PT", "DI", "GT", "SY", "SY", "DI", "PT", "SY", "DI", "GT", "SY", "GT"))
# Make category a preference-ordered factor, sort by it, then keep the first row per Name/Course
unique(df[, category := factor(category, levels = c("PT","DI","GT","SY"))][order(category)], by = c("Name", "Course"))
Name Course category
1: Jason ML PT
2: Nancy ML PT
3: Jason DS DI
4: Nancy DS DI
5: John DS GT
6: James ML SY