Match and Remove Rows Based on Condition R
I've got an interesting one for you all.
I'm looking to first look through the ID column and identify duplicate values. Once those are identified, the code should go through the incomes of the duplicated IDs and keep the row with the largest income.
So if there are three rows with an ID of 2, it will look for the one with the highest income and keep that row.
ID  Income
1   98765
2   3456
2   67
2   5498
5   23
6   98
7   5645
7   67871
9   983754
10  982
10  2374
10  875
10  4744
11  6853
I know it's as easy as subsetting based on a condition, but I don't know how to remove rows based on whether the income in one row is greater than in another (only when the IDs match).
I was thinking of using an ifelse statement to create a new column to identify duplicates (through subsetting or not), then using the new column's values in another ifelse to identify the larger income. From there I can just subset based on the new columns I have created.
Is there a faster, more efficient way of doing this?
The outcome should look like this.
ID  Income
1   98765
2   5498
5   23
6   98
7   67871
9   983754
10  4744
11  6853
Thank you
We can slice the rows by checking the highest value in 'Income' grouped by 'ID'
library(dplyr)
df1 %>%
  group_by(ID) %>%
  slice(which.max(Income))
Or using data.table
library(data.table)
setDT(df1)[, .SD[which.max(Income)], by = ID]
Or with base R
df1[with(df1, ave(Income, ID, FUN = max) == Income),]
# ID Income
#1 1 98765
#4 2 5498
#5 5 23
#6 6 98
#8 7 67871
#9 9 983754
#13 10 4744
#14 11 6853
df1 <- structure(list(ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L,
10L, 10L, 10L, 11L), Income = c(98765L, 3456L, 67L, 5498L, 23L,
98L, 5645L, 67871L, 983754L, 982L, 2374L, 875L, 4744L, 6853L)),
class = "data.frame", row.names = c(NA,
-14L))
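One difference between these variants shows up only when a group has tied maxima, which the sample data does not. The sketch below uses a small hypothetical data frame (`ties`, not from the question) to illustrate it in base R: the `ave()` comparison keeps every row equal to the group maximum, whereas `which.max()` returns only the first maximum per group.

```r
# Hypothetical data with a tie for ID 2 (not part of the original post)
ties <- data.frame(ID = c(1L, 2L, 2L, 3L),
                   Income = c(10L, 50L, 50L, 7L))

# ave() comparison: keeps every row whose Income equals its group's maximum
keep_all <- ties[with(ties, ave(Income, ID, FUN = max) == Income), ]

# which.max() per group: keeps only the first row holding the maximum
first_max <- unlist(lapply(split(seq_len(nrow(ties)), ties$ID),
                           function(i) i[which.max(ties$Income[i])]))
keep_one <- ties[first_max, ]

nrow(keep_all)  # both tied rows for ID 2 are kept
nrow(keep_one)  # exactly one row per ID
```

If exactly one row per ID is required, the `which.max()` based variants are probably the safer choice.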
order with duplicated (base R)
df=df[order(df$ID,-df$Income),]
df[!duplicated(df$ID),]
ID Income
1 1 98765
4 2 5498
5 5 23
6 6 98
8 7 67871
9 9 983754
13 10 4744
14 11 6853
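The `-df$Income` trick relies on Income being numeric, and the snippet above overwrites `df` with the sorted copy. A possible variant, sketched here under the assumption that R 3.3.0 or later is available (for a per-key `decreasing` vector with `method = "radix"`), leaves the original data frame untouched. The names `ord`, `sorted` and `result` are only illustrative.

```r
# Same sample data as in the question
df <- data.frame(
  ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 10L, 10L, 10L, 11L),
  Income = c(98765L, 3456L, 67L, 5498L, 23L, 98L, 5645L, 67871L,
             983754L, 982L, 2374L, 875L, 4744L, 6853L)
)

# Sort by ID ascending and Income descending; a per-key 'decreasing'
# vector requires method = "radix"
ord <- order(df$ID, df$Income, decreasing = c(FALSE, TRUE), method = "radix")
sorted <- df[ord, ]

# The first row of each ID now holds that ID's largest Income
result <- sorted[!duplicated(sorted$ID), ]
result
```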
Here is another dplyr method. We can arrange the rows by ID and descending Income, then slice the first row of each group.
library(dplyr)
df2 <- df %>%
arrange(ID, desc(Income)) %>%
group_by(ID) %>%
slice(1) %>%
ungroup()
df2
# # A tibble: 8 x 2
# ID Income
# <int> <int>
# 1 1 98765
# 2 2 5498
# 3 5 23
# 4 6 98
# 5 7 67871
# 6 9 983754
# 7 10 4744
# 8 11 6853
DATA
df <- read.table(text = "ID Income
1 98765
2 3456
2 67
2 5498
5 23
6 98
7 5645
7 67871
9 983754
10 982
10 2374
10 875
10 4744
11 6853",
header = TRUE)
group_by and summarise from dplyr would work too
df1 %>%
group_by(ID) %>%
summarise(Income=max(Income))
ID Income
<int> <dbl>
1 1 98765.
2 2 5498.
3 5 23.
4 6 98.
5 7 67871.
6 9 983754.
7 10 4744.
8 11 6853.
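Note that `summarise` returns only the grouping column and the summary, so any other columns in the data frame are dropped. If the whole row should be kept, something like the following might work, assuming dplyr 1.0.0 or later (where `slice_max` is available). The `Region` column and the `people` data frame are hypothetical, added only to show that extra columns survive.

```r
library(dplyr)

# Hypothetical data: 'Region' is not in the original question
people <- data.frame(
  ID = c(1L, 2L, 2L, 2L),
  Income = c(98765L, 3456L, 67L, 5498L),
  Region = c("N", "S", "E", "W")
)

# Keep the full row with the largest Income per ID;
# with_ties = FALSE guarantees a single row per group
top_rows <- people %>%
  group_by(ID) %>%
  slice_max(Income, n = 1, with_ties = FALSE) %>%
  ungroup()

top_rows
```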
Using sqldf: group by ID and select the corresponding max Income
library(sqldf)
sqldf("select ID,max(Income) from df group by ID")
Output:
ID max(Income)
1 1 98765
2 2 5498
3 5 23
4 6 98
5 7 67871
6 9 983754
7 10 4744
8 11 6853
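The result column above is named `max(Income)`, which is awkward to reference later. A small variation, sketched here with an SQL alias and an explicit `order by` (both additions of this sketch, not part of the answer), keeps the original column name. It assumes the sqldf package is installed; `res` is just an illustrative name.

```r
library(sqldf)

# Same sample data as in the question
df <- data.frame(
  ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 10L, 10L, 10L, 11L),
  Income = c(98765L, 3456L, 67L, 5498L, 23L, 98L, 5645L, 67871L,
             983754L, 982L, 2374L, 875L, 4744L, 6853L)
)

# Alias the aggregate so the result keeps the name 'Income',
# and order by ID so the row order does not depend on the SQL engine
res <- sqldf("select ID, max(Income) as Income from df group by ID order by ID")
res
```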