Add a column using dplyr in R based on whether a value is duplicated in other rows

I would like to add a column to a data frame based on whether each value is duplicated in other rows. My data frame looks like this:

group label value   newColumn
1     1     3
1     2     4
1     3     3
1     4     5
1     5     4
2     1     6
2     2     3
2     3     9
2     4     6
2     5     1
2     6     3

I want to add a column:

if df$value[i] is duplicated and df$value[i] is the original (first occurrence), set df$newColumn[i] to 0;
if df$value[i] is duplicated and df$value[i] is a later duplicate, set df$newColumn[i] to the label of the original;
if df$value[i] is not duplicated, set df$newColumn[i] to 0.

For example:

df$value[1] = 3 is duplicated, but it is the original, so we set newColumn[1] = 0;
df$value[3] = 3 is duplicated, and it is the duplicate, so we set newColumn[3] = 1 (=df$label[1]);

Here is my code:

library(dplyr)

df <- df %>%
  group_by(group) %>%
  mutate(
    newColumn = ifelse(
      row_number() == min(which(duplicated(value) | duplicated(value, fromLast = TRUE))),
      label[max(which(duplicated(value) | duplicated(value, fromLast = TRUE)))],
      0
    )
  )

But it does not work. Any suggestions? Thank you in advance!

Here's a solution using ave():

df$newColumn <- ave(df$label, df$value, FUN = function(x) c(0L, rep(x[1L], length(x) - 1L)))
df
##    group label value newColumn
## 1      1     1     3         0
## 2      1     2     4         0
## 3      1     3     3         1
## 4      1     4     5         0
## 5      1     5     4         2
## 6      2     1     6         0
## 7      2     2     3         1
## 8      2     3     9         0
## 9      2     4     6         1
## 10     2     5     1         0
## 11     2     6     3         1

ave() splits its first argument into groups according to the second argument and calls the lambda once for each group. So, for example, for all rows where df$value is equal to 3, ave() constructs a vector of the df$label values from those rows and calls the lambda with x equal to that vector. The return value of each call is expected to contain the same number of elements as x (otherwise it is recycled as necessary).

The return values of all calls of the lambda are then combined into one final vector, with each element of each return value placed at the position of its counterpart in the input. This lets us build the final column vector one group at a time. Since your problem requires zero for the first element of each group and the label of that first element (the original) for all subsequent elements, the lambda builds the subvector by combining zero with that label repeated enough times to cover the rest of the group.
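The split-and-reassemble behaviour described above can be seen by doing the split by hand. This is only a sketch: the data frame is rebuilt from the table in the question, label is assumed to be an integer column, and the names pieces and first_label_for_dups are made up for illustration.

```r
# Data from the question (newColumn omitted). Column types are an assumption.
df <- data.frame(
  group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  label = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L),
  value = c(3, 4, 3, 5, 4, 6, 3, 9, 6, 1, 3)
)

# Labels split by value: each list element is the x that one call of FUN receives.
pieces <- split(df$label, df$value)
pieces[["3"]]                        # labels of the rows whose value is 3: 1 3 2 6

# The lambda from the answer, given a (hypothetical) name so it can be called directly.
first_label_for_dups <- function(x) c(0L, rep(x[1L], length(x) - 1L))
first_label_for_dups(pieces[["3"]])  # 0 1 1 1

# ave() performs the split, the per-group call and the reassembly in one step.
df$newColumn <- ave(df$label, df$value, FUN = first_label_for_dups)
df$newColumn                         # 0 0 1 0 2 0 1 0 1 0 1
```

The four results for value 3 land back in rows 1, 3, 7 and 11, which is why those rows show 0, 1, 1, 1 in the output above.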

We can also use data.table:

library(data.table)
setDT(df)[, newColumn := c(0, rep(label[1L], .N-1)) , value]
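Since the question asked for dplyr, the same idea might be written as below. This is a sketch rather than a tested drop-in: the data frame is rebuilt from the question with label assumed to be integer, and by_value and within_group are illustrative names. Note that the ave() and data.table answers above match duplicates on value across the whole data frame and ignore the group column. If duplicates should only count within the same group, as the group_by(group) in the question suggests, grouping by both columns is one way to get that.

```r
library(dplyr)

# Data from the question (newColumn omitted). Column types are an assumption.
df <- data.frame(
  group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  label = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L),
  value = c(3, 4, 3, 5, 4, 6, 3, 9, 6, 1, 3)
)

# Same rule as the answers above: duplicates are matched on value alone.
by_value <- df %>%
  group_by(value) %>%
  mutate(newColumn = if_else(row_number() == 1L, 0L, first(label))) %>%
  ungroup()

# Variant: a value only counts as a duplicate if it repeats within the same group.
within_group <- df %>%
  group_by(group, value) %>%
  mutate(newColumn = if_else(row_number() == 1L, 0L, first(label))) %>%
  ungroup()

by_value$newColumn      # 0 0 1 0 2 0 1 0 1 0 1
within_group$newColumn  # 0 0 1 0 2 0 0 0 1 0 2
```

The two results differ only in group 2: with grouping by value alone, rows 7 and 11 point back to label 1 from group 1, while the within-group variant treats row 7 as the original and gives row 11 its label, 2.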
