Count flags across multiple rows depending on key
I have a dataset that consists of customers and accounts, where a customer can have multiple accounts. The dataset has several 'flags' on each account.
I'm trying to get a count of 'unique' hits on these flags per customer, i.e. if 3 accounts have flag1 I want this to count as 1 hit, but if just one of the accounts has flag2 too I want this to count as 2. Essentially, I want to see how many distinct flags each customer hits across all of their accounts.
Example Input data frame:
cust acct flag1 flag2 flag3
a 123 0 1 0
a 456 1 1 0
b 789 1 1 1
c 428 0 1 0
c 247 0 1 0
c 483 0 1 1
Example Output dataframe:
cust acct flag1 flag2 flag3 UniqueSum
a 123 0 1 0 2
a 456 1 1 0 2
b 789 1 1 1 3
c 428 0 1 0 2
c 247 0 1 0 2
c 483 0 1 1 2
I've tried to use the following:
fSumData <- ddply(fData, "cust", numcolwise(sum, c(flag1,flag2,flag3)))
but this sums the acct column too, giving me one row per customer, whereas I'd like to keep as many rows as the customer has accounts.
Using data.table:
require(data.table) # v1.9.6
dt[, un := sum(sapply(.SD, max)), by = cust, .SDcols = flag1:flag3]
We group by cust, and on the subset of data for each group, restricted to the columns flag1, flag2, flag3 (achieved using .SD and .SDcols), we extract each column's max. Since the flags are 0/1, summing these maxima gives the number of flags that are set on at least one of the customer's accounts.
We update the original table with these values by reference using the LHS := RHS notation (see the Reference Semantics vignette).
where dt is:
dt = fread('cust acct flag1 flag2 flag3
a 123 0 1 0
a 456 1 1 0
b 789 1 1 1
c 428 0 1 0
c 247 0 1 0
c 483 0 1 1')
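For readers without data.table, the same grouping logic (per-customer column maxima, then their sum) can be sketched in base R. This is only an illustration and assumes the flags are coded 0/1; the names tab, flag_cols and counts are made up here and do not come from the answer above.

```r
# Example data from the question, rebuilt as a plain data.frame
tab <- data.frame(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1)
)
flag_cols <- c("flag1", "flag2", "flag3")  # assumed flag column names

# One data.frame of flag columns per customer, then max per column, then sum
counts <- sapply(split(tab[flag_cols], tab$cust),
                 function(g) sum(sapply(g, max)))

# Broadcast the per-customer count back to every account row
tab$UniqueSum <- unname(counts[as.character(tab$cust)])
```

Unlike the := update, this version copies the data rather than modifying it by reference, which may matter for large tables.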
One way that comes to mind is to take colSums for each cust and check which sums are greater than 0. For example,
> tab
cust acct flag1 flag2 flag3
1 a 123 0 1 0
2 a 456 1 1 0
3 b 789 1 1 1
4 c 428 0 1 0
5 c 247 0 1 0
6 c 483 0 1 1
> uniqueSums <- sapply(tab$cust, function(cust) length(which(colSums(tab[tab$cust == cust,3:5]) > 0)))
> cbind(tab, uniqueSums = uniqueSums)
cust acct flag1 flag2 flag3 uniqueSums
1 a 123 0 1 0 2
2 a 456 1 1 0 2
3 b 789 1 1 1 3
4 c 428 0 1 0 2
5 c 247 0 1 0 2
6 c 483 0 1 1 2
For each value of cust, the function in sapply finds that customer's rows, does a vectorized column sum over the flag columns, and counts the sums that are greater than 0.
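To make the steps concrete, here is the same computation unrolled for the single customer "a". It is a small sketch on the example data; the intermediate names rows_a, sums_a, hit_a and n_a are introduced only for illustration.

```r
# Example data from the question
tab <- data.frame(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1)
)

rows_a <- tab[tab$cust == "a", 3:5]  # flag columns of customer "a"
sums_a <- colSums(rows_a)            # flag1 = 1, flag2 = 2, flag3 = 0
hit_a  <- sums_a > 0                 # TRUE, TRUE, FALSE
n_a    <- length(which(hit_a))       # 2; sum(hit_a) gives the same count
```

Because sapply runs over tab$cust, this work is repeated once per account row rather than once per customer, which can become slow on large data.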
Here's an approach using library(dplyr):
df %>%
group_by(cust) %>%
summarise_each(funs(max), -acct) %>%
mutate(UniqueSum = rowSums(.[-1])) %>%
select(-starts_with("flag")) %>%
right_join(df, "cust")
#Source: local data frame [6 x 6]
#
# cust UniqueSum acct flag1 flag2 flag3
# (fctr) (dbl) (int) (int) (int) (int)
#1 a 2 123 0 1 0
#2 a 2 456 1 1 0
#3 b 3 789 1 1 1
#4 c 2 428 0 1 0
#5 c 2 247 0 1 0
#6 c 2 483 0 1 1
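summarise_each() and funs() have since been deprecated in dplyr. As a hedged alternative (not the answerer's code), a grouped mutate() keeps one row per account and so avoids the join. The sketch assumes 0/1 flags and rebuilds df from the question's example; the name result is hypothetical.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1)
)

result <- df %>%
  group_by(cust) %>%
  # max() of a 0/1 flag is 1 when any account of the customer has it set
  mutate(UniqueSum = max(flag1) + max(flag2) + max(flag3)) %>%
  ungroup()
```

With many flag columns, spelling out each max() becomes tedious, and a column-wise helper would be the more natural choice.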
I was able to answer my own question after reading Roman's post. I did something like this, where fData is my dataset.
fSumData <- ddply(fData, "cust", numcolwise(sum))
fSumData$UniqueHits <- ifelse(fSumData$flag1 >= 1, 1, 0) + ifelse(fSumData$flag2 >= 1, 1, 0) + ifelse(fSumData$flag3 >= 1, 1, 0)
I found this to be a bit faster than Roman's solution when running against my dataset, but I am unsure whether it's the optimal solution. Thank you all for your input, this helped a ton!
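The ddply() result has one row per customer. To get back to one row per account, as in the example output, the count can be merged onto the original data. The sketch below uses base R's aggregate() as a stand-in for the ddply() call (a swapped-in technique, so that it runs without plyr); the name result is hypothetical.

```r
# Example data from the question
fData <- data.frame(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1)
)

# Per-customer sums of the flag columns only (acct is left out)
fSumData <- aggregate(cbind(flag1, flag2, flag3) ~ cust, data = fData, FUN = sum)

# A logical comparison is coerced to 0/1 when added, replacing the ifelse() calls
fSumData$UniqueHits <- (fSumData$flag1 >= 1) + (fSumData$flag2 >= 1) +
  (fSumData$flag3 >= 1)

# Attach the per-customer count to every account row
result <- merge(fData, fSumData[c("cust", "UniqueHits")], by = "cust")
```

merge() may reorder rows, so match on acct if the original row order matters.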
The underused rowsum could also be of use:
rowSums(rowsum(DF[-(1:2)], DF$cust) > 0)[DF$cust]
#a a b c c c
#2 2 3 2 2 2
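Since the one-liner is dense, here it is split into its intermediate steps. This is only a sketch on the question's example data; DF is assumed to hold cust and acct in its first two columns, and the names group_sums, any_hit and per_cust are introduced for illustration.

```r
# Example data from the question
DF <- data.frame(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

group_sums <- rowsum(DF[-(1:2)], DF$cust)  # one row per customer, flags summed
any_hit    <- group_sums > 0               # logical matrix: flag seen at least once?
per_cust   <- rowSums(any_hit)             # named vector: a = 2, b = 3, c = 2

# Index by customer name to repeat the count for every account row
DF$UniqueSum <- unname(per_cust[as.character(DF$cust)])
```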