Counting flags across multiple rows by key

Question

I have a dataset of customers and accounts, where a customer can have multiple accounts. Each account carries several 'flags'.

I'm trying to get a count of 'unique' hits on these flags per customer, i.e. if 3 accounts have flag1 I want this to count as 1 hit, but if just one of the accounts has flag2 too I want this to count as 2. Essentially, I want to see how many distinct flags each customer hits across all of their accounts.

Example Input data frame:
    cust  acct flag1 flag2 flag3
    a     123    0    1      0
    a     456    1    1      0
    b     789    1    1      1
    c     428    0    1      0
    c     247    0    1      0
    c     483    0    1      1
Example Output dataframe:
    cust  acct flag1 flag2 flag3 UniqueSum
    a     123    0    1      0      2
    a     456    1    1      0      2
    b     789    1    1      1      3
    c     428    0    1      0      2
    c     247    0    1      0      2
    c     483    0    1      1      2

I've tried the following:

fSumData <- ddply(fData, "cust", numcolwise(sum, c(flag1,flag2,flag3)))

but this also sums the acct column and gives me one row per customer, whereas I'd like to keep as many rows as the customer has accounts.

Answer 1

Using data.table:

require(data.table) # v1.9.6
dt[, un := sum(sapply(.SD, max)), by = cust, .SDcols = flag1:flag3]

We group by cust, and on the subset of data for each group, restricted to the columns flag1, flag2, flag3 (achieved using .SD and .SDcols), we extract each column's max. Because the flags are 0/1, summing these maxima gives the number of flags that have at least one 1 in the group.

We update the original table with these values by reference using the LHS := RHS notation (see the Reference Semantics vignette).

where dt is:

dt = fread('cust  acct flag1 flag2 flag3
a     123    0    1      0
a     456    1    1      0
b     789    1    1      1
c     428    0    1      0
c     247    0    1      0
c     483    0    1      1')
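
To make the max-then-sum logic concrete, here is a rough base R sketch of the same idea without data.table. It is not the code from this answer: the names df, flags and grp_max are illustrative only, and the data frame simply re-enters the example data from the question.

```r
# Illustrative re-entry of the question's example data (df is a hypothetical name)
df <- data.frame(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1)
)
flags <- c("flag1", "flag2", "flag3")

# Step 1: per customer, take the maximum of each flag column
# (1 if any of the customer's accounts has the flag, else 0)
grp_max <- aggregate(df[flags], by = list(cust = df$cust), FUN = max)

# Step 2: sum the maxima across the flag columns
grp_max$UniqueSum <- rowSums(grp_max[flags])

# Step 3: copy each customer's total back onto every account row
df$UniqueSum <- grp_max$UniqueSum[match(df$cust, grp_max$cust)]
df
```

The data.table one-liner does all three steps in a single grouped expression; this version only spells them out.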

Answer 2

One way that comes to mind is to take colSums for each cust and check which sums are greater than 0. For example,

> tab
  cust acct flag1 flag2 flag3
1    a  123     0     1     0
2    a  456     1     1     0
3    b  789     1     1     1
4    c  428     0     1     0
5    c  247     0     1     0
6    c  483     0     1     1
> uniqueSums <- sapply(tab$cust, function(cust) length(which(colSums(tab[tab$cust == cust,3:5]) > 0)))
> cbind(tab, uniqueSums = uniqueSums)
  cust acct flag1 flag2 flag3 uniqueSums
1    a  123     0     1     0          2
2    a  456     1     1     0          2
3    b  789     1     1     1          3
4    c  428     0     1     0          2
5    c  247     0     1     0          2
6    c  483     0     1     1          2

For each value of cust, the function in sapply finds the matching rows, does a vectorized column sum, and counts the sums that are greater than 0.
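
As a small worked illustration of those steps, the sketch below walks through a single customer ("a") by hand. The intermediate names rows_a, sums_a and hits_a are made up for this example, and tab is rebuilt here from the table shown above.

```r
# Rebuild the example table (same values as printed above)
tab <- data.frame(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1)
)

# Rows and flag columns belonging to customer "a"
rows_a <- tab[tab$cust == "a", 3:5]

# Column sums across those rows: flag1 = 1, flag2 = 2, flag3 = 0
sums_a <- colSums(rows_a)

# Number of flags with a positive sum: 2
hits_a <- length(which(sums_a > 0))
hits_a
```

Note that the sapply call repeats this computation once per row rather than once per customer, which is fine for small data but does redundant work when customers have many accounts.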

Answer 3

Here's an approach using library(dplyr):

df %>% 
  group_by(cust) %>% 
  summarise_each(funs(max), -acct) %>% 
  mutate(UniqueSum = rowSums(.[-1])) %>% 
  select(-starts_with("flag")) %>% 
  right_join(df, "cust")

#Source: local data frame [6 x 6]
#
#    cust UniqueSum  acct flag1 flag2 flag3
#  (fctr)     (dbl) (int) (int) (int) (int)
#1      a         2   123     0     1     0
#2      a         2   456     1     1     0
#3      b         3   789     1     1     1
#4      c         2   428     0     1     0
#5      c         2   247     0     1     0
#6      c         2   483     0     1     1

Answer 4

I was able to answer my own question after reading Roman's post. I did something like this, where fData is my dataset.

library(plyr)
# Sum every numeric column per customer (one row per customer)
fSumData <- ddply(fData, "cust", numcolwise(sum))
fSumData$UniqueHits <- ifelse(fSumData$flag1 >= 1, 1, 0) + ifelse(fSumData$flag2 >= 1, 1, 0) + ifelse(fSumData$flag3 >= 1, 1, 0)

I found this to be a bit faster than Roman's solution when running against my dataset, but am unsure if it's the optimal solution. Thank you all for your input, this helped a ton!

Answer 5

The underused rowsum could also be of use:

rowSums(rowsum(DF[-(1:2)], DF$cust) > 0)[DF$cust]
#a a b c c c 
#2 2 3 2 2 2
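
For readers unfamiliar with rowsum, the one-liner can be unpacked roughly as follows. This is only a sketch: DF is rebuilt from the question's example data, and per_cust, hit and counts are illustrative names. The as.character() call is an added precaution so that the final lookup is by customer name whether or not cust is stored as a factor.

```r
# Rebuild the example data (assumed to match the DF used above)
DF <- data.frame(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1)
)

# Column sums of the flag columns, one row per customer
per_cust <- rowsum(DF[-(1:2)], DF$cust)

# TRUE where a customer has at least one account with that flag
hit <- per_cust > 0

# Number of distinct flags per customer, named by customer
counts <- rowSums(hit)

# Expand back to one value per account row
counts[as.character(DF$cust)]
```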

5 answers

Answer 1: 3, 2015-11-18 23:02:14
Answer 2: 1, accepted, 2015-11-18 21:32:11
Answer 3: 1, 2015-11-18 22:50:42
Answer 4: 0, 2015-11-18 23:23:30
Answer 5: 0, 2015-11-19 13:24:03
