
Count flags across multiple rows depending on key

I have a dataset that consists of customers and accounts, where a customer can have multiple accounts. The dataset has several 'flags' on each account.

I'm trying to get a count of 'unique' hits on these flags per customer, i.e. if 3 accounts have flag1 I want this to count as 1 hit, but if just one of the accounts has flag2 too I want this to count as 2. Essentially, I want to see how many flags each customer hits across all of their accounts.

Example Input data frame:
    cust  acct flag1 flag2 flag3
    a     123    0    1      0
    a     456    1    1      0
    b     789    1    1      1
    c     428    0    1      0
    c     247    0    1      0
    c     483    0    1      1
Example Output data frame:
    cust  acct flag1 flag2 flag3 UniqueSum
    a     123    0    1      0      2
    a     456    1    1      0      2
    b     789    1    1      1      3
    c     428    0    1      0      2
    c     247    0    1      0      2
    c     483    0    1      1      2

I've tried to use the following:

fSumData <- ddply(fData, "cust", numcolwise(sum, c(flag1,flag2,flag3)))

but this sums the acct column too and gives me one row per customer, whereas I'd like to keep as many rows as the customer has accounts.

Using data.table:

require(data.table) # v1.9.6
dt[, un := sum(sapply(.SD, max)), by = cust, .SDcols = flag1:flag3]

We group by cust and, on each group's subset of the columns flag1, flag2, flag3 (selected using .SD and .SDcols), extract each column's max. Because the flags are 0 or 1, summing those maxima gives the number of flags the customer hits at least once.

We update the original table with these values by reference using the LHS := RHS notation (see the Reference Semantics vignette).


where dt is:

dt = fread('cust  acct flag1 flag2 flag3
a     123    0    1      0
a     456    1    1      0
b     789    1    1      1
c     428    0    1      0
c     247    0    1      0
c     483    0    1      1')
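
To see what the grouped expression computes, it can help to look at the per-customer column maxima before they are summed. This is only a minimal sketch, assuming data.table is installed and the flags are strictly 0/1; the names flag_cols and per_cust are illustrative and not part of the answer above.

```r
library(data.table)

# Same example data, built explicitly so the block is self-contained.
dt <- data.table(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1)
)

flag_cols <- c("flag1", "flag2", "flag3")  # illustrative name

# Per-customer maximum of each flag: 1 if any account has the flag set.
per_cust <- dt[, lapply(.SD, max), by = cust, .SDcols = flag_cols]

# Add the count by reference; every row of a customer gets the same value.
dt[, un := sum(sapply(.SD, max)), by = cust, .SDcols = flag_cols]
```

Naming the columns in a character vector instead of flag1:flag3 is just a choice made here so the selection does not depend on column order.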

One way that comes to mind is to take colSums for each cust and check which are greater than 0. For example,

> tab
  cust acct flag1 flag2 flag3
1    a  123     0     1     0
2    a  456     1     1     0
3    b  789     1     1     1
4    c  428     0     1     0
5    c  247     0     1     0
6    c  483     0     1     1
> uniqueSums <- sapply(tab$cust, function(cust) length(which(colSums(tab[tab$cust == cust,3:5]) > 0)))
> cbind(tab, uniqueSums = uniqueSums)
  cust acct flag1 flag2 flag3 uniqueSums
1    a  123     0     1     0          2
2    a  456     1     1     0          2
3    b  789     1     1     1          3
4    c  428     0     1     0          2
5    c  247     0     1     0          2
6    c  483     0     1     1          2

For each value of cust, the function passed to sapply finds the matching rows, computes the column sums, and counts how many of them are greater than 0.
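
Since the sapply call runs once per row, the same customer is recomputed for every account. A possible variant, sketched here under the assumption that cust is a character column, computes the count once per distinct customer with split and then looks it up by name; flag_cols and per_cust are illustrative names.

```r
tab <- data.frame(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

flag_cols <- c("flag1", "flag2", "flag3")  # illustrative name

# One count per distinct customer: number of flag columns with a positive sum.
per_cust <- sapply(
  split(tab[flag_cols], tab$cust),
  function(d) sum(colSums(d) > 0)
)

# Look the count up by customer name so each account row gets its value.
tab$uniqueSums <- unname(per_cust[as.character(tab$cust)])
```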

Here's an approach using library(dplyr):

df %>% 
  group_by(cust) %>% 
  summarise_each(funs(max), -acct) %>% 
  mutate(UniqueSum = rowSums(.[-1])) %>% 
  select(-starts_with("flag")) %>% 
  right_join(df, "cust")

#Source: local data frame [6 x 6]
#
#    cust UniqueSum  acct flag1 flag2 flag3
#  (fctr)     (dbl) (int) (int) (int) (int)
#1      a         2   123     0     1     0
#2      a         2   456     1     1     0
#3      b         3   789     1     1     1
#4      c         2   428     0     1     0
#5      c         2   247     0     1     0
#6      c         2   483     0     1     1

I was able to answer my own question after reading Roman's post. I did something like this, where fData is my dataset.

fSumData <- ddply(fData, "cust", numcolwise(sum))
fSumData$UniqueHits <- ifelse(fSumData$flag1 >= 1, 1, 0) + ifelse(fSumData$flag2 >= 1, 1, 0) + ifelse(fSumData$flag3 >= 1, 1, 0)

I found this to be a bit faster than Roman's solution when running against my dataset, but am unsure if it's the optimal solution. Thank you all for your input, this helped a ton!
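
Note that summing per customer leaves one row per customer, while the question asked for one row per account. A rough sketch of how the count could be attached back to every account row is given below; it swaps in base R's aggregate and merge for plyr's ddply so that it runs without extra packages, and the name result is illustrative.

```r
fData <- data.frame(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Sum each flag per customer (base R stand-in for ddply + numcolwise(sum)).
fSumData <- aggregate(cbind(flag1, flag2, flag3) ~ cust, data = fData, FUN = sum)

# A flag counts once per customer if at least one account has it.
fSumData$UniqueHits <- rowSums(fSumData[c("flag1", "flag2", "flag3")] >= 1)

# Attach the per-customer count to every account row.
# merge() sorts by the key, so row order may differ from fData.
result <- merge(fData, fSumData[c("cust", "UniqueHits")], by = "cust")
```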

The underused rowsum could also be of use:

rowSums(rowsum(DF[-(1:2)], DF$cust) > 0)[DF$cust]
#a a b c c c 
#2 2 3 2 2 2
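
Unpacked step by step, the one-liner might read as follows. This is a sketch assuming DF holds the example data with cust as a character column in the first position and acct in the second; the intermediate names are illustrative.

```r
DF <- data.frame(
  cust  = c("a", "a", "b", "c", "c", "c"),
  acct  = c(123, 456, 789, 428, 247, 483),
  flag1 = c(0, 1, 1, 0, 0, 0),
  flag2 = c(1, 1, 1, 1, 1, 1),
  flag3 = c(0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# rowsum() adds up rows that share a group label: one row per customer.
flag_totals <- rowsum(DF[-(1:2)], DF$cust)

# TRUE where the customer has the flag on at least one account.
has_flag <- flag_totals > 0

# Count flags per customer, then index by name to repeat per account row.
per_cust <- rowSums(has_flag)
DF$UniqueSum <- unname(per_cust[DF$cust])
```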
