How to count unique values from two columns based on another column (per ID)?

I have 6 million transaction records, so I need something that runs fast. Basically, I have unique customer IDs, the car class each customer reserved, and the class they actually drove in the end. A customer may have one or more rental experiences. For each customer at each point in time, I want to calculate how many distinct car classes he or she has experienced so far, combining the reserved and driven classes.

In fact, my data is not even in this order: the ids and the dates are unsorted. The layout shown below is just for convenience. It would be nice if you could also handle the unsorted case!

Thank you!

The data looks like this:

id  date reserved drove
1   2017    A       B
1   2018    B       A
1   2019    A       C
2   2017    A       B
2   2018    C       D
3   2018    D       D

I want this result:

id  date  experience
1   2017     2 #(A+B)
1   2018     2 #still the same as 2017 because this customer just experienced A and B (A+B)
1   2019     3 #one more experience because C is new car class (A+B+C)
2   2017     2 #(A+B)
2   2018     4 #(A+B+C+D)
3   2018     1 #(D)

How about this? It uses plain Python (a list comprehension over sets), since a pandas DataFrame isn't great for dealing with sets (which is what this problem ultimately is).

import pandas as pd

df = pd.DataFrame([
    [1, 2017, 'a', 'b'],
    [1, 2018, 'a', 'b'],
    [1, 2019, 'a', 'c'],
    [2, 2017, 'a', 'b'],
    [2, 2018, 'c', 'd'],
    [3, 2018, 'd', 'd'],
], columns=['id', 'date', 'reserved', 'drove'])

# one (id, date, {reserved, drove}) tuple per row
list_of_sets = [(v[0], v[1], {v[2], v[3]}) for v in df.values]

# sort by id, then date (not necessary if the data is already sorted)
sorted_list = sorted(list_of_sets, key=lambda t: (t[0], t[1]))

# accumulate each customer's classes so the count covers every row so far,
# not just the current and the previous row
seen = {}
rows = []
for id_, date, classes in sorted_list:
    seen.setdefault(id_, set()).update(classes)
    rows.append((id_, date, len(seen[id_])))

result = pd.DataFrame(rows, columns=['id', 'date', 'count'])

Here's a numpy-based approach:

import numpy as np
# sort the two class columns within each row, so (B, A) becomes (A, B)
df[['reserved','drove']] = np.sort(df[['reserved','drove']], axis=1)
# sort rows by id, reserved and drove
df = df.sort_values(['id','reserved','drove'])

And now let's define some conditions with which to obtain the expected output:

# Does the id differ from the previous row's?
c1 = df.id.ne(df.id.shift()).values
# Does each column differ from the previous row? (checked per column)
c2 = (df[['reserved','drove']].ne(df[['reserved','drove']].shift(1))).values
# Does each column differ from the column to its left? (always True for
# "reserved"; for "drove" it is True when it differs from "reserved")
c3 = (df[['reserved','drove']].ne(df[['reserved','drove']].shift(1, axis=1))).values

# count the new classes in each row, then accumulate them per id
df['experience'] = ((c2 | c1[:,None]) & c3).sum(1)
df = df[['id','date']].assign(experience = df.groupby('id').experience.cumsum())

print(df)

   id  date  experience
0   1  2017           2
1   1  2018           2
2   1  2019           3
3   2  2017           2
4   2  2018           4
5   3  2018           1

It can be done in two lines (and I'm pretty sure someone can pull it off in one):
Create a list of the observed values of reserved and drove for each row, concatenate those lists cumulatively within each id (using cumsum), and then count the distinct values.

df['aux'] = list(map(list, zip(df.reserved, df.drove)))
df['aux_cum'] = [len(set(x)) for x in df.groupby('id')['aux'].apply(lambda x: x.cumsum())]

Output:

   id  date reserved drove     aux  aux_cum
0   1  2017        A     B  [A, B]        2
1   1  2018        B     A  [B, A]        2
2   1  2019        A     C  [A, C]        3
3   2  2017        A     B  [A, B]        2
4   2  2018        C     D  [C, D]        4
5   3  2018        D     D  [D, D]        1

Pretty format:

print(df.drop(['reserved','drove','aux'], axis=1))

   id  date  aux_cum
0   1  2017        2
1   1  2018        2
2   1  2019        3
3   2  2017        2
4   2  2018        4
5   3  2018        1
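
Since the question mentions 6 million unsorted rows, here is one more possible sketch that avoids Python-level loops. It is not taken from any of the answers above. The idea is to reshape the data to one row per (transaction, car class), keep only the first time each customer meets each class, and then take a running total per id. The function name `cumulative_unique_classes` and the helper columns `row` and `car_class` are made up for illustration, and the sketch assumes the column names from the question.

```python
import pandas as pd

def cumulative_unique_classes(df):
    """Return id, date and the running count of distinct car classes per id."""
    # Sort chronologically within each id and number the rows in that order.
    ordered = df.sort_values(['id', 'date']).reset_index(drop=True)
    ordered['row'] = ordered.index

    # Long format: one row per (transaction, car class).
    long = ordered.melt(id_vars=['id', 'row'],
                        value_vars=['reserved', 'drove'],
                        value_name='car_class')

    # Keep only the first transaction in which each id meets each class.
    first_seen = long.sort_values('row').drop_duplicates(['id', 'car_class'])

    # Number of new classes introduced by each transaction (0 if none).
    new_per_row = (first_seen.groupby('row').size()
                             .reindex(ordered['row'], fill_value=0))

    # Running total of new classes per id.
    ordered['experience'] = new_per_row.to_numpy()
    ordered['experience'] = ordered.groupby('id')['experience'].cumsum()
    return ordered[['id', 'date', 'experience']]

# Data from the question, shuffled to mimic the unsorted input.
data = pd.DataFrame({
    'id':       [1, 1, 1, 2, 2, 3],
    'date':     [2017, 2018, 2019, 2017, 2018, 2018],
    'reserved': ['A', 'B', 'A', 'A', 'C', 'D'],
    'drove':    ['B', 'A', 'C', 'B', 'D', 'D'],
}).sample(frac=1, random_state=0)

result = cumulative_unique_classes(data)
print(result)  # experience column: [2, 2, 3, 2, 4, 1]
```

The result comes back sorted by id and date, so the original row order is not preserved. If two transactions share the same id and date, their relative order is arbitrary in this sketch.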
