Groupby two columns and create a new column based on a conditional subtraction in python
How to groupby multiple columns and create a new column in Python based on thresholds
I have a data frame as shown below.
Input
Invoice No  Date       Text           Vendor  Days
1000001     1/1/2020   Rent Payment   A       0
1000003     2/1/2020   Rent Payment   A       1
1000005     4/1/2020   Rent Payment   A       2
1000007     6/1/2020   Water payment  A       2
1000008     9/2/2020   Rep Payment    A       34
1000010     9/2/2020   Car Payment    A       0
1000011     10/2/2020  Car Payment    A       1
1000012     15/2/2020  Car Payment    A       5
1000013     16/2/2020  Car Payment    A       1
1000015     17/2/2020  Car Payment    A       1
1000002     1/1/2020   Rent Payment   B       -47
1000004     4/1/2020   Con Payment    B       3
1000006     6/1/2020   Con Payment    B       2
1000009     9/2/2020   Water payment  B       34
1000014     17/2/2020  Test Payment   B       8
1000016     19/2/2020  Test Payment   B       2
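Condition
How can I write a Python condition that checks the description (Text), vendor name, and Days columns? If the description and vendor name are the same and Days <= 2, those rows should be grouped together under a common group name, for example G1. Every other row can be assigned its own unique group name. Each set of grouped rows should have a unique group name, as shown in the output.
Expected output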
Invoice No  Date       Text           Vendor  Days  Group
1000001     1/1/2020   Rent Payment   A       0     G1
1000003     2/1/2020   Rent Payment   A       1     G1
1000005     4/1/2020   Rent Payment   A       2     G1
1000007     6/1/2020   Water payment  A       2     G2
1000008     9/2/2020   Rep Payment    A       34    G3
1000010     9/2/2020   Car Payment    A       0     G4
1000011     10/2/2020  Car Payment    A       1     G4
1000012     15/2/2020  Car Payment    A       5     G5
1000013     16/2/2020  Car Payment    A       1     G5
1000015     17/2/2020  Car Payment    A       1     G5
1000002     1/1/2020   Rent Payment   B       -47   G6
1000004     4/1/2020   Con Payment    B       3     G7
1000006     6/1/2020   Con Payment    B       2     G7
1000009     9/2/2020   Water payment  B       34    G8
1000014     17/2/2020  Test Payment   B       8     G9
1000016     19/2/2020  Test Payment   B       2     G9
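You need to use groupby on three items: 'Text', 'Vendor', and a boolean representation of whether 'Days' changes by more than 2 within the groups defined by ['Text', 'Vendor'] alone.
After that, you need to name the unique groups. I provide two methods below.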
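To make the three grouping keys concrete, here is a minimal sketch that rebuilds the question's input (only the Text, Vendor and Days columns; Invoice No and Date are left out for brevity) and computes the third key. The names `block` and `keys` are mine, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question's input table
# (Invoice No and Date omitted, since the grouping does not use them).
df = pd.DataFrame({
    "Text": ["Rent Payment"] * 3 + ["Water payment", "Rep Payment"]
            + ["Car Payment"] * 5 + ["Rent Payment"] + ["Con Payment"] * 2
            + ["Water payment"] + ["Test Payment"] * 2,
    "Vendor": ["A"] * 10 + ["B"] * 6,
    "Days": [0, 1, 2, 2, 34, 0, 1, 5, 1, 1, -47, 3, 2, 34, 8, 2],
})

# Third key: within each (Text, Vendor) group, a counter that goes up by one
# every time Days rises by more than 2 relative to the previous row.
block = df.groupby(["Text", "Vendor"])["Days"].transform(
    lambda s: s.diff().fillna(0).gt(2).cumsum()
)

# Each distinct (Text, Vendor, block) combination becomes one final group.
keys = df[["Text", "Vendor"]].assign(block=block)
print(keys.drop_duplicates())
```

On this data only the 'Car Payment' / 'A' rows are split (Days jumps from 1 to 5), which gives nine distinct key combinations, matching G1 to G9 in the expected output.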
ngroup
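# Within each (Text, Vendor) group, start a new block whenever Days rises by more than 2
f = lambda x: x.diff().fillna(0).gt(2).cumsum()
d = df.groupby(['Text', 'Vendor']).Days.transform(f)
# Number the (Text, Vendor, block) groups in order of first appearance
g = df.groupby(['Text', 'Vendor', d], sort=False).ngroup()
# Turn 0, 1, 2, ... into 'G1', 'G2', 'G3', ...
df.assign(Group=g.add(1).astype(str).radd('G'))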
    Invoice No  Date       Text           Vendor  Days  Group
0   1000001     1/1/2020   Rent Payment   A       0     G1
1   1000003     2/1/2020   Rent Payment   A       1     G1
2   1000005     4/1/2020   Rent Payment   A       2     G1
3   1000007     6/1/2020   Water payment  A       2     G2
4   1000008     9/2/2020   Rep Payment    A       34    G3
5   1000010     9/2/2020   Car Payment    A       0     G4
6   1000011     10/2/2020  Car Payment    A       1     G4
7   1000012     15/2/2020  Car Payment    A       5     G5
8   1000013     16/2/2020  Car Payment    A       1     G5
9   1000015     17/2/2020  Car Payment    A       1     G5
10  1000002     1/1/2020   Rent Payment   B       -47   G6
11  1000004     4/1/2020   Con Payment    B       3     G7
12  1000006     6/1/2020   Con Payment    B       2     G7
13  1000009     9/2/2020   Water payment  B       34    G8
14  1000014     17/2/2020  Test Payment   B       8     G9
15  1000016     19/2/2020  Test Payment   B       2     G9
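The sort=False argument matters for the numbering. As a small illustration on made-up data (the frame `toy` and its `key` column are hypothetical, not from the question), this sketch compares the group numbers with and without sorting and then applies the same 'G' prefix chain:

```python
import pandas as pd

toy = pd.DataFrame({"key": ["b", "b", "a", "c", "a"]})

# sort=False: groups are numbered in order of first appearance.
by_appearance = toy.groupby("key", sort=False).ngroup()
# sort=True (the default): groups are numbered in sorted key order.
by_sorted_key = toy.groupby("key").ngroup()

print(by_appearance.tolist())  # [0, 0, 1, 2, 1]
print(by_sorted_key.tolist())  # [1, 1, 0, 2, 0]

# Shift to 1-based numbers, convert to strings, then prepend 'G'.
labels = by_appearance.add(1).astype(str).radd("G")
print(labels.tolist())  # ['G1', 'G1', 'G2', 'G3', 'G2']
```

Without sort=False the labels would follow the alphabetical order of the keys, so the first row of the frame would not necessarily be G1.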
factorize
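import pandas as pd

f = lambda x: x.diff().fillna(0).gt(2).cumsum()
d = df.groupby(['Text', 'Vendor']).Days.transform(f)
# factorize numbers each distinct (Text, Vendor, block) tuple in order of first appearance
g = pd.factorize([*zip(df.Text, df.Vendor, d)])[0]
df.assign(Group=[f'G{i + 1}' for i in g])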
    Invoice No  Date       Text           Vendor  Days  Group
0   1000001     1/1/2020   Rent Payment   A       0     G1
1   1000003     2/1/2020   Rent Payment   A       1     G1
2   1000005     4/1/2020   Rent Payment   A       2     G1
3   1000007     6/1/2020   Water payment  A       2     G2
4   1000008     9/2/2020   Rep Payment    A       34    G3
5   1000010     9/2/2020   Car Payment    A       0     G4
6   1000011     10/2/2020  Car Payment    A       1     G4
7   1000012     15/2/2020  Car Payment    A       5     G5
8   1000013     16/2/2020  Car Payment    A       1     G5
9   1000015     17/2/2020  Car Payment    A       1     G5
10  1000002     1/1/2020   Rent Payment   B       -47   G6
11  1000004     4/1/2020   Con Payment    B       3     G7
12  1000006     6/1/2020   Con Payment    B       2     G7
13  1000009     9/2/2020   Water payment  B       34    G8
14  1000014     17/2/2020  Test Payment   B       8     G9
15  1000016     19/2/2020  Test Payment   B       2     G9
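If pd.factorize is unfamiliar, this small sketch on made-up values (the array `values` is hypothetical) shows what it returns: an integer code per element, assigned in order of first appearance, plus the array of distinct values.

```python
import numpy as np
import pandas as pd

values = np.array(["b", "b", "a", "c", "a"])

# codes: one integer per element; uniques: distinct values in first-seen order.
codes, uniques = pd.factorize(values)
print(codes.tolist())  # [0, 0, 1, 2, 1]
print(list(uniques))   # ['b', 'a', 'c']

# Same labelling step as in the answer: 0-based code -> 'G1', 'G2', ...
labels = [f"G{i + 1}" for i in codes]
print(labels)  # ['G1', 'G1', 'G2', 'G3', 'G2']
```

Because the codes follow first appearance, the result lines up with the ngroup version that uses sort=False.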
# x.diff()    adjacent differences
# .fillna(0)  the first element of each group will get NaN, so we fill it in with 0
# .gt(2)      should be obvious
# .cumsum()   cumulatively summing True/False will create a new value every time
#             we see a True. This creates groups
f = lambda x: x.diff().fillna(0).gt(2).cumsum()
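To see each step of f at work, here is a sketch that applies the chain one method at a time to the Days values of the 'Car Payment' / vendor 'A' rows from the question. The names `days` and `step1` to `step4` are only for illustration.

```python
import pandas as pd

# Days for the 'Car Payment' / 'A' rows in the question
days = pd.Series([0, 1, 5, 1, 1])

step1 = days.diff()      # NaN, 1, 4, -4, 0   (adjacent differences)
step2 = step1.fillna(0)  # 0, 1, 4, -4, 0     (first element has no predecessor)
step3 = step2.gt(2)      # False, False, True, False, False
step4 = step3.cumsum()   # 0, 0, 1, 1, 1      (counter bumps at each True)

print(step4.tolist())
```

The first two rows end up in block 0 and the last three in block 1, which is why the expected output labels them G4, G4, G5, G5, G5. Note that a large drop (here -4) does not start a new block, because only differences greater than 2 count.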
You can combine your conditions in groupby and use ngroup.
# Number the groups from 1 and prefix each number with 'G'
df['Group'] = 'G' + (df.groupby([df['Description'].ne(df['Description'].shift()).cumsum(),
                                 df['Vendor'].ne(df['Vendor'].shift()).cumsum(),
                                 df['Days'] <= 2]).ngroup() + 1).astype(str)
# same description : df['Description'].ne(df['Description'].shift()).cumsum()
# same vendor : df['Vendor'].ne(df['Vendor'].shift()).cumsum()
# Days<=2 : df['Days']<=2
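The ne(shift()).cumsum() pattern in the first two keys assigns a new id to every run of consecutive equal values. A minimal sketch on made-up descriptions (the Series `desc` is hypothetical):

```python
import pandas as pd

desc = pd.Series(["Rent", "Rent", "Water", "Rent", "Rep", "Rep"])

# True wherever the value differs from the previous row
# (the first row is compared with NaN, so it is always True).
changed = desc.ne(desc.shift())
# Running count of changes = id of the current run of equal values.
run_id = changed.cumsum()

print(run_id.tolist())  # [1, 1, 2, 3, 4, 4]
```

Unlike grouping on the column itself, this treats the second "Rent" run as separate from the first, so only adjacent rows with the same description share an id.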
Output:
    Invoice No  Date        Description    Vendor  Days  Group
0   123456      2020-01-01  Rent Payment   A       0     G1
1   123457      2020-02-01  Rent Payment   A       1     G1
2   123458      2020-04-01  Rent Payment   A       2     G1
3   123459      2020-06-01  Water Payment  A       2     G2
4   123460      2020-09-02  Rent Payment   A       34    G3
5   123461      2020-09-02  Rep Payment    A       0     G4
6   123462      2020-10-02  Rep Payment    A       1     G4
7   123463      2020-11-02  Rep Payment    A       2     G4
8   123464      2020-02-20  Water Payment  A       11    G5