简体   繁体   English

基于优先级过滤 pandas DataFrame 的高效/Pythonic方法

[英]Efficient/Pythonic way to Filter pandas DataFrame based on priority

I have below dataframe.我有以下 dataframe。

+-----------+----------+-----+
| InvoiceNo | ItemCode | Qty |
+-----------+----------+-----+
|  Inv-001  |     A    |  2  |
+-----------+----------+-----+
|  Inv-001  |     B    |  3  |
+-----------+----------+-----+
|  Inv-001  |     C    |  1  |
+-----------+----------+-----+
|  Inv-002  |     B    |  3  |
+-----------+----------+-----+
|  Inv-002  |     D    |  4  |
+-----------+----------+-----+
|  Inv-003  |     C    |  3  |
+-----------+----------+-----+
|  Inv-003  |     D    |  9  |
+-----------+----------+-----+
|  Inv-004  |     D    |  5  |
+-----------+----------+-----+
|  Inv-004  |     E    |  8  |
+-----------+----------+-----+
|  Inv-005  |     X    |  2  |
+-----------+----------+-----+

my task is to create an additional column Type based on the priority of the item occurrence.我的任务是根据项目出现的优先级创建一个额外的列Type

eg: ItemCode A has 1st Priority.例如: ItemCode A具有1st优先级。 then B has 2nd priority and C has 3rd priority.然后B具有2nd优先级, C具有3rd优先级。 rest of the items has least priority and classified has Other . rest 的项目优先级least ,分类有Other

So, if any Invoice contains item A , the type should be Type - A irrespective other items presence.因此,如果任何 Invoice 包含项目A ,则类型应为Type - A而与其他项目无关。 from the balance Invoices if item B contains, then the type should be Type - B .从余额 Invoices 中,如果项目B包含,则类型应为Type - B same for C . C相同。 if none of A, B or C is not present in any invoice, then the type should be Type - Other .如果任何发票中都不存在A, B or C ,则类型应为Type - Other

Below is my desired output.下面是我想要的 output。

+-----------+----------+-----+--------------+
| InvoiceNo | ItemCode | Qty |     Type     |
+-----------+----------+-----+--------------+
|  Inv-001  |     A    |  2  |   Type - A   |
+-----------+----------+-----+--------------+
|  Inv-001  |     B    |  3  |   Type - A   |
+-----------+----------+-----+--------------+
|  Inv-001  |     C    |  1  |   Type - A   |
+-----------+----------+-----+--------------+
|  Inv-002  |     B    |  3  |   Type - B   |
+-----------+----------+-----+--------------+
|  Inv-002  |     D    |  4  |   Type - B   |
+-----------+----------+-----+--------------+
|  Inv-003  |     C    |  3  |   Type - C   |
+-----------+----------+-----+--------------+
|  Inv-003  |     D    |  9  |   Type - C   |
+-----------+----------+-----+--------------+
|  Inv-004  |     D    |  5  | Type - Other |
+-----------+----------+-----+--------------+
|  Inv-004  |     E    |  8  | Type - Other |
+-----------+----------+-----+--------------+
|  Inv-005  |     X    |  2  | Type - Other |
+-----------+----------+-----+--------------+

Below is my code and it works.下面是我的代码,它可以工作。 But, it is more cumbersome and not pythonic at all.但是,它更麻烦而且根本不是pythonic

# load Dataframe
df = pd.read_excel() 

# filter data containing `A`
mask_A = (df['ItemCode'] == 'A').groupby(df['InvoiceNo']).transform('any')
df_A = df[mask_A]
df_A['Type'] = 'Type - A'

# form the rest of the data, filter data containing `B`
df = df[~mask_A]
mask_B = (df['ItemCode'] == 'B').groupby(df['InvoiceNo']).transform('any')
df_B = df[mask_B]
df_B['Type'] = 'Type - B'

# form the rest of the data, filter data containing `c`
df = df[~mask_B]
mask_C = (df['ItemCode'] == 'C').groupby(df['InvoiceNo']).transform('any')
df_C = df[mask_C]
df_C['Type'] = 'Type - C'

# form the rest of the data, filter data doesnt contain `A, B or C`
df_Other = df[~mask_C]
df_Other['Type'] = 'Type - Other'

# Conctenate all the dataframes
df = pd.concat([df_A, df_B, df_C, df_Other], axis=0,sort=False)

Now, what is the most efficient and pythonic way to do this?现在,最efficient和最pythonic的方法是什么?

I feel like we can do Categorical then transform我觉得我们可以做Categorical然后transform

df['Type']=pd.Categorical(df.ItemCode,['A','B','C'],ordered=True)

df['Type']='Type_'+df.groupby('InvoiceNo')['Type'].transform('min').fillna('other')

Update更新

df['Type']=pd.Categorical(df.ItemCode,['A','B','C'],ordered=True)
df=df.sort_values('Type')
df['Type']='Type_'+df.groupby('InvoiceNo')['Type'].transform('first').fillna('other')
df=df.sort_index()

df
Out[32]: 
     InvoiceNo ItemCode  Qty        Type
0    Inv-001          A    2      Type_A
1    Inv-001          B    3      Type_A
2    Inv-001          C    1      Type_A
3    Inv-002          B    3      Type_B
4    Inv-002          D    4      Type_B
5    Inv-003          C    3      Type_C
6    Inv-003          D    9      Type_C
7    Inv-004          D    5  Type_other
8    Inv-004          E    8  Type_other
9    Inv-005          X    2  Type_other

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM