Pandas DataFrame: efficiently factorize row-wise set difference

Question

I am looking for a more Pythonic, efficient (and shorter) way to factorize (i.e. enumerate the unique instances of) the row-wise set difference in an aggregated dataframe. The tables below illustrate what should not be too complex a dataframe manipulation:

product	product_group	location
10	1	RoW
11	1	US
12	1	CA
13	2	RoW
14	2	JP
15	2	US
16	3	FR
17	3	BE
18	4	RoW
19	4	US
20	4	CA

should be transformed, using list_locations = ['US', 'CA', 'JP', 'BE', 'FR'], into a table of the form

product_group	location_list	rest_of_world_location_list	rest_of_world_index
1	RoW, US, CA	JP, BE, FR	RoW_1
2	RoW, JP, US	CA, BE, FR	RoW_2
4	RoW, US, CA	JP, BE, FR	RoW_1

such that every product group has a column rest_of_world_location_list that lists all the items from list_locations that are not part of that product group. Product groups without a RoW entry (group 3 here) are dropped. Column rest_of_world_index is simply the factorization of rest_of_world_location_list.
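For reference, the two operations can be sketched on plain Python data. This is only an illustration of the intended result, not a proposed solution: the `groups` dict is a hypothetical stand-in for the grouped dataframe, and the difference is built in `list_locations` order so that equal differences produce equal strings.

```python
import pandas as pd

list_locations = ['US', 'CA', 'JP', 'BE', 'FR']

# Hypothetical stand-in for the product groups that contain 'RoW'
groups = {
    1: ['RoW', 'US', 'CA'],
    2: ['RoW', 'JP', 'US'],
    4: ['RoW', 'US', 'CA'],
}

# Row-wise set difference: locations not covered by the group,
# kept in list_locations order so identical sets give identical strings
rest = {
    group: ', '.join(loc for loc in list_locations if loc not in members)
    for group, members in groups.items()
}

# Factorization: one integer code per distinct value, in order of first appearance
codes, uniques = pd.factorize(pd.Series(rest))
```

Here `codes` holds one integer per product group and `uniques` holds the distinct difference strings; groups 1 and 4 share a code because they share the same difference.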

MWE input data:

import pandas as pd

df = pd.DataFrame(
    {
        "product": [10,11,12,13,14,15,16,17,18,19,20],
        "product_group": [1,1,1,2,2,2,3,3,4,4,4],
        "location": ['RoW', 'US', 'CA', 'RoW', 'JP', 'US', 'FR', 'BE', 'RoW', 'US', 'CA']
    }
)
list_locations = ['US', 'CA', 'JP', 'BE', 'FR']

My attempt (works, but is likely too complicated):

activity_with_rest_of_world_location = pd.DataFrame(df[df['location'] == 'RoW']['product_group'])
activity_with_rest_of_world_location['index'] = activity_with_rest_of_world_location.index

df_rest_of_world = df[df['product_group'].isin(activity_with_rest_of_world_location['product_group'])]
df_rest_of_world = df_rest_of_world.drop(activity_with_rest_of_world_location['index'])


df_rest_of_world_agg = pd.DataFrame(
    data = df_rest_of_world.groupby('product_group')['location'].apply(tuple))
df_rest_of_world_agg.reset_index(inplace = True)
df_rest_of_world_agg = df_rest_of_world_agg.merge(
    right = activity_with_rest_of_world_location,
    how = 'left',
    on = 'product_group'
)

df_rest_of_world_agg.set_index(keys = 'index', inplace = True)

df_rest_of_world_agg['location_rest_of_world'] = df_rest_of_world_agg.apply(
    lambda row: tuple(set(list_locations) - set(row['location'])),
    axis = 1
)

df_rest_of_world_agg = df_rest_of_world_agg.dropna(subset ='location_rest_of_world')

df_rest_of_world_agg['location'] = pd.factorize(df_rest_of_world_agg['location_rest_of_world'])[0]
df_rest_of_world_agg['location'] = 'RoW_' + df_rest_of_world_agg['location'].astype(str)

Answer 1

IIUC, you can use a single pipeline with 3 steps:

world = set(list_locations)

(df.groupby('product_group', as_index=False)
   # aggregate the locations as a string, and the rest of the world from a set difference
   .agg(**{'location_list': ('location', ', '.join),
           'rest_of_world_location_list': ('location', lambda l: ', '.join(sorted(world.difference(l))))
          })
   # filter the rows without RoW
   .loc[lambda d: d['location_list'].str.contains('RoW')]
   # add category
   .assign(rest_of_world_index=lambda d: 'RoW_'+d['rest_of_world_location_list'].astype('category').cat.codes.add(1).astype(str)
          )
)

Output:

   product_group location_list rest_of_world_location_list rest_of_world_index
0              1   RoW, US, CA                  BE, FR, JP               RoW_2
1              2   RoW, JP, US                  BE, CA, FR               RoW_1
3              4   RoW, US, CA                  BE, FR, JP               RoW_2
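Note that the category codes follow the sorted order of the categories, which is why group 1 ends up as RoW_2 rather than RoW_1. If the numbering should instead follow the order of first appearance, as in the table in the question, a possible variant is to swap in `pd.factorize` for the last step. This is a sketch under that assumption, not a change to the approach above; the name `out` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "product": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
        "product_group": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
        "location": ['RoW', 'US', 'CA', 'RoW', 'JP', 'US', 'FR', 'BE', 'RoW', 'US', 'CA'],
    }
)
list_locations = ['US', 'CA', 'JP', 'BE', 'FR']
world = set(list_locations)

out = (
    df.groupby('product_group', as_index=False)
      # join the locations, and join the sorted set difference against `world`
      .agg(location_list=('location', ', '.join),
           rest_of_world_location_list=(
               'location', lambda l: ', '.join(sorted(world.difference(l)))))
      # keep only the groups that contain RoW
      .loc[lambda d: d['location_list'].str.contains('RoW')]
      # pd.factorize numbers distinct values in order of first appearance
      .assign(rest_of_world_index=lambda d: [
          f'RoW_{code + 1}'
          for code in pd.factorize(d['rest_of_world_location_list'])[0]
      ])
)
```

With the sample data, groups 1 and 4 share the same difference and therefore the same index, and group 1 is numbered first because it appears first.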

Answer 2

Solution with sets: create a set per group, filter out the rows without RoW, get the differences and join them, and finally use factorize with frozensets:

list_io_countries = ['US', 'CA', 'JP', 'BE', 'FR']
s = set(list_io_countries) 

df = df.groupby(df['product_group'])['location'].agg(set).reset_index(name='location_list')

df = (df[['RoW' in x for x in df['location_list']]]
       .assign(rest_of_world_location_list = lambda x: x['location_list'].apply(lambda x: ','.join(s - x)),
               rest_of_world_index = lambda x: pd.factorize(x['location_list'].apply(lambda x: frozenset(x - set(['RoW']))))[0] + 1,
               location_list = lambda x: x['location_list'].agg(','.join)
               )
       .assign(rest_of_world_index = lambda x: 'RoW_' + x['rest_of_world_index'].astype(str)))

print(df)
   product_group location_list rest_of_world_location_list rest_of_world_index
0              1     RoW,CA,US                    JP,BE,FR               RoW_1
1              2     RoW,JP,US                    CA,BE,FR               RoW_2
3              4     RoW,CA,US                    JP,BE,FR               RoW_1
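The reason for factorizing frozensets rather than the joined strings can be shown in isolation: a frozenset is hashable and compares equal regardless of element order, while a joined string depends on the order in which the elements were written. A minimal sketch with made-up values (the variable names are only for illustration):

```python
import pandas as pd

# Same members in rows 0 and 2, but listed in a different order
as_lists = [['US', 'CA'], ['JP', 'US'], ['CA', 'US']]

# Factorizing frozensets ignores element order
frozen = pd.Series([frozenset(members) for members in as_lists])
codes_frozen, _ = pd.factorize(frozen)

# Factorizing joined strings treats 'US, CA' and 'CA, US' as different values
joined = pd.Series([', '.join(members) for members in as_lists])
codes_joined, _ = pd.factorize(joined)
```

Here rows 0 and 2 get the same code in `codes_frozen` but different codes in `codes_joined`. Joining a set directly also gives no guaranteed element order, so sorting before joining is one way to keep the displayed strings stable.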
