How do I group by the cells of a dataframe that contain Python lists?

Question

I am using Python and Pandas and trying to sum, efficiently, a dataframe's values across different rows based on lists of IDs rather than unique IDs.

df:

Name  -  ID  - Related IDs          - Value
z     -  123 - ['aaa','bbb','ccc']  -  10
w     -  456 - ['aaa']              -  20
y     -  789 - ['ggg','hhh','jjj']  -  50
x     -  012 - ['jjj','hhh']        -  60
r     -  015 - ['hhh']              -  15

It would be possible to explode each row by the elements of its list, but that may duplicate the values to be summed, and it might not be an efficient solution in terms of time and resources.
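To make the duplication concern concrete, here is a minimal sketch that rebuilds the sample data and explodes it. It assumes pandas 0.25 or later (for `DataFrame.explode`) and stores the IDs as strings so the leading zeros survive; both choices are assumptions, not part of the original post.

```python
import pandas as pd

# Sample data reconstructed from the table above (IDs as strings: an assumption).
df = pd.DataFrame({
    "Name": ["z", "w", "y", "x", "r"],
    "ID": ["123", "456", "789", "012", "015"],
    "Related IDs": [["aaa", "bbb", "ccc"], ["aaa"],
                    ["ggg", "hhh", "jjj"], ["jjj", "hhh"], ["hhh"]],
    "Value": [10, 20, 50, 60, 15],
})

# One row per (original row, related ID); Value is repeated on every copy.
exploded = df.explode("Related IDs")

# Summing per related ID counts a row once for each ID it lists,
# and still does not merge IDs that are only linked through another row.
per_id = exploded.groupby("Related IDs")["Value"].sum()
print(per_id)
```

In this sketch `hhh` sums to 125 but `jjj` only to 110, so a plain per-ID sum after exploding does not by itself give one total per group of linked rows.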

```python
f = {'Sum': 'sum'}

df = df.groupby(['Related IDs']).agg(f) 
# This does not work: it matches on the whole list
# rather than on the individual elements

df = df.reset_index()
```
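As a rough illustration of why whole-list matching is not enough: lists are unhashable, so grouping on the column directly typically raises a `TypeError`. Converting each list to a tuple (a workaround assumed here, not taken from the post) makes the grouping run, but rows are then grouped only when their lists are identical.

```python
import pandas as pd

df = pd.DataFrame({
    "Related IDs": [["aaa", "bbb", "ccc"], ["aaa"],
                    ["ggg", "hhh", "jjj"], ["jjj", "hhh"], ["hhh"]],
    "Value": [10, 20, 50, 60, 15],
})

# Tuples are hashable, so they can serve as group keys.
keys = df["Related IDs"].apply(tuple)

# Every tuple in the sample is distinct, so each row forms its own group.
whole_list_sum = df.groupby(keys)["Value"].transform("sum")
print(whole_list_sum.tolist())  # [10, 20, 50, 60, 15]
```

No two rows share an identical list, so the "sum" is just each row's own value, which is not the desired result.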

What I expect is a new column "Sum" that sums the "Value" of all rows that have one or more Related IDs in common, as follows:

Name  -  ID  - Related IDs          - Value - Sum
z     -  123 - ['aaa','bbb','ccc']  -  10  -  30
w     -  456 - ['aaa']              -  20  -  30
y     -  789 - ['ggg','hhh','jjj']  -  50  -  125
x     -  012 - ['jjj','hhh']        -  60  -  125
r     -  015 - ['hhh']              -  15  -  125

Answer 1

Use networkx with connected_components:

import ast
from itertools import combinations, chain

import networkx as nx

# if the lists are stored as strings, convert them to real lists first
df['Related IDs'] = df['Related IDs'].apply(ast.literal_eval)

# create edges: an edge connects exactly two nodes, so take all pairs from each list
L2_nested = [list(combinations(l,2)) for l in df['Related IDs']]
L2 = list(chain.from_iterable(L2_nested))
print (L2)
[('aaa', 'bbb'), ('aaa', 'ccc'), ('bbb', 'ccc'), 
 ('ggg', 'hhh'), ('ggg', 'jjj'), ('hhh', 'jjj'), ('jjj', 'hhh')]

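# create the graph; add every ID as a node first, so that IDs appearing
# only in single-element lists (which yield no edges) are still included
G = nx.Graph()
G.add_nodes_from(chain.from_iterable(df['Related IDs']))
G.add_edges_from(L2)
connected_comp = nx.connected_components(G)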

# map each node to the index of its connected component
node2id = {x: cid for cid, c in enumerate(connected_comp) for x in c}

# assign each row to a group via the first ID in its list
groups = [node2id.get(x[0]) for x in df['Related IDs']]
print (groups)
[0, 0, 1, 1, 1]

# write each group's sum to a new column
df['Sum'] = df.groupby(groups)['Value'].transform('sum')
print (df)
  Name   ID      Related IDs  Value  Sum
0    z  123  [aaa, bbb, ccc]     10   30
1    w  456            [aaa]     20   30
2    y  789  [ggg, hhh, jjj]     50  125
3    x   12       [jjj, hhh]     60  125
4    r   15            [hhh]     15  125
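If networkx is not available, the same grouping can be sketched with a small union-find (disjoint-set) structure. This is a swapped-in technique rather than the answer's method; the function name `component_sums` and its parameters are hypothetical, and the sketch assumes every list in the column is non-empty.

```python
import pandas as pd

def component_sums(df, ids_col="Related IDs", value_col="Value"):
    """Per row, sum value_col over all rows linked by shared IDs (hypothetical helper)."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link every ID in a row to that row's first ID (assumes non-empty lists).
    for ids in df[ids_col]:
        for other in ids:
            union(ids[0], other)

    # Label each row by the root of its first ID, then sum within each label.
    roots = pd.Series([find(ids[0]) for ids in df[ids_col]], index=df.index)
    return df.groupby(roots)[value_col].transform("sum")

df = pd.DataFrame({
    "Name": ["z", "w", "y", "x", "r"],
    "Related IDs": [["aaa", "bbb", "ccc"], ["aaa"],
                    ["ggg", "hhh", "jjj"], ["jjj", "hhh"], ["hhh"]],
    "Value": [10, 20, 50, 60, 15],
})
df["Sum"] = component_sums(df)
print(df["Sum"].tolist())  # [30, 30, 125, 125, 125]
```

Linking each ID only to the first ID of its row needs one union per list element instead of one edge per pair, which may matter when the lists are long.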

Solution 1
1 Accepted 2019-07-06 12:05:57
