How to group by DataFrame cells that contain lists in Python?
I am using Python and Pandas and am trying to sum, efficiently, a dataframe's values across rows based on lists of IDs rather than on unique IDs.
df:
Name - ID - Related IDs - Value
z - 123 - ['aaa','bbb','ccc'] - 10
w - 456 - ['aaa'] - 20
y - 789 - ['ggg','hhh','jjj'] - 50
x - 012 - ['jjj','hhh'] - 60
r - 015 - ['hhh'] - 15
It would be possible to explode each row by the elements of its list, but that may duplicate the values to be summed, and it might not be an efficient solution in terms of time and resources.
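To make the duplication concern concrete, here is a minimal sketch using the sample data above. The dataframe construction is an assumption about how the data is stored (real Python lists in the "Related IDs" column), and the ID column is left out for brevity.

```python
import pandas as pd

# Sample data from the question; "Related IDs" is assumed to hold real lists.
df = pd.DataFrame({
    "Name": ["z", "w", "y", "x", "r"],
    "Related IDs": [["aaa", "bbb", "ccc"], ["aaa"],
                    ["ggg", "hhh", "jjj"], ["jjj", "hhh"], ["hhh"]],
    "Value": [10, 20, 50, 60, 15],
})

# explode() yields one row per (original row, related ID) pair,
# so each Value is repeated once for every ID in its list.
exploded = df.explode("Related IDs")

# Summing per ID only groups rows that share that one ID,
# not rows that are linked indirectly through other IDs.
per_id = exploded.groupby("Related IDs")["Value"].sum()

print(len(df), len(exploded))
print(per_id)
```

The per-ID sums differ between IDs of the same row (for example "hhh" and "jjj"), which is why exploding alone does not give one total per group of related rows.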
```python
f = {'Sum': 'sum'}
df = df.groupby(['Related IDs']).agg(f)
# this does not work: each list is treated as one whole value
# rather than being matched element by element
df = df.reset_index()
```
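As a rough illustration of why grouping on the list column cannot give the desired result: lists are unhashable, so they have to be converted (for example to tuples) before they can serve as group keys, and even then only identical lists end up in the same group. The sketch below assumes the same sample data and real lists in the column.

```python
import pandas as pd

df = pd.DataFrame({
    "Related IDs": [["aaa", "bbb", "ccc"], ["aaa"],
                    ["ggg", "hhh", "jjj"], ["jjj", "hhh"], ["hhh"]],
    "Value": [10, 20, 50, 60, 15],
})

# Tuples are hashable, so they can be used as group keys.
key = df["Related IDs"].apply(tuple)

# Every tuple here is distinct, so each row forms its own group
# and no values are combined.
sums = df.groupby(key)["Value"].sum()
print(len(sums))
```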
What I expect is a new column "Sum" that adds up the "Value" of all rows that have one or more Related IDs in common, as follows:
Name - ID - Related IDs - Value - Sum
z - 123 - ['aaa','bbb','ccc'] - 10 - 30
w - 456 - ['aaa'] - 20 - 30
y - 789 - ['ggg','hhh','jjj'] - 50 - 125
x - 012 - ['jjj','hhh'] - 60 - 125
r - 015 - ['hhh'] - 15 - 125
Use networkx with connected_components:
import ast
from itertools import combinations, chain
import networkx as nx
# only needed if the column holds string representations of lists; skip if it already holds lists
df['Related IDs'] = df['Related IDs'].apply(ast.literal_eval)
# create edges: an edge connects exactly two nodes, so take every pair within each list
L2_nested = [list(combinations(l, 2)) for l in df['Related IDs']]
L2 = list(chain.from_iterable(L2_nested))
print(L2)
# [('aaa', 'bbb'), ('aaa', 'ccc'), ('bbb', 'ccc'),
#  ('ggg', 'hhh'), ('ggg', 'jjj'), ('hhh', 'jjj'), ('jjj', 'hhh')]
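# create the graph; add every ID as a node first, so that an ID that
# appears only in a single-element list (and so has no edge) is not left out
G = nx.Graph()
G.add_nodes_from(chain.from_iterable(df['Related IDs']))
G.add_edges_from(L2)
connected_comp = nx.connected_components(G)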
# map each node to the index of its connected component
node2id = {x: cid for cid, c in enumerate(connected_comp) for x in c}
# assign each row to a group via the first ID in its list
# (all IDs in one row belong to the same component)
groups = [node2id.get(x[0]) for x in df['Related IDs']]
print(groups)
# [0, 0, 1, 1, 1]
# sum Value within each group and write the result to a new column
df['Sum'] = df.groupby(groups)['Value'].transform('sum')
print(df)
#   Name   ID      Related IDs  Value  Sum
# 0    z  123  [aaa, bbb, ccc]     10   30
# 1    w  456            [aaa]     20   30
# 2    y  789  [ggg, hhh, jjj]     50  125
# 3    x   12       [jjj, hhh]     60  125
# 4    r   15            [hhh]     15  125
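If adding a graph library is not an option, the same grouping can be sketched with a small union-find (disjoint-set) structure in plain Python. This is a swapped-in technique, not the method used in the answer above. The function name `group_by_shared_ids` is hypothetical, and the sketch assumes every list is non-empty.

```python
def group_by_shared_ids(id_lists):
    """Return one group label per list; lists that share any ID,
    directly or through other lists, get the same label.
    Assumes every list is non-empty."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link every ID in a list to the first ID of that list.
    # union(ids[0], ids[0]) also registers single-element lists.
    for ids in id_lists:
        for item in ids:
            union(ids[0], item)

    # Number the roots in order of first appearance.
    labels = {}
    return [labels.setdefault(find(ids[0]), len(labels)) for ids in id_lists]


related = [["aaa", "bbb", "ccc"], ["aaa"],
           ["ggg", "hhh", "jjj"], ["jjj", "hhh"], ["hhh"]]
values = [10, 20, 50, 60, 15]

groups = group_by_shared_ids(related)

# Total per group, then broadcast back to each row.
totals = {}
for g, v in zip(groups, values):
    totals[g] = totals.get(g, 0) + v
sums = [totals[g] for g in groups]

print(groups)
print(sums)
```

The resulting `groups` list can be passed to `df.groupby(groups)['Value'].transform('sum')` in the same way as in the networkx version.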