How to group by dataframe's cells which contain lists in Python?

I am using Python and pandas, and I am trying to efficiently sum a dataframe's values across rows, grouping by lists of IDs rather than by a single unique ID.

df:

Name  -  ID  - Related IDs          - Value
z     -  123 - ['aaa','bbb','ccc']  -  10
w     -  456 - ['aaa']              -  20
y     -  789 - ['ggg','hhh','jjj']  -  50
x     -  012 - ['jjj','hhh']        -  60
r     -  015 - ['hhh']              -  15

It would be possible to explode each row by the elements of its list, but that may duplicate the values being summed, and it may not be an efficient solution in terms of time and resources.
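To illustrate the duplication concern, here is a minimal sketch that rebuilds the sample data (IDs are assumed to be stored as strings) and explodes the list column with `DataFrame.explode`. The names `exploded` and `per_id` are only illustrative.

```python
import pandas as pd

# Sample data from the question; IDs are assumed to be strings here.
df = pd.DataFrame({
    "Name": ["z", "w", "y", "x", "r"],
    "ID": ["123", "456", "789", "012", "015"],
    "Related IDs": [
        ["aaa", "bbb", "ccc"],
        ["aaa"],
        ["ggg", "hhh", "jjj"],
        ["jjj", "hhh"],
        ["hhh"],
    ],
    "Value": [10, 20, 50, 60, 15],
})

# One output row per (row, related ID) pair, so each Value is
# repeated once per element of its list.
exploded = df.explode("Related IDs")

# Totals per individual related ID, not per group of linked rows.
per_id = exploded.groupby("Related IDs")["Value"].sum()

print(len(df), len(exploded))
print(df["Value"].sum(), exploded["Value"].sum())
print(per_id)
```

The exploded frame has more rows than the original and its `Value` column no longer sums to the original total. The per-ID totals also differ from the desired per-group totals: `'jjj'` only sees the two rows that list it, even though those rows are linked to a third row through `'hhh'`.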

```python
f = {'Sum': 'sum'}

df = df.groupby(['Related IDs']).agg(f)
# This does not work: it matches on the whole list
# rather than on the individual elements


df = df.reset_index()
```

What I expect is a new column "Sum" that sums the "Value" of all rows that have one or more Related IDs in common, as follows:

Name  -  ID  - Related IDs          - Value - Sum
z     -  123 - ['aaa','bbb','ccc']  -  10  -  30
w     -  456 - ['aaa']              -  20  -  30
y     -  789 - ['ggg','hhh','jjj']  -  50  -  125
x     -  012 - ['jjj','hhh']        -  60  -  125
r     -  015 - ['hhh']              -  15  -  125

Use networkx with connected_components:

```python
import ast
from itertools import combinations, chain

import networkx as nx

# if necessary, convert string representations of lists to real lists
df['Related IDs'] = df['Related IDs'].apply(ast.literal_eval)

# create edges (an edge can only connect two nodes)
L2_nested = [list(combinations(l, 2)) for l in df['Related IDs']]
L2 = list(chain.from_iterable(L2_nested))
print(L2)
# [('aaa', 'bbb'), ('aaa', 'ccc'), ('bbb', 'ccc'),
#  ('ggg', 'hhh'), ('ggg', 'jjj'), ('hhh', 'jjj'), ('jjj', 'hhh')]

# create the graph from the edges
G = nx.Graph()
G.add_edges_from(L2)
connected_comp = nx.connected_components(G)

# map each node to the id of its connected component
node2id = {x: cid for cid, c in enumerate(connected_comp) for x in c}

# assign each row to a group via the first value of its Related IDs list
groups = [node2id.get(x[0]) for x in df['Related IDs']]
print(groups)
# [0, 0, 1, 1, 1]

# add the per-group sum as a new column
df['Sum'] = df.groupby(groups)['Value'].transform('sum')
print(df)
#   Name   ID      Related IDs  Value  Sum
# 0    z  123  [aaa, bbb, ccc]     10   30
# 1    w  456            [aaa]     20   30
# 2    y  789  [ggg, hhh, jjj]     50  125
# 3    x   12       [jjj, hhh]     60  125
# 4    r   15            [hhh]     15  125
```
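If networkx is not available, the same grouping can be sketched with a small union-find (disjoint-set) structure. This is a swapped-in alternative, not the method used in the answer above. The helpers `find` and `union` and the `parent` dict are hypothetical names, and the sketch assumes every list in `Related IDs` is non-empty. Because `find` registers every ID it sees, an ID that occurs only in single-element lists still ends up in a group of its own. In the graph-based version such an ID never appears in an edge, so `node2id.get` would return `None` for it.

```python
import pandas as pd

# Sample data from the question; IDs are assumed to be strings here.
df = pd.DataFrame({
    "Name": ["z", "w", "y", "x", "r"],
    "ID": ["123", "456", "789", "012", "015"],
    "Related IDs": [
        ["aaa", "bbb", "ccc"],
        ["aaa"],
        ["ggg", "hhh", "jjj"],
        ["jjj", "hhh"],
        ["hhh"],
    ],
    "Value": [10, 20, 50, 60, 15],
})

parent = {}  # maps each related ID to its parent in the disjoint-set forest


def find(x):
    """Return the representative of x's set, registering x if it is new."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x


def union(a, b):
    """Merge the sets that contain a and b."""
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra


# All IDs listed in the same row belong to the same set.
for ids in df["Related IDs"]:
    first = ids[0]
    find(first)  # also registers IDs from single-element lists
    for other in ids[1:]:
        union(first, other)

# Label each row with the representative of its first related ID.
groups = pd.Series([find(ids[0]) for ids in df["Related IDs"]], index=df.index)

df["Sum"] = df.groupby(groups)["Value"].transform("sum")
print(df)
```

Linking each row's first ID to the others needs only `len(ids) - 1` unions per row, rather than every pairwise combination, which may matter when the lists are long.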
