Group by a column with semicolon-separated values in a pandas DataFrame
Imagine a pandas DataFrame given by
import pandas as pd
df = pd.DataFrame({
    'id': range(5),
    'vmns': ('nan', 'a', 'a;b', 'c', 'b')
})
which gives the following table:
id vmns
0 0 nan
1 1 a
2 2 a;b
3 3 c
4 4 b
Now I wish to group by the vmns column, but note the semicolon-separated value of vmns for id = 2. This should be interpreted as either a or b, so that a link between these two values is created and they end up in the same group. Hence the resulting table should look like this:
id vmns group
0 0 nan 0
1 1 a 1
2 2 a;b 1
3 3 c 2
4 4 b 1
Any suggestions?
I went ahead and created a solution using networkx. It goes as follows (extended example):
import networkx as nx
import pandas as pd
df = pd.DataFrame({
    'id': range(7),
    'vmns': ('nan', 'a', 'a;b;c', 'c', 'b', 'd;e', 'e')
})
which yields
id vmns
0 0 nan
1 1 a
2 2 a;b;c
3 3 c
4 4 b
5 5 d;e
6 6 e
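Then I create nodes from the rows without a semicolon and edges from the rows with a semicolon. Each semicolon row links its first value to every other value in that row, which is enough to place them all in one connected component. Rows with nan are ignored.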
# determine which rows contain nodes and which contain edges
edges_mask = df['vmns'].str.contains(';')
nodes_mask = ~df['vmns'].str.contains(';') & (df['vmns'] != 'nan')
def create_pairwise_edges(lst):
    return [(lst[0], value) for value in lst[1:]]
# create the graph with nodes and edges
G = nx.Graph()
G.add_nodes_from(df.loc[nodes_mask, 'vmns'])
G.add_edges_from([st for row in df.loc[edges_mask, 'vmns'].str.split(';').map(create_pairwise_edges) for st in row])
# determine the connected components and write to df
Gcc = nx.connected_components(G)
new_map = dict()
for g, ids in enumerate(Gcc):
    for id in ids:
        new_map[id] = g
new_map['nan'] = 'nan'
df['combined_group'] = df['vmns'].str.split(';').map(lambda x: new_map[x[0]])
The result is
id vmns combined_group
0 0 nan nan
1 1 a 0
2 2 a;b;c 0
3 3 c 0
4 4 b 0
5 5 d;e 1
6 6 e 1
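If pulling in networkx just for this feels heavy, the same grouping can probably be had with a small union-find (disjoint-set) structure. The following is only a sketch under my own assumptions, not part of the solution above: the function name group_linked_values and its sep and missing parameters are made up for illustration, and it numbers groups by order of first appearance, which happens to match the table above but need not match networkx in general.

```python
def group_linked_values(values, sep=';', missing='nan'):
    """Return one group label per entry; entries sharing a token share a group.

    Hypothetical helper: the name and parameters are illustrative assumptions.
    """
    parent = {}

    def find(x):
        # Register unseen tokens as their own root, then walk up to the root,
        # shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link every token in a row to the first token of that row.
    for value in values:
        if value == missing:
            continue
        tokens = value.split(sep)
        find(tokens[0])
        for token in tokens[1:]:
            union(tokens[0], token)

    # Number the groups in order of first appearance; keep the missing marker.
    group_of_root = {}
    groups = []
    for value in values:
        if value == missing:
            groups.append(missing)
            continue
        root = find(value.split(sep)[0])
        groups.append(group_of_root.setdefault(root, len(group_of_root)))
    return groups


vmns = ['nan', 'a', 'a;b;c', 'c', 'b', 'd;e', 'e']
print(group_linked_values(vmns))
```

With the DataFrame from above, something like df['combined_group'] = group_linked_values(df['vmns'].tolist()) should write the labels back as a column.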