How to find the distinct number of elements in data frame column, in which strings contain multiple elements separated by a semi-colon
Group by a column with semi colon separated values in pandas data frame
Imagine a pandas DataFrame given by
import pandas as pd

df = pd.DataFrame({
    'id': range(5),
    'vmns': ('nan', 'a', 'a;b', 'c', 'b')
})
which gives the following table
   id vmns
0   0  nan
1   1    a
2   2  a;b
3   3    c
4   4    b
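Now I would like to group by the vmns column, but note the semicolon-separated value in vmns for id = 2. It should be interpreted as a or b, so that a link is created between these two values. The resulting table should therefore look like this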
   id vmns  group
0   0  nan      0
1   1    a      1
2   2  a;b      1
3   3    c      2
4   4    b      1
Any suggestions?
I went ahead and created a solution using networkx. It is as follows (with an extended example)
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    'id': range(7),
    'vmns': ('nan', 'a', 'a;b;c', 'c', 'b', 'd;e', 'e')
})
which produces
   id   vmns
0   0    nan
1   1      a
2   2  a;b;c
3   3      c
4   4      b
5   5    d;e
6   6      e
I then create nodes from the rows without a semicolon and edges from the rows with a semicolon. Rows with nan are ignored.
# determine which rows contain nodes and which contain edges
edges_mask = df['vmns'].str.contains(';')
nodes_mask = ~edges_mask & (df['vmns'] != 'nan')

def create_pairwise_edges(lst):
    # link the first value of a row to each of the remaining values
    return [(lst[0], value) for value in lst[1:]]

# create the graph with nodes and edges
G = nx.Graph()
G.add_nodes_from(df.loc[nodes_mask, 'vmns'])
G.add_edges_from([
    edge
    for row in df.loc[edges_mask, 'vmns'].str.split(';').map(create_pairwise_edges)
    for edge in row
])

# determine the connected components and write to df
Gcc = nx.connected_components(G)
new_map = dict()
for g, ids in enumerate(Gcc):
    for node in ids:
        new_map[node] = g
# rows holding the 'nan' placeholder keep it as their group label
new_map['nan'] = 'nan'
# every value of a row is in the same component, so the first value is enough
df['combined_group'] = df['vmns'].str.split(';').map(lambda x: new_map[x[0]])
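A note on create_pairwise_edges: it links only the first value of a row to each of the remaining values, rather than generating every possible pair. For connected components that should be enough, since all values of a row still end up in the same component. A minimal sketch in plain Python comparing the two (the sample row 'a;b;c' and the names star_edges and all_pairs are only for illustration):

```python
from itertools import combinations

def create_pairwise_edges(lst):
    # Link the first value to every other value: n - 1 edges for n values.
    return [(lst[0], value) for value in lst[1:]]

values = 'a;b;c'.split(';')

# Edges as built in the solution above.
star_edges = create_pairwise_edges(values)

# Every possible pair, for comparison: n * (n - 1) / 2 edges.
all_pairs = list(combinations(values, 2))

print(star_edges)  # [('a', 'b'), ('a', 'c')]
print(all_pairs)   # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Both edge lists connect a, b and c into a single component; the first one just needs fewer edges for long rows.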
The resulting DataFrame is
   id   vmns combined_group
0   0    nan            nan
1   1      a              0
2   2  a;b;c              0
3   3      c              0
4   4      b              0
5   5    d;e              1
6   6      e              1
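If adding networkx as a dependency is not desirable, the same grouping could probably be done with a small union-find (disjoint-set) structure in plain Python. This is not the approach used above, only a sketch of an alternative. The function name group_labels and its sep and missing parameters are made up for this example, and it assumes missing values are stored as the string 'nan', as in the example data.

```python
def group_labels(values, sep=';', missing='nan'):
    """Return one group label per entry; entries sharing any token share a label."""
    parent = {}

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every token of a row to the row's first token.
    for value in values:
        if value == missing:
            continue
        tokens = value.split(sep)
        find(tokens[0])  # registers single-token rows as well
        for token in tokens[1:]:
            union(tokens[0], token)

    # Number the groups in order of first appearance.
    label_of_root = {}
    labels = []
    for value in values:
        if value == missing:
            labels.append(missing)
            continue
        root = find(value.split(sep)[0])
        labels.append(label_of_root.setdefault(root, len(label_of_root)))
    return labels

vmns = ['nan', 'a', 'a;b;c', 'c', 'b', 'd;e', 'e']
labels = group_labels(vmns)
print(labels)  # ['nan', 0, 0, 0, 0, 1, 1]
```

With the DataFrame from above, the labels could then be assigned with df['combined_group'] = group_labels(df['vmns']). Here the groups are numbered by first appearance in the column, which happens to match the table above for this example, but the numbering need not always agree with the order in which networkx returns its components.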