Group by a column with semicolon-separated values in a pandas DataFrame

Imagine a pandas DataFrame given by

import pandas as pd

df = pd.DataFrame({
    'id': range(5),
    'vmns': ('nan', 'a', 'a;b', 'c', 'b')
})

which gives the following table:

   id vmns
0   0  nan
1   1    a
2   2  a;b
3   3    c
4   4    b

Now I wish to group by the vmns column, but note the semicolon-separated value of vmns for id = 2. It should be interpreted as either a or b, so that a link is created between these two values. Hence the resulting table should look like this:

   id vmns  group
0   0  nan      0
1   1    a      1
2   2  a;b      1
3   3    c      2
4   4    b      1
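
To make the intended linking concrete, here is a minimal sketch in plain Python (no pandas needed). The helper name `tokens` and the variable names are only illustrative. It parses each cell into a set of values, treating 'nan' as empty, and lists the pairs of rows that share a value directly:

```python
# the vmns column from the example above, as plain strings
rows = ['nan', 'a', 'a;b', 'c', 'b']

def tokens(cell):
    # hypothetical helper: 'nan' carries no values, anything else is split on ';'
    return set() if cell == 'nan' else set(cell.split(';'))

parsed = [tokens(cell) for cell in rows]

# pairs of row positions whose value sets overlap directly
direct_links = [
    (i, j)
    for i in range(len(parsed))
    for j in range(i + 1, len(parsed))
    if parsed[i] & parsed[j]
]
print(direct_links)  # [(1, 2), (2, 4)]
```

Rows 1 and 4 share no value directly; they only end up in the same group through row 2, so the grouping has to follow such chains rather than compare rows pairwise.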

Any suggestions?

I went ahead and created a solution using networkx. It goes as follows (extended example):

import networkx as nx
import pandas as pd

df = pd.DataFrame({
    'id': range(7),
    'vmns': ('nan', 'a', 'a;b;c', 'c', 'b', 'd;e', 'e')
})

which yields

   id   vmns
0   0    nan
1   1      a
2   2  a;b;c
3   3      c
4   4      b
5   5    d;e
6   6      e

Then I create nodes from rows without a semicolon and edges from rows with a semicolon. Rows with nan are ignored.

# determine which rows contain single values (nodes) and which contain
# semicolon-separated values (edges)
edges_mask = df['vmns'].str.contains(';')
nodes_mask = ~edges_mask & (df['vmns'] != 'nan')

def create_pairwise_edges(lst):
    # link the first value to every other value in the row; this is enough
    # to put all of the row's values in the same connected component
    return [(lst[0], value) for value in lst[1:]]

# create the graph with nodes and edges
G = nx.Graph()
G.add_nodes_from(df.loc[nodes_mask, 'vmns'])
G.add_edges_from([st for row in df.loc[edges_mask, 'vmns'].str.split(';').map(create_pairwise_edges) for st in row])

# determine the connected components and write to df
Gcc = nx.connected_components(G)
new_map = dict()
for g, values in enumerate(Gcc):
    for value in values:
        new_map[value] = g
new_map['nan'] = 'nan'  # rows with 'nan' keep 'nan' as their group
df['combined_group'] = df['vmns'].str.split(';').map(lambda x: new_map[x[0]])

The result is

   id   vmns combined_group
0   0    nan            nan
1   1      a              0
2   2  a;b;c              0
3   3      c              0
4   4      b              0
5   5    d;e              1
6   6      e              1
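
If adding networkx as a dependency is not desirable, the same connected-components idea can be sketched with a small union-find (disjoint-set) structure in plain Python. This is an alternative technique, not the approach used above, and the function name `group_labels` and its parameters `sep` and `missing` are made up for illustration. Groups are numbered in order of first appearance, and 'nan' cells are passed through unchanged, as in the table above.

```python
def group_labels(cells, sep=';', missing='nan'):
    """Return one group label per cell; values sharing a cell are linked."""
    parent = {}

    def find(x):
        # register unseen values as their own root, then walk up to the root
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # first pass: link every value in a cell to the cell's first value
    for cell in cells:
        if cell == missing:
            continue
        first, *rest = cell.split(sep)
        find(first)
        for other in rest:
            union(first, other)

    # second pass: number the roots in order of first appearance
    labels, result = {}, []
    for cell in cells:
        if cell == missing:
            result.append(missing)
            continue
        root = find(cell.split(sep)[0])
        result.append(labels.setdefault(root, len(labels)))
    return result

cells = ['nan', 'a', 'a;b;c', 'c', 'b', 'd;e', 'e']
labels = group_labels(cells)
print(labels)  # ['nan', 0, 0, 0, 0, 1, 1]
```

With the DataFrame from above, something like df['combined_group'] = group_labels(df['vmns']) should give the same column, assuming vmns holds plain strings as in the example.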
