Pandas dataframe groupby text value that occurs in two columns

My dataframe looks like this:

     v1           v2        distance
0   be          belong      0.666667
4   increase    decrease    0.666667
9   analyze     assay       0.666667
11  bespeak     circulate   0.769231
21  induce      generate    0.800000
24  decrease    delay       0.750000
26  cause       trip        0.666667
27  isolate     distinguish 0.750000
28  give        infect      0.666667
29  result      prove       0.800000
31  describe    explain     0.714286
33  report      circulate   0.666667
36  affect      expose      0.666667
40  explain     intercede   0.705882
41  suppress    restrict    0.833333

Here v1 and v2 are verbs, and distance is their similarity. I want to create clusters of similar words, based on their appearance in the dataframe.

For example, the word circulate appears to be similar to both bespeak and report. So I would like to have a cluster of these 3 words. Groupby doesn't help since they are string values. Can someone help?

This seems like a graph problem.

You could try to use networkx:

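import networkx as nx

# Build an undirected graph with one edge per (v1, v2) row
G = nx.from_pandas_edgelist(df, 'v1', 'v2')

# connected_components yields sets lazily; materialize them as a list
clusters = list(nx.connected_components(G))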

Output:

[{'be', 'belong'}, {'delay', 'increase', 'decrease'}, {'analyze', 'assay'},
 {'report', 'bespeak', 'circulate'}, {'induce', 'generate'}, {'trip', 'cause'},
 {'distinguish', 'isolate'}, {'infect', 'give'}, {'prove', 'result'},
 {'intercede', 'describe', 'explain'}, {'affect', 'expose'}, {'restrict', 'suppress'}]
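To illustrate what a connected-components grouping does, here is a minimal dependency-free sketch using a union-find structure. It is not how networkx implements connected_components; the names pairs, find and cluster_pairs are made up for this example, and the pairs are a subset of the rows in the question.

```python
# Word pairs taken from the v1/v2 columns of the question's dataframe (a subset).
pairs = [
    ("be", "belong"),
    ("increase", "decrease"),
    ("bespeak", "circulate"),
    ("decrease", "delay"),
    ("describe", "explain"),
    ("report", "circulate"),
    ("explain", "intercede"),
]


def find(parent, word):
    """Return the representative of word's group, shortening the path on the way."""
    parent.setdefault(word, word)
    while parent[word] != word:
        parent[word] = parent[parent[word]]  # path halving
        word = parent[word]
    return word


def cluster_pairs(pairs):
    """Group words that are linked directly or through a chain of pairs."""
    parent = {}
    for a, b in pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups
    groups = {}
    for word in parent:
        groups.setdefault(find(parent, word), set()).add(word)
    return list(groups.values())


clusters = cluster_pairs(pairs)
```

With the real dataframe, the same function could presumably be called as cluster_pairs(zip(df['v1'], df['v2'])).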

As a graph:

[image: the resulting graph]

A small function to plot the graph in Jupyter (to_agraph requires pygraphviz):

def nxplot(G):
    from networkx.drawing.nx_agraph import to_agraph
    A = to_agraph(G)
    A.layout('dot')
    A.draw('/tmp/graph.png')
    from IPython.display import Image
    return Image(filename='/tmp/graph.png')

The following line would select only the rows containing the string target_string:

# Compare only the two word columns against target_string
rows = df[(df[['v1', 'v2']] == target_string).any(axis=1)]

Concatenate them and find the unique elements:

import pandas as pd

# Stack both word columns into one Series, then deduplicate
cluster = pd.concat([rows['v1'], rows['v2']]).unique()

If you want to find clusters with all the words, repeat this for all the unique elements. An inefficient example:

clusters = []  # one array of words per target_string
for target_string in df.v1.unique():
    # Rows where either word column equals target_string
    rows = df[(df[['v1', 'v2']] == target_string).any(axis=1)]
    clusters.append(pd.concat([rows['v1'], rows['v2']]).unique())
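One thing to keep in mind with this row-lookup approach: each group contains only the words paired directly with the target word, so words linked through an intermediate word end up in separate groups. The sketch below shows this in plain Python on a few pairs from the question; the names pairs and neighbours are made up for the illustration.

```python
from collections import defaultdict

# A few (v1, v2) pairs from the question's dataframe.
pairs = [
    ("increase", "decrease"),
    ("decrease", "delay"),
    ("bespeak", "circulate"),
    ("report", "circulate"),
]

# For every word, collect the word itself plus everything it is paired with directly.
neighbours = defaultdict(set)
for a, b in pairs:
    neighbours[a].update((a, b))
    neighbours[b].update((a, b))

# "decrease" is paired with both other words, so its group has all three ...
group_decrease = neighbours["decrease"]
# ... but "increase" never shares a row with "delay", so its group misses it.
group_increase = neighbours["increase"]
```

If the goal is a single cluster for increase, decrease and delay, the groups would still need to be merged whenever they share a word, which is what a connected-components computation does.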
