Pandas dataframe groupby text value that occurs in two columns
My dataframe looks like this:
v1 v2 distance
0 be belong 0.666667
4 increase decrease 0.666667
9 analyze assay 0.666667
11 bespeak circulate 0.769231
21 induce generate 0.800000
24 decrease delay 0.750000
26 cause trip 0.666667
27 isolate distinguish 0.750000
28 give infect 0.666667
29 result prove 0.800000
31 describe explain 0.714286
33 report circulate 0.666667
36 affect expose 0.666667
40 explain intercede 0.705882
41 suppress restrict 0.833333
With v1 and v2 being verbs and distance their similarity, I want to create clusters of similar words based on their appearance in the dataframe. For example, the word circulate appears to be similar to both bespeak and report, so I would like to have a cluster of these 3 words. Groupby doesn't help since they are string values. Can someone help?
This seems like a graph problem. You could try to use networkx:
import networkx as nx
G = nx.from_pandas_edgelist(df, 'v1', 'v2')
clusters = list(nx.connected_components(G))
output:
[{'be', 'belong'}, {'delay', 'increase', 'decrease'}, {'analyze', 'assay'},
{'report', 'bespeak', 'circulate'}, {'induce', 'generate'}, {'trip', 'cause'},
{'distinguish', 'isolate'}, {'infect', 'give'}, {'prove', 'result'},
{'intercede', 'describe', 'explain'}, {'affect', 'expose'}, {'restrict', 'suppress'}]
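If you'd rather avoid the networkx dependency, the same connected-components grouping can be done with a small union-find structure over the word pairs. A minimal sketch, using a hypothetical sample of the edge list from the question (not the full dataframe):

```python
from collections import defaultdict

# Hypothetical sample of (v1, v2) pairs from the dataframe
pairs = [('be', 'belong'), ('bespeak', 'circulate'), ('report', 'circulate'),
         ('increase', 'decrease'), ('decrease', 'delay')]

parent = {}

def find(x):
    # Find the root representative of x, with path halving.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    # Merge the components containing a and b.
    parent[find(a)] = find(b)

for a, b in pairs:
    union(a, b)

# Group words by their root representative
groups = defaultdict(set)
for word in parent:
    groups[find(word)].add(word)
clusters = list(groups.values())
# e.g. 'bespeak', 'report' and 'circulate' end up in one cluster
```

This gives the same transitive grouping as connected components: any two words linked through a chain of rows land in the same set.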
As a graph:
Small function to plot the graph in Jupyter:
def nxplot(G):
    from networkx.drawing.nx_agraph import to_agraph
    from IPython.display import Image
    A = to_agraph(G)
    A.layout('dot')
    A.draw('/tmp/graph.png')
    return Image(filename='/tmp/graph.png')
The following line would select only the rows containing the string target_string:
rows = df[df.applymap(lambda element: element == target_string).any(axis = 1)]
Concatenate them and find the unique elements:
cluster = pd.concat([rows['v1'], rows['v2']]).unique()
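To make the select-then-concatenate step concrete, here is a small self-contained sketch with made-up rows from the question's data. It uses a boolean mask on the two verb columns instead of applymap, which avoids comparing the numeric distance column against a string:

```python
import pandas as pd

# Hypothetical sample of the dataframe from the question
df = pd.DataFrame({'v1': ['bespeak', 'report', 'be'],
                   'v2': ['circulate', 'circulate', 'belong'],
                   'distance': [0.769231, 0.666667, 0.666667]})

target_string = 'circulate'

# Keep rows where the target appears in either verb column
rows = df[(df['v1'] == target_string) | (df['v2'] == target_string)]

# Stack both columns and deduplicate to get the cluster members
cluster = pd.concat([rows['v1'], rows['v2']]).unique()
# cluster contains 'bespeak', 'report' and 'circulate'
```

Note this only collects words one hop away from target_string; the graph approach above also merges words linked through longer chains.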
If you want to find clusters with all the words, repeat this for all the unique elements. An inefficient example:
clusters = []
for target_string in df.v1.unique():
    rows = df[df.applymap(lambda element: element == target_string).any(axis=1)]
    clusters.append(pd.concat([rows['v1'], rows['v2']]).unique())