
List of lists of words clustering

Let's say I have a list of lists of words, for example:

[['apple','banana'],
 ['apple','orange'],
 ['banana','orange'],
 ['rice','potatoes','orange'],
 ['potatoes','rice']]

The set is much bigger. I want to cluster the words so that words which usually occur together end up in the same cluster. So in this case the clusters would be ['apple', 'banana', 'orange'] and ['rice', 'potatoes'].
What is the best approach to achieve this kind of clustering?

I think it is more natural to think of the problem as a graph.

You can assume, for example, that apple is node 0 and banana is node 1, and that the first list indicates there is an edge between 0 and 1.

So first convert the labels to numbers:

from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
le.fit(['apple','banana','orange','rice','potatoes']) #learn an integer label for each word

Now:

l=[['apple','banana'],
 ['apple','orange'],
 ['banana','orange'],
 ['rice','potatoes'], #I dropped 'orange' here because an edge connects exactly 2 nodes; you can expand the triple into 3 pairs instead (see the sketch below)
 ['potatoes','rice']]
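
For completeness, a minimal sketch of that expansion using only itertools.combinations from the standard library (the variable name pairs is mine, not from the original answer):

from itertools import combinations

raw = [['apple','banana'],
       ['apple','orange'],
       ['banana','orange'],
       ['rice','potatoes','orange'],
       ['potatoes','rice']]

# expand every inner list into all 2-word combinations, so the triple becomes 3 pairs
pairs = [list(p) for row in raw for p in combinations(row, 2)]
print(pairs)
# [['apple', 'banana'], ['apple', 'orange'], ['banana', 'orange'],
#  ['rice', 'potatoes'], ['rice', 'orange'], ['potatoes', 'orange'], ['potatoes', 'rice']]

The result can be used in place of the hand-edited l above.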

Convert the labels to numbers:

edges=[le.transform(x) for x in l]

>>> edges

[array([0, 1], dtype=int64),
array([0, 2], dtype=int64),
array([1, 2], dtype=int64),
array([4, 3], dtype=int64),
array([3, 4], dtype=int64)]

Now, start to build the graph and add the edges:

import networkx as nx #graphs package
G=nx.Graph() #create the graph and add edges
for e in edges:
    G.add_edge(e[0],e[1])
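
As a side note, the loop is equivalent to a single call to Graph.add_edges_from, which accepts any iterable of (u, v) pairs:

G = nx.Graph()
G.add_edges_from(edges) #builds the same graph as the loop above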

Now you can analyze the connected components of the graph. (connected_component_subgraphs was removed in NetworkX 2.4; nx.connected_components yields the node set of each component directly.)

components = nx.connected_components(G) #yields the node set of each connected subgraph
comp_dict = {idx: sorted(comp) for idx, comp in enumerate(components)}
print(comp_dict)

Output:

{0: [0, 1, 2], 1: [3, 4]}

or

print([le.inverse_transform(v) for v in comp_dict.values()])

Output:

[array(['apple', 'banana', 'orange']), array(['potatoes', 'rice'])]

and those are your 2 clusters.

It will be more meaningful to look for frequent itemsets instead.

If you cluster such short sets of words, everything will be connected at usually just a few levels: nothing in common, one element in common, two elements in common. That is too coarse to be usable for clustering. You'll get everything or nothing connected, and the results may be highly sensitive to data changes and ordering.

So abandon the paradigm of partitioning the data and look for frequent combinations instead.
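
To make that concrete, here is a minimal sketch of frequent-itemset mining with the mlxtend package (my choice of library, not the answerer's; the min_support threshold of 0.4 is just an illustrative value):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [['apple','banana'],
                ['apple','orange'],
                ['banana','orange'],
                ['rice','potatoes','orange'],
                ['potatoes','rice']]

# one-hot encode each word list into a boolean row of a DataFrame
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# keep itemsets that occur in at least 40% of the lists
frequent = apriori(df, min_support=0.4, use_colnames=True)
print(frequent)

With this toy data, the pair {'potatoes', 'rice'} occurs in 2 of the 5 lists and passes the 0.4 support threshold, while pairs such as {'apple', 'banana'} occur only once and are dropped.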

So, after lots of Googling around, I figured out that I, in fact, can't use clustering techniques, because I lack feature variables on which to cluster the words. If I build a table noting how often each word occurs together with every other word (in fact the Cartesian product), it is really an adjacency matrix, and clustering doesn't work well on it.
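
For illustration, such a co-occurrence table can be built like this (a sketch of my own; the variable names are not from the original post), and the result is exactly the adjacency matrix of the word graph from the first answer:

from itertools import combinations
from collections import Counter
import pandas as pd

data = [['apple','banana'],
        ['apple','orange'],
        ['banana','orange'],
        ['rice','potatoes','orange'],
        ['potatoes','rice']]

# count how often every pair of words appears in the same list
pair_counts = Counter()
for row in data:
    for a, b in combinations(sorted(row), 2):
        pair_counts[(a, b)] += 1

# arrange the counts into a symmetric word-by-word matrix
words = sorted({w for row in data for w in row})
cooc = pd.DataFrame(0, index=words, columns=words)
for (a, b), n in pair_counts.items():
    cooc.loc[a, b] = cooc.loc[b, a] = n

print(cooc)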

So, the solution I was looking for is graph community detection. I used the igraph library (via the python-igraph wrapper) to find the clusters, and it runs very well and fast.
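
A minimal sketch of that approach with python-igraph (the multilevel/Louvain algorithm is my own choice here; the post does not say which community detection method was used):

from itertools import combinations
import igraph as ig

data = [['apple','banana'],
        ['apple','orange'],
        ['banana','orange'],
        ['rice','potatoes','orange'],
        ['potatoes','rice']]

# build an undirected graph on the word names, expanding each list into pairwise edges
word_pairs = [pair for row in data for pair in combinations(row, 2)]
g = ig.Graph.TupleList(word_pairs, directed=False)
g.simplify()  # merge duplicate edges such as the repeated rice-potatoes pair

# Louvain-style community detection; each community is a list of vertex indices
for community in g.community_multilevel():
    print([g.vs[i]['name'] for i in community])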

