
Recognition of elements in a list - machine learning


['england', 'london']
['disney', 'france']


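For example, if we look at this small example, we see that 'disney' and 'france' often appear in a list together. As the number of documents/lists grows, we might find that 'england' always appears in a list with 'london'.

I have looked at things such as tuples, but those come up more in language processing and large text documents. The question here is how to identify these pairs/triples/n-tuples of items that occur together.

EDIT: This is not just about looking at pairs. What if you have three strings that repeatedly appear together?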


import numpy as np

# I'll use numbers instead of words,
# but same exact concept
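# Rows after the first are missing from the source; these are reconstructed
# from the scores matrix printed below, which determines them up to order.
points_list = [[0,1,2],
               [0,2],
               [0,3],
               [1,4]]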

scores = np.zeros((5,5))

for points in points_list:
    temp = np.array(points)[:, np.newaxis]       
    scores[temp, points] += 1


>>> scores
array([[ 3.,  1.,  2.,  1.,  0.],
       [ 1.,  2.,  1.,  0.,  1.],
       [ 2.,  1.,  2.,  0.,  0.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.]])

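The diagonal elements tell you how many times each variable shows up in total, and the off-diagonal elements tell you how many times two variables show up together. This matrix is obviously symmetric, so it should be possible to optimize on that.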
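One way to use that symmetry, as a rough sketch rather than the answerer's own code: build the co-occurrence matrix from a binary incidence matrix and read each pair once from the strict upper triangle. The toy `points_list` and the names `incidence` and `pair_counts` are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical toy data: each inner list holds item ids in the range 0..4.
points_list = [[0, 1, 2], [0, 2], [0, 3], [1, 4]]
n_items = 5

# Binary incidence matrix: one row per list, one column per item.
incidence = np.zeros((len(points_list), n_items), dtype=int)
for row, points in enumerate(points_list):
    incidence[row, points] = 1

# Co-occurrence counts; the diagonal holds each item's total occurrences.
scores = incidence.T @ incidence

# scores is symmetric, so the strict upper triangle lists every pair once.
rows, cols = np.triu_indices(n_items, k=1)
pair_counts = {
    (int(i), int(j)): int(scores[i, j])
    for i, j in zip(rows, cols)
    if scores[i, j] > 0
}
print(pair_counts)
```

Storing only the upper triangle roughly halves the memory needed for the pair counts, which may matter once the number of distinct items grows.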




import numpy as np

# I'll use numbers instead of words,
# but same exact concept
points_list = [[0,1,2],

scores = np.zeros((5,5))

for points in points_list:
    temp = np.array(points)[:, np.newaxis]       
    scores[temp, points] += 1

diag = scores.diagonal()

key_col = (scores/diag)[:, 0]
key_col[0] = 0

points_2 = np.where(key_col > 0.5)[0]      # suppose 0.5 is the threshold
temp_2 = np.array(points_2)[:, np.newaxis] # step 1: we identified the points that are
                                           # close to 0
inner_scores = scores[temp_2, points_2]    # step 2: we are checking whether those points
                                           # are close to each other


>>> scores
array([[ 5.,  3.,  4.,  2.,  1.], # We identified that 1 and 2 are close to 0
       [ 3.,  4.,  3.,  1.,  2.],
       [ 4.,  3.,  4.,  1.,  1.],
       [ 2.,  1.,  1.,  2.,  0.],
       [ 1.,  2.,  1.,  0.,  2.]])
>>> inner_scores
array([[ 4.,  3.],                # Testing to see whether 1 and 2 are close together
       [ 3.,  4.]])               # Since they are, we can conclude that (0,1,2) occur 
                                  # together




Step 1. Compute the partition id of each pair and write the pair directly to the corresponding partition file (partition_id = (md5 of pair) % partition_count; the partitioning is the key point).

Step 2. Merge the counts from the output of step 1 using a dict (this is done in memory, one partition at a time; if you are short of memory, choose a larger partition_count).
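The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the partition id is taken to be the md5 hash modulo `partition_count`, items are assumed to contain no tabs or newlines, and the helper names `partition_id` and `count_pairs_partitioned` and the file layout are hypothetical.

```python
import hashlib
import os
import tempfile
from collections import Counter
from itertools import combinations


def partition_id(pair, partition_count):
    # Assumption: partition id = md5(pair) modulo partition_count.
    digest = hashlib.md5("\t".join(pair).encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count


def count_pairs_partitioned(lists, partition_count, workdir):
    paths = [os.path.join(workdir, f"part-{i}.txt") for i in range(partition_count)]
    files = [open(path, "w", encoding="utf-8") for path in paths]
    try:
        # Step 1: route every pair to its partition file.
        for items in lists:
            for pair in combinations(sorted(set(items)), 2):
                files[partition_id(pair, partition_count)].write("\t".join(pair) + "\n")
    finally:
        for f in files:
            f.close()
    # Step 2: count one partition at a time, so only one dict is in memory.
    for path in paths:
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts[tuple(line.rstrip("\n").split("\t"))] += 1
        yield counts


lists = [
    ["disney", "england", "france"],
    ["disney", "japan"],
    ["england", "london"],
    ["disney", "france"],
]

with tempfile.TemporaryDirectory() as workdir:
    totals = Counter()
    # Merged here only for display; on real data each partition's counts
    # would be written out separately instead of held together in memory.
    for counts in count_pairs_partitioned(lists, 4, workdir):
        totals.update(counts)

print(totals.most_common(1))
```

Because identical pairs always hash to the same partition, each partition's counts are already final and never need to be combined with another partition's.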

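You should build an inverted index and then check which terms occur together more often (i.e. take two terms and count the number of times they occur together). That is, for each term you record the lists in which it occurs. This can be quite efficient if the size of the dictionary (the terms in the lists) is not too large.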
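A minimal sketch of that idea, assuming the lists fit in memory; the names `index` and `cooccurrence` are made up for this example.

```python
from collections import defaultdict

lists = [
    ["disney", "england", "france"],
    ["disney", "japan"],
    ["england", "london"],
    ["disney", "france"],
]

# Inverted index: term -> set of ids of the lists that contain it.
index = defaultdict(set)
for list_id, terms in enumerate(lists):
    for term in terms:
        index[term].add(list_id)


def cooccurrence(*terms):
    # Number of lists that contain every one of the given terms.
    postings = (index.get(term, set()) for term in terms)
    return len(set.intersection(*postings))


print(cooccurrence("disney", "france"))
print(cooccurrence("disney", "england", "france"))
```

Since the lookup intersects posting sets, the same function handles pairs, triples, or larger groups without changing the index.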

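I would use some kind of set logic. If things got large, I would push them over to n.

100k lists really isn't all that big, especially if they are just single words. (I just spent a week working on a 6 GB version of this problem with over 800 million entries.) I would be more concerned about how many LISTS you have.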


import itertools

a = ['disney','england','france']
b = ['disney','japan']
c = ['england', 'london']
d = ['disney', 'france']

g = [a, b, c, d]

# Try every combination of 2..len(g) lists and intersect them.
for i in range(2, len(g) + 1):
    for ii in itertools.combinations(g, i):
        # Elements shared by all lists in this combination.
        ixx = set.intersection(*map(set, ii))
        # Only report groups of at least two elements that occur together.
        if len(ixx) > 1:
            print(ixx)


Obviously this does not keep track of frequencies, but that is easy once you have reduced the lists to the combinations that repeat.
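One possible way to add the frequencies, sketched with a `Counter` over every pair and triple inside each list instead of intersecting lists; the function name `count_itemsets` and the `max_size` parameter are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

lists = [
    ["disney", "england", "france"],
    ["disney", "japan"],
    ["england", "london"],
    ["disney", "france"],
]


def count_itemsets(lists, max_size):
    # Count every group of 2..max_size items that appears within one list.
    counts = Counter()
    for items in lists:
        unique = sorted(set(items))  # sorted so each group has one canonical key
        for size in range(2, max_size + 1):
            counts.update(combinations(unique, size))
    return counts


counts = count_itemsets(lists, max_size=3)
# Keep only the groups that show up in more than one list.
repeated = {combo: n for combo, n in counts.items() if n > 1}
print(repeated)
```

The number of combinations grows quickly with list length and `max_size`, so this is probably only practical for short lists or small group sizes.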



from itertools import combinations, chain
from collections import Counter
from functools import partial

data = iter([
    'disney england france',
    'disney japan',
    'england london',
    'disney france',
])  # NOTE could have been: data = open('file.txt')

# Everything below is lazy: nothing is read until pair_count() is called.
split_data = map(str.split, data)

pair = partial(combinations, r=2)
observed_pairs = chain.from_iterable(
    map(pair, split_data))
sorted_observed_pairs = map(sorted, observed_pairs)
hashable_sorted_observed_pairs = map(tuple, sorted_observed_pairs)
# Call pair_count() only once: the underlying iterators are consumed by it.
pair_count = lambda: Counter(hashable_sorted_observed_pairs)  # tada!


