
Recognition of elements in a list - machine learning

So I have multiple lists:

['disney','england','france']
['disney','japan']
['england', 'london']
['disney', 'france']

Now I need to determine what in those lists tends to occur together.

For example, looking at this small sample we find that 'disney' and 'france' are often in a list together. As the number of documents/lists increases, we may find that 'england' is always in a list with 'london'.

I've looked at things such as tuples, but that comes up more in the context of natural language and large text documents. The question here is how to identify these pairs/triples/n attributes that occur together.

EDIT: This is not just about looking at pairs. What if three strings came up together repeatedly?

Maybe something like this could be a starting point:

import numpy as np

# I'll use numbers instead of words,
# but same exact concept
points_list = [[0,1,2],
               [0,3],
               [1,4],
               [0,2]]

scores = np.zeros((5,5))

for points in points_list:
    temp = np.array(points)[:, np.newaxis]       
    scores[temp, points] += 1

Result:

>>> scores
array([[ 3.,  1.,  2.,  1.,  0.],
       [ 1.,  2.,  1.,  0.,  1.],
       [ 2.,  1.,  2.,  0.,  0.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.]])

The diagonal elements tell you how many times a variable showed up in total, and the off-diagonal elements tell you how many times two variables showed up together. This matrix is obviously symmetric, so it might be possible to optimize on that.
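For example, one way to pull the strongest pairings back out of scores is to scan the upper triangle and sort by count; a minimal sketch:

# Minimal sketch: list the off-diagonal pairs by how often they co-occurred.
# Only the upper triangle is scanned, so each pair is reported once.
pairs = []
n = scores.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        if scores[i, j] > 0:
            pairs.append(((i, j), scores[i, j]))

pairs.sort(key=lambda item: item[1], reverse=True)
print(pairs[:3])   # e.g. [((0, 2), 2.0), ((0, 1), 1.0), ((0, 3), 1.0)]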

Also, if your sublists are very long (you have a lot of variables), but you don't have too many of them, you might consider using a sparse matrix.
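A minimal sketch of the sparse variant, assuming scipy is available; a dok_matrix stores only the nonzero counts, and n_vars (the number of distinct variables) is assumed to be known up front:

from itertools import product
from scipy.sparse import dok_matrix

n_vars = 5
points_list = [[0, 1, 2], [0, 3], [1, 4], [0, 2]]

scores = dok_matrix((n_vars, n_vars))
for points in points_list:
    for i, j in product(points, repeat=2):  # includes the diagonal, as above
        scores[i, j] += 1

print(scores.toarray())  # same matrix as before, stored sparsely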

Edit:

Here is an idea on how to get triplets and so on.

import numpy as np

# I'll use numbers instead of words,
# but same exact concept
points_list = [[0,1,2],
               [0,3],
               [1,4],
               [0,2],
               [0,1,2,3],
               [0,1,2,4]]

scores = np.zeros((5,5))

for points in points_list:
    temp = np.array(points)[:, np.newaxis]       
    scores[temp, points] += 1


diag = scores.diagonal()

key_col = (scores/diag)[:, 0]
key_col[0] = 0

points_2 = np.where(key_col > 0.5)[0]      # suppose 0.5 is the threshold 
temp_2 = np.array(points_2)[:, np.newaxis] # step 1: we identified the points that are
                                           # close to 0
inner_scores = scores[temp_2, points_2]    # step 2: we are checking whether those
                                           # points are close to each other

Printout:

>>> scores
array([[ 5.,  3.,  4.,  2.,  1.], # We identified that 1 and 2 are close to 0
       [ 3.,  4.,  3.,  1.,  2.],
       [ 4.,  3.,  4.,  1.,  1.],
       [ 2.,  1.,  1.,  2.,  0.],
       [ 1.,  2.,  1.,  0.,  2.]])
>>> inner_scores
array([[ 4.,  3.],                # Testing to see whether 1 and 2 are close together
       [ 3.,  4.]])               # Since they are, we can conclude that (0,1,2) occur 
                                  # together

As I see it now, for this idea to work properly we would need a careful recursive implementation, but I hope this helps.
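A rough, non-recursive sketch of that idea (greedy rather than recursive, using the same 0.5 threshold assumed above), just to illustrate:

def cooccurring_group(scores, seed, threshold=0.5):
    """Greedily grow a group around `seed`: an item joins if it co-occurs
    with every current member often enough, relative to both items' own
    counts. Just an illustration, not a careful implementation."""
    diag = scores.diagonal()
    group = {seed}
    changed = True
    while changed:
        changed = False
        for j in range(len(diag)):
            if j in group:
                continue
            if all(scores[i, j] / diag[j] > threshold and
                   scores[i, j] / diag[i] > threshold
                   for i in group):
                group.add(j)
                changed = True
    return group

print(cooccurring_group(scores, 0))   # {0, 1, 2} for the example above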

You could create a dictionary whose keys are your groupings (e.g. a sorted tuple of the words) and keep a counter of occurrences for each one.
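A minimal sketch of that, keying a Counter by the sorted tuple of each grouping:

from itertools import combinations
from collections import Counter

lists = [
    ['disney', 'england', 'france'],
    ['disney', 'japan'],
    ['england', 'london'],
    ['disney', 'france'],
]

counts = Counter()
for items in lists:
    for size in range(2, len(items) + 1):              # pairs, triples, ...
        counts.update(combinations(sorted(items), size))

print(counts.most_common(3))   # (('disney', 'france'), 2) comes out on top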

When short of memory, disk can help; I usually do it this way:

Step 1: compute the partition id of each pair and write the pair directly to the corresponding partition file (partition_id = (md5 of pair) mod partition_count; the partitioning step is the key point).

Step 2: merge the counts output by step 1 using a dict (this is done in memory, one partition at a time; if you are short of memory, choose a larger partition_count).
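A rough sketch of those two steps, reading the partition id as the md5 hash taken modulo the partition count; the partition count and file names here are made-up placeholders:

import hashlib
from collections import Counter
from itertools import combinations

PARTITION_COUNT = 16

def partition_id(pair):
    # The same pair always hashes to the same partition file.
    digest = hashlib.md5('|'.join(pair).encode('utf-8')).hexdigest()
    return int(digest, 16) % PARTITION_COUNT

def step1(lists):
    # Step 1: stream the lists and append each sorted pair to its partition file.
    files = [open('partition_%d.txt' % p, 'w') for p in range(PARTITION_COUNT)]
    try:
        for items in lists:
            for pair in combinations(sorted(items), 2):
                files[partition_id(pair)].write('|'.join(pair) + '\n')
    finally:
        for f in files:
            f.close()

def step2():
    # Step 2: count one partition at a time, so only that partition is in memory.
    totals = Counter()
    for p in range(PARTITION_COUNT):
        with open('partition_%d.txt' % p) as f:
            totals.update(line.strip() for line in f)
    return totals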

You should build an inverted index (see here) and then check which terms occur together more often (i.e. take two terms and count how many times they occur together). That is, for each term you record the lists in which it occurs. This can be pretty efficient if the size of the dictionary (the set of terms in the lists) isn't too big.
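A minimal sketch of the inverted-index idea, counting a pair's co-occurrences as the size of the intersection of the two posting sets:

from collections import defaultdict
from itertools import combinations

lists = [
    ['disney', 'england', 'france'],
    ['disney', 'japan'],
    ['england', 'london'],
    ['disney', 'france'],
]

# inverted index: term -> set of list ids in which it occurs
index = defaultdict(set)
for list_id, items in enumerate(lists):
    for term in items:
        index[term].add(list_id)

# two terms co-occur as often as their posting sets intersect
for a, b in combinations(sorted(index), 2):
    together = len(index[a] & index[b])
    if together:
        print(a, b, together)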

I would use some kind of set logic. If things got big I'd push them to numpy.

100k lists really aren't that huge, especially if they're just single words. (I just spent a week working on a 6 GB version of this problem with over 800 million entries.) I'd be more concerned about how many lists you have.

This is obviously just a hack, but it's in the direction of how I would solve this problem.

import itertools

a = ['disney','england','france']
b = ['disney','japan']
c = ['england', 'london']
d = ['disney', 'france']

g = [a, b, c, d]

for i in range(2, len(g) + 1):
    for ii in itertools.combinations(g, i):  # combinations of the lists in g, taken 2 up to len(g) at a time
        common = set.intersection(*map(set, ii))
        if len(common) > 1:
            print(common)

Result: {'disney', 'france'}

Obviously this doesn't keep track of the frequencies. But that's easy once you've reduced the lists to the combinations that repeat.
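For example, once a repeated set such as {'disney', 'france'} has been found, its frequency is just the number of original lists that contain it:

found = {'disney', 'france'}
frequency = sum(1 for lst in g if found <= set(lst))
print(frequency)   # 2 in this example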

I've assumed that you're interested in relationships that span lists. If you're not, then I don't understand your question, nor why you have multiple lists.

A simple solution for counting pairs (in pure Python):

from itertools import combinations, chain
from collections import Counter
from functools import partial

data = iter([
    'disney england france',
    'disney japan',
    'england london',
    'disney france',
])  # NOTE could have been: data = open('file.txt')

split_data = map(str.split, data)

pair = partial(combinations, r=2)
observed_pairs = chain.from_iterable(
    map(pair, split_data)
)
sorted_observed_pairs = map(sorted, observed_pairs)
hashable_sorted_observed_pairs = map(tuple, sorted_observed_pairs)
pair_count = lambda: Counter(hashable_sorted_observed_pairs)  # tada!
print(pair_count())
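Since the question also asks about triples, the same pipeline works with a different r; for instance, on the same four lists:

triples = partial(combinations, r=3)
observed_triples = chain.from_iterable(
    map(triples, map(str.split, [
        'disney england france',
        'disney japan',
        'england london',
        'disney france',
    ]))
)
print(Counter(map(tuple, map(sorted, observed_triples))))
# Counter({('disney', 'england', 'france'): 1})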
