Identifying elements in lists - machine learning

Question

So I have multiple lists:

['disney','england','france']
['disney','japan']
['england', 'london']
['disney', 'france']

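Now I need to determine whether the items in these lists tend to occur together.

For example, in this small sample we can see that 'disney' and 'france' often appear in the same list. As the number of documents/lists grows, we might find that 'england' always appears in lists that also contain 'london'.

I have looked at things such as tuples, but those come up more with language and large text documents. The question here is how to identify these pairs/triples/n-item groups that occur together.

Edit: this is not just about pairs. What if you have three strings that repeatedly appear together?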

Answer 1

Maybe something like this could be a starting point:

import numpy as np

# I'll use numbers instead of words,
# but same exact concept
points_list = [[0,1,2],
               [0,3],
               [1,4],
               [0,2]]

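# scores[i, j] counts the lists that contain both i and j
scores = np.zeros((5,5))

for points in points_list:
    # A column of indices indexed against a row of indices
    # selects every (i, j) pair in this list
    temp = np.array(points)[:, np.newaxis]
    scores[temp, points] += 1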

Result:

>>> scores
array([[ 3.,  1.,  2.,  1.,  0.],
       [ 1.,  2.,  1.,  0.,  1.],
       [ 2.,  1.,  2.,  0.,  0.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.]])

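The diagonal elements tell you how many times each variable appears in total, and the off-diagonal elements tell you how many times two variables appear together. The matrix is obviously symmetric, so there is room to optimize on that.

Also, if your sublists are very long (you have a lot of variables) but you do not have too many of them, you may want to consider using a sparse matrix.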
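
As a rough illustration of both remarks (symmetry and sparsity), the counts can be kept in a dictionary that stores only the pairs that actually occur, and only with i <= j. This is a minimal sketch using the standard library rather than a real sparse-matrix type; the names `vocab` and `sparse_scores` are made up for the example, and the word-to-number mapping is an assumption that happens to reproduce the numbering used above.

```python
from collections import Counter
from itertools import combinations_with_replacement

lists = [['disney', 'england', 'france'],
         ['disney', 'japan'],
         ['england', 'london'],
         ['disney', 'france']]

# Assumed mapping: words numbered in alphabetical order
# (disney=0, england=1, france=2, japan=3, london=4)
words = sorted({w for items in lists for w in items})
vocab = {word: i for i, word in enumerate(words)}

# Only pairs that occur are stored, and only with i <= j,
# so each symmetric entry is kept once
sparse_scores = Counter()
for items in lists:
    ids = sorted({vocab[w] for w in items})
    for i, j in combinations_with_replacement(ids, 2):
        sparse_scores[(i, j)] += 1

print(sparse_scores[(0, 0)])  # 3, 'disney' appears in three lists
print(sparse_scores[(0, 2)])  # 2, 'disney' and 'france' appear together twice
```

Pairs that never occur take no memory here, and looking one up simply returns 0.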

Edit:

Here is an idea for how to get triplets and so on.

import numpy as np

# I'll use numbers instead of words,
# but same exact concept
points_list = [[0,1,2],
               [0,3],
               [1,4],
               [0,2],
               [0,1,2,3],
               [0,1,2,4]]

scores = np.zeros((5,5))

for points in points_list:
    temp = np.array(points)[:, np.newaxis]       
    scores[temp, points] += 1


diag = scores.diagonal()

key_col = (scores/diag)[:, 0]
key_col[0] = 0

points_2 = np.where(key_col > 0.5)[0]      # suppose 0.5 is the threshold
temp_2 = np.array(points_2)[:, np.newaxis] # step 1: we identified the points that are
                                           # close to 0
inner_scores = scores[temp_2, points_2]    # step 2: we check whether those points
                                           # are close to each other

Output:

>>> scores
array([[ 5.,  3.,  4.,  2.,  1.], # We identified that 1 and 2 are close to 0
       [ 3.,  4.,  3.,  1.,  2.],
       [ 4.,  3.,  4.,  1.,  1.],
       [ 2.,  1.,  1.,  2.,  0.],
       [ 1.,  2.,  1.,  0.,  2.]])
>>> inner_scores
array([[ 4.,  3.],                # Testing to see whether 1 and 2 are close together
       [ 3.,  4.]])               # Since they are, we can conclude that (0,1,2) occur 
                                  # together

As I see it now, this idea needs a careful recursive implementation to work properly, but I hope this helps.
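
One possible way to go beyond pairs, sketched here as an assumption and not as the recursive implementation the answer has in mind, is a level-wise (Apriori-style) search: a group of size k is only counted if every one of its (k-1)-item subsets was already frequent. The function name `frequent_groups` and the `min_count` threshold are hypothetical, and the items are assumed to be mutually sortable (all numbers or all strings).

```python
from collections import Counter
from itertools import combinations

def frequent_groups(lists, min_count=2):
    """Return {frozenset(group): count} for groups of 2+ items that
    appear together in at least min_count lists."""
    sets = [frozenset(items) for items in lists]

    # Level 1: single items that are frequent on their own
    singles = Counter(item for s in sets for item in s)
    frequent = {frozenset([item]): n
                for item, n in singles.items() if n >= min_count}

    result = {}
    k = 2
    while frequent:
        # Only items that survived the previous level can be in a larger group
        items = sorted({item for group in frequent for item in group})
        counts = Counter()
        for s in sets:
            present = [item for item in items if item in s]
            for combo in combinations(present, k):
                group = frozenset(combo)
                # Prune: every (k-1)-subset must already be frequent
                if all(group - {item} in frequent for item in group):
                    counts[group] += 1
        frequent = {g: n for g, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result

points_list = [[0, 1, 2], [0, 3], [1, 4], [0, 2], [0, 1, 2, 3], [0, 1, 2, 4]]
groups = frequent_groups(points_list, min_count=3)
print(groups[frozenset([0, 1, 2])])  # 3
```

On the six lists above with `min_count=3`, this keeps the pairs (0, 1), (0, 2), (1, 2) and the triple (0, 1, 2), which agrees with the conclusion reached from `inner_scores`.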

Answer 2

You could create a dictionary whose keys are your groupings (for example, as a sorted list of words) and keep a counter of occurrences for each one.
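
A minimal sketch of that idea, assuming `collections.Counter` as the dictionary and sorted tuples as the keys (a tuple is used because a list cannot be a dictionary key). The variable name `group_counts` is made up for the example.

```python
from collections import Counter
from itertools import combinations

lists = [['disney', 'england', 'france'],
         ['disney', 'japan'],
         ['england', 'london'],
         ['disney', 'france']]

group_counts = Counter()
for items in lists:
    unique = sorted(set(items))  # sorting makes the key order-independent
    # Count every group of 2, 3, ..., len(unique) items in this list
    for size in range(2, len(unique) + 1):
        for group in combinations(unique, size):
            group_counts[group] += 1

print(group_counts.most_common(1))  # [(('disney', 'france'), 2)]
```

The number of groups grows exponentially with the length of a list, so for long lists it is probably worth capping `size` at 2 or 3.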

Answer 3

When memory is short, the disk can help. This is what I usually do.

Step 1. Compute the partition id of each pair and write the pair straight to the corresponding partition file (partition_id = (md5 of pair) mod partition_count; the partitioning is the key point).

Step 2. Merge the counts from the step 1 output using a dict (this is done in memory, one partition at a time; if you are short of memory, choose a larger partition_count).
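
The two steps could look roughly like this. It is only a sketch under several assumptions: the file layout (one tab-separated pair per line in a temporary directory), the helper names `partition_id` and `count_pairs_on_disk`, and the rule that items contain no tabs or newlines are all invented for the example. It also collects the per-partition results in one dictionary for convenience, whereas a real low-memory run would write each partition's counts out before moving on.

```python
import hashlib
import os
import tempfile
from collections import Counter
from itertools import combinations

def partition_id(pair, partition_count):
    # md5 of the pair, reduced to a partition number
    digest = hashlib.md5("\t".join(pair).encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count

def count_pairs_on_disk(lists, partition_count=4):
    totals = {}
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"part-{i}.txt")
                 for i in range(partition_count)]
        files = [open(p, "w", encoding="utf-8") for p in paths]
        try:
            # Step 1: route every pair to its partition file
            for items in lists:
                for pair in combinations(sorted(set(items)), 2):
                    out = files[partition_id(pair, partition_count)]
                    out.write("\t".join(pair) + "\n")
        finally:
            for f in files:
                f.close()

        # Step 2: count one partition at a time; a given pair always lands
        # in the same file, so partitions never need to be merged together
        for path in paths:
            counts = Counter()
            with open(path, encoding="utf-8") as f:
                for line in f:
                    counts[tuple(line.rstrip("\n").split("\t"))] += 1
            totals.update(counts)
    return totals

lists = [['disney', 'england', 'france'],
         ['disney', 'japan'],
         ['england', 'london'],
         ['disney', 'france']]
print(count_pairs_on_disk(lists)[('disney', 'france')])  # 2
```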

Answer 4

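You should build an inverted index and then check which terms occur together more often (that is, take two terms and count how many times they appear together). In other words, for each term you record the lists it appears in. This can be very efficient if the size of the dictionary (the terms in the lists) is not too large.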
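
A small sketch of such an index, assuming list positions are used as list ids. The names `index` and `together` are hypothetical.

```python
from collections import defaultdict

lists = [['disney', 'england', 'france'],
         ['disney', 'japan'],
         ['england', 'london'],
         ['disney', 'france']]

# Inverted index: term -> ids of the lists that contain it
index = defaultdict(set)
for list_id, items in enumerate(lists):
    for term in items:
        index[term].add(list_id)

def together(*terms):
    """Number of lists that contain all of the given terms (at least one term)."""
    postings = [index.get(term, set()) for term in terms]
    return len(set.intersection(*postings))

print(together('disney', 'france'))             # 2
print(together('england', 'london'))            # 1
print(together('disney', 'england', 'france'))  # 1
```

Because the lookup intersects any number of sets, the same index answers questions about pairs, triples, or larger groups.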

Answer 5

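I would use some kind of set logic. If things got big, I would push them into n.

100k lists really is not that big, especially if they are just single words. (I just spent a week working on a 6 GB version of this problem, with over 800 million entries.) I would be more concerned with how many LISTS you have.

This is obviously just a hack, but it is the direction I would take to solve the problem.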

import itertools

a = ['disney', 'england', 'france']
b = ['disney', 'japan']
c = ['england', 'london']
d = ['disney', 'france']

g = [a, b, c, d]

# Try every combination of 2, 3, ..., len(g) lists
for i in range(2, len(g) + 1):
    for combo in itertools.combinations(g, i):
        # Elements shared by every list in this combination
        common = set.intersection(*map(set, combo))
        if len(common) > 1:
            print(common)

Result: {'disney', 'france'} (the order of the elements may vary)

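Obviously, this does not track frequency. But that is easy to add once the lists have been reduced to the combinations that repeat.

I am assuming you are interested in relationships that span lists. If you are not, then I do not understand your question, nor why you have multiple lists.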

Answer 6

A simple solution that counts pairs (in pure Python):

from itertools import combinations, chain
from collections import Counter
from functools import partial

data = iter([
    'disney england france',
    'disney japan',
    'england london',
    'disney france',
])  # NOTE could have been: data = open('file.txt')

# Split each line into words (lazily; map returns an iterator in Python 3)
split_data = map(str.split, data)

# All 2-element combinations of the words on each line
pair = partial(combinations, r=2)
observed_pairs = chain.from_iterable(
    map(pair, split_data)
)
# Sort each pair so that (a, b) and (b, a) count as the same key
sorted_observed_pairs = map(sorted, observed_pairs)
hashable_sorted_observed_pairs = map(tuple, sorted_observed_pairs)
pair_count = lambda: Counter(hashable_sorted_observed_pairs)  # tada!
print(pair_count())

6 answers

Answer 1: score 2, accepted, 2014-02-14 02:32:26
Answer 2: score 0, 2014-02-14 01:18:05
Answer 3: score 0, 2014-02-14 01:30:02
Answer 4: score 0, 2014-02-14 02:07:30
Answer 5: score 0, 2014-02-14 02:43:14
Answer 6: score 0, 2014-02-14 19:39:31
