
Recognition of elements in a list - machine learning

So I have multiple lists:

['disney','england','france']
['disney','japan']
['england', 'london']
['disney', 'france']

Now I need to determine what in those lists tends to occur together.

For example, looking at this small sample, we find that 'disney' and 'france' are often in a list together. As the number of documents/lists increases, we may find that 'england' is always in a list with 'london'.

I've looked at things such as tuples, but those come up more in language processing and large text documents. The question here is how to identify these pairs/triples/n-attribute sets that occur together.

EDIT: This is not just about looking at pairs. What if three strings came up together repeatedly?

Maybe something like this could be a starting point:

import numpy as np

# I'll use numbers instead of words,
# but same exact concept
points_list = [[0,1,2],
               [0,3],
               [1,4],
               [0,2]]

scores = np.zeros((5,5))

for points in points_list:
    # column vector of the indices; broadcasting it against the row `points`
    # increments every (i, j) combination from this sub-list in one shot
    temp = np.array(points)[:, np.newaxis]
    scores[temp, points] += 1

Result:

>>> scores
array([[ 3.,  1.,  2.,  1.,  0.],
       [ 1.,  2.,  1.,  0.,  1.],
       [ 2.,  1.,  2.,  0.,  0.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.]])

The diagonal elements tell you how many times a variable showed up in total, and the off-diagonal elements tell you how many times two variables showed up together. This matrix is obviously symmetric, so it might be possible to optimize on that.
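For instance, a minimal sketch of pulling the frequent pairs out of the scores matrix above (the threshold here is just a made-up cutoff):

import numpy as np

# assumes `scores` is the 5x5 co-occurrence matrix computed above
threshold = 2  # hypothetical cutoff for "often appear together"

# only the upper triangle (k=1 skips the diagonal), since the matrix is symmetric
rows, cols = np.triu_indices_from(scores, k=1)
frequent_pairs = [(i, j, scores[i, j])
                  for i, j in zip(rows, cols)
                  if scores[i, j] >= threshold]
print(frequent_pairs)  # [(0, 2, 2.0)] for the small example above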

Also, if your sublists are very long (you have a lot of variables), but you don't have too many of them, you might consider using a sparse matrix.
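A minimal sketch of that, assuming scipy is available (coo_matrix sums duplicate entries, so it reproduces the same counts without ever allocating the dense array):

import numpy as np
from scipy.sparse import coo_matrix

points_list = [[0, 1, 2], [0, 3], [1, 4], [0, 2]]
n = 5  # number of distinct variables

rows, cols = [], []
for points in points_list:
    for i in points:
        for j in points:
            rows.append(i)
            cols.append(j)

# duplicate (row, col) entries are summed when converting to CSR,
# giving the same counts as the dense version above
scores = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n)).tocsr()
print(scores.toarray())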

Edit:

Here is an idea on how to get triplets and so on.

import numpy as np

# I'll use numbers instead of words,
# but same exact concept
points_list = [[0,1,2],
               [0,3],
               [1,4],
               [0,2],
               [0,1,2,3],
               [0,1,2,4]]

scores = np.zeros((5,5))

for points in points_list:
    temp = np.array(points)[:, np.newaxis]       
    scores[temp, points] += 1


diag = scores.diagonal()

key_col = (scores/diag)[:, 0]
key_col[0] = 0

points_2 = np.where(key_col > 0.5)[0]      # suppose 0.5 is the threshold 
temp_2 = np.array(points_2)[:, np.newaxis] # step 1: we identified the points that are
                                           # close to 0
inner_scores = scores[temp_2, points_2]    # step 2: we check whether those points are
                                           # also close to each other

Printout

>>> scores
array([[ 5.,  3.,  4.,  2.,  1.], # We identified that 1 and 2 are close to 0
       [ 3.,  4.,  3.,  1.,  2.],
       [ 4.,  3.,  4.,  1.,  1.],
       [ 2.,  1.,  1.,  2.,  0.],
       [ 1.,  2.,  1.,  0.,  2.]])
>>> inner_scores
array([[ 4.,  3.],                # Testing to see whether 1 and 2 are close together
       [ 3.,  4.]])               # Since they are, we can conclude that (0,1,2) occur 
                                  # together

As I see it now, for this idea to work properly, we need a careful recursive implementation, but I hope this helps.
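A full recursive version is beyond the scope here, but a greedy sketch of the same seed-then-verify idea could look like this (grow_group and the 0.5 threshold are just placeholders):

import numpy as np

def grow_group(scores, seed, threshold=0.5):
    # start from `seed`, take every variable that co-occurs with it more than
    # `threshold` of the time, then keep only those that are also close to
    # every variable already accepted into the group
    diag = scores.diagonal()
    ratios = scores[:, seed] / diag[seed]
    ratios[seed] = 0
    candidates = np.where(ratios > threshold)[0]

    group = [seed]
    for c in candidates:
        if all(scores[c, g] / diag[g] > threshold for g in group):
            group.append(c)
    return sorted(group)

# with the second `scores` matrix above, grow_group(scores, 0) returns [0, 1, 2]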

You could create a dictionary whose keys are your groupings (e.g. a sorted list of words) and keep a counter for each occurrence.
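That might look roughly like this (a small sketch counting every sorted sub-group of size two or more; enumerating all subsets only stays cheap while the individual lists are short):

from collections import Counter
from itertools import combinations

lists = [['disney', 'england', 'france'],
         ['disney', 'japan'],
         ['england', 'london'],
         ['disney', 'france']]

group_counts = Counter()
for words in lists:
    # every sorted sub-group of size >= 2 becomes a dictionary key
    for size in range(2, len(words) + 1):
        for group in combinations(sorted(words), size):
            group_counts[group] += 1

print(group_counts.most_common(3))  # [(('disney', 'france'), 2), ...]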

When you are short of memory, the disk can help; I usually do it this way:

Step 1: compute the partition id of each pair and append the pair directly to the corresponding partition file (partition_id = md5(pair) mod partition_count; the partitioning step is the key point).

Step 2: merge the counts output by step 1 using a dict (this is done in memory, one partition at a time; if you are still short of memory, choose a larger partition_count).
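A rough sketch of that two-step process (the partition count, directory name, and file layout here are all made up for illustration):

import os
from hashlib import md5
from collections import Counter
from itertools import combinations

PARTITION_COUNT = 16           # hypothetical; raise it if memory is still tight
PARTITION_DIR = 'partitions'   # hypothetical scratch directory
os.makedirs(PARTITION_DIR, exist_ok=True)

def partition_id(pair):
    # a stable hash of the pair decides which partition file it lands in
    return int(md5(' '.join(pair).encode()).hexdigest(), 16) % PARTITION_COUNT

# step 1: stream the lists and append every pair to its partition file
lists = [['disney', 'england', 'france'], ['disney', 'japan'],
         ['england', 'london'], ['disney', 'france']]
files = [open(os.path.join(PARTITION_DIR, 'part_%d.txt' % p), 'w')
         for p in range(PARTITION_COUNT)]
for words in lists:
    for pair in combinations(sorted(words), 2):
        files[partition_id(pair)].write(' '.join(pair) + '\n')
for f in files:
    f.close()

# step 2: count one partition at a time; a given pair always hashes to the
# same partition, so its complete count lives in a single file
for p in range(PARTITION_COUNT):
    with open(os.path.join(PARTITION_DIR, 'part_%d.txt' % p)) as f:
        counts = Counter(line.strip() for line in f)
    if counts:
        print(p, counts.most_common(3))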

You could build an inverted index (see here) and then check which terms occur together more often (i.e. taking two terms, count how many times they occur together). That is, you record for each term the lists in which it occurs. This can be pretty efficient if the size of the dictionary (the set of terms in the lists) isn't too big.
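Assuming a modest vocabulary, a compact sketch of that approach (the variable names here are illustrative):

from collections import defaultdict
from itertools import combinations

lists = [['disney', 'england', 'france'], ['disney', 'japan'],
         ['england', 'london'], ['disney', 'france']]

# inverted index: term -> set of ids of the lists the term occurs in
index = defaultdict(set)
for list_id, words in enumerate(lists):
    for word in words:
        index[word].add(list_id)

# the co-occurrence count of two terms is the size of the intersection
# of their posting sets
for a, b in combinations(sorted(index), 2):
    together = len(index[a] & index[b])
    if together > 1:
        print(a, b, together)  # disney france 2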

I would use some kind of set logic. If things got big I'd push them to numpy.

100k lists really aren't that huge, especially if they're just single words (I just spent a week working on a 6 GB version of this problem with over 800 million entries). I'd be more concerned about how many LISTS you have.

This is obviously just a hack but it's in the direction of how I would solve this problem.

import itertools

a = ['disney','england','france']
b = ['disney','japan']
c = ['england', 'london']
d = ['disney', 'france']

g = [a, b, c, d]

for i in range(2, len(g) + 1):
    for ii in itertools.combinations(g, i):  # combinations of the lists in g, sizes 2 to len(g)
        rr = map(set, ii)
        ixx = None
        for ix in rr:
            if ixx is None:
                ixx = ix
                continue
            ixx = ixx & ix
        if len(ixx) > 1:
            print(ixx)

result: {'disney', 'france'}

Obviously this doesn't keep track of the frequencies. But that's easy after you've reduced the lists to combinational repeats.
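For instance, once a common set such as {'disney', 'france'} has fallen out, counting how many of the original lists contain it is a one-liner:

common = {'disney', 'france'}
frequency = sum(1 for lst in g if common <= set(lst))  # g is the list of lists from above
print(frequency)  # 2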

I've assumed that you're interested in relationships that span lists. If you're not, then I don't understand your question, nor why you have multiple lists.

Simple solution counting pairs (in pure Python)

from itertools import combinations, chain
from collections import Counter
from functools import partial

data = iter([
    'disney england france',
    'disney japan',
    'england london',
    'disney france',
])  # NOTE could have been: data = open('file.txt')

split_data = map(str.split, data)

pair = partial(combinations, r=2)
observed_pairs = chain.from_iterable(
    map(pair, split_data)
)
sorted_observed_pairs = map(sorted, observed_pairs)
hashable_sorted_observed_pairs = map(tuple, sorted_observed_pairs)
pair_count = lambda: Counter(hashable_sorted_observed_pairs)  # tada!
print(pair_count())
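The EDIT in the question also asks about triples; the same pipeline handles those by changing r in the combinations call. A tiny self-contained variation:

from itertools import combinations, chain
from collections import Counter

lines = ['disney england france', 'disney japan', 'england london', 'disney france']
triples = chain.from_iterable(combinations(sorted(line.split()), 3) for line in lines)
print(Counter(triples))  # Counter({('disney', 'england', 'france'): 1})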
