
Creating a dictionary with calculated values

I have a large text string and would like to create a dictionary whose key is a pair of words from the string (I have to go through all possible combinations) and whose value is the frequency of that pair. In effect it is a 2D matrix in which each element is a number: the frequency of the pair formed by the row word and the column word. The position of the words in the pair is irrelevant: e.g. if ridebike = 4 (a frequency), then bikeride = 4 as well.

The end result is to populate the matrix and then select the top N pairs.

I am new to working with text strings and to Python in general, and I got hopelessly lost (there are also way too many loops in my "code").

This is what I have (after deleting stopwords and punctuation):

textNP = 'stopped traffic bklyn  bqe  278 wb manhattan brtillary stx29  wb  cadman pla  hope  oufootball makes safe manhattan kansas tomorrow  boomersooner  beatwildcats  theyhateuscuztheyaintus  hatersgonnahate rt  bringonthecats  bring cats exclusive  live footage oklahoma trying get manhattan  http  colktsoyzvvz rt  jonfmorse  bring cats exclusive  live footage oklahoma trying get manhattan'

Some code (incomplete and wrong):

txtU = set(textNP)
lntxt = len(textNP)
lntxtS = len(txtU)

matrixNP = {}

for b1, i1 in txtU: 
    for b2, i2 in txtU:
        if i1< i2:
            bb1 = b1+b2
            bb2 = b2+b1

            freq = 0

            for k in textNP:
                for j in textNP:
                    if k < j:

                        kj = k+j
                        if kj == bb1 | kj == bb2:

                            freq +=1

            matrixNP[i1][i2] = freq
            matrixNP[i2][i1] = freq

        elif i1 == i2: matrixNP[i1][i1] = 1

One issue, I am certain, is that having so many loops is wrong. Also, I am not sure how to assign calculated keys (the concatenation of two words) to a dictionary (I think I got the values right).

The text string is not the final product: it will still be cleaned of numbers and a few other things with various regexes.

Your help will be very much appreciated!

Are you looking for all combinations of 2 words? If so, you can use itertools.combinations and a collections.Counter to do what you want:

>>> from itertools import combinations
>>> from collections import Counter
>>> N = 5
>>> c = Counter(tuple(sorted(a)) for a in combinations(textNP.split(), 2))
>>> c.most_common(N)
[(('manhattan', 'rt'), 8),
 (('exclusive', 'manhattan'), 8),
 (('footage', 'manhattan'), 8),
 (('manhattan', 'oklahoma'), 8),
 (('bring', 'manhattan'), 8)]
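
To make the role of `tuple(sorted(a))` more concrete, here is a minimal sketch of the same idea as a standalone script. The helper names `pair_counts` and `pair_frequency` and the sample sentence are made up for illustration; only `Counter` and `combinations` come from the answer above. Sorting each pair before counting is what makes the word order irrelevant, so a lookup for bike/ride and ride/bike hits the same key.

```python
from collections import Counter
from itertools import combinations


def pair_counts(text):
    """Count every unordered pair of word occurrences in `text`."""
    words = text.split()
    # Sorting each pair makes ('ride', 'bike') and ('bike', 'ride') the same key.
    return Counter(tuple(sorted(pair)) for pair in combinations(words, 2))


def pair_frequency(counts, first, second):
    """Look up a pair regardless of the order the two words are given in."""
    return counts[tuple(sorted((first, second)))]


sample = "ride bike ride bike park"  # hypothetical sample text
counts = pair_counts(sample)

print(pair_frequency(counts, "ride", "bike"))  # same value as the line below
print(pair_frequency(counts, "bike", "ride"))
```

Note that this counts pairs of word occurrences, so a word that appears several times contributes several pairs, and a repeated word also pairs with itself.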

Or are you looking for all pairs of consecutive words? Then you can create a pairwise function:

>>> from itertools import tee
>>> from collections import Counter
>>> def pairwise(iterable):
...     a, b = tee(iterable)
...     next(b, None)
...     return zip(a, b)    # use itertools.izip() in Python 2
...
>>> N = 5
>>> c = Counter(tuple(sorted(a)) for a in pairwise(textNP.split()))
>>> c.most_common(N)
[(('get', 'manhattan'), 2),
 (('footage', 'live'), 2),
 (('get', 'trying'), 2),
 (('bring', 'cats'), 2),
 (('exclusive', 'live'), 2)]
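
As an alternative sketch of the consecutive-pairs approach, the word list can be zipped with a copy of itself shifted by one position, which avoids `tee` when the words are already in a list. The function name `adjacent_pair_counts` and the sample text are hypothetical. On Python 3.10 or later, `itertools.pairwise` can be used in place of the hand-written helper.

```python
from collections import Counter


def adjacent_pair_counts(text):
    """Count unordered pairs of consecutive words in `text`."""
    words = text.split()
    # zip(words, words[1:]) yields (w0, w1), (w1, w2), ... for neighbouring words.
    return Counter(tuple(sorted(pair)) for pair in zip(words, words[1:]))


sample = "bring cats live footage bring cats"  # hypothetical sample text
counts = adjacent_pair_counts(sample)
top = counts.most_common(1)
print(top)
```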

Either way, I don't see a bike/ride pair in the results.
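
If the matrix-style lookup described in the question is still wanted, one possible sketch is a nested dictionary filled symmetrically, so that `matrix[a][b]` and `matrix[b][a]` always agree. This is an assumption about what the asker needs, not part of the original answer; `pair_matrix` and the sample text are illustrative names.

```python
from collections import Counter, defaultdict
from itertools import combinations


def pair_matrix(text):
    """Build a symmetric word-by-word co-occurrence matrix as nested dicts."""
    words = text.split()
    matrix = defaultdict(Counter)
    for a, b in combinations(words, 2):
        matrix[a][b] += 1
        if a != b:
            # Mirror the count so the lookup works in either order.
            matrix[b][a] += 1
    return matrix


sample = "ride bike ride bike park"  # hypothetical sample text
matrix = pair_matrix(sample)
print(matrix["ride"]["bike"], matrix["bike"]["ride"])
```

For selecting the top N pairs, the flat `Counter` keyed by sorted tuples is usually more convenient, since `most_common(N)` works on it directly.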


 