I have a large text string, and I would like to create a dictionary whose keys are pairs of words from the string (going through all possible combinations) and whose values are the frequency of each pair. In effect it is a 2D matrix in which each element is the frequency of the pair formed by the row word and the column word. The order of the words in a pair is irrelevant: e.g. if ridebike = 4 (a frequency), then bikeride = 4 as well.
The end result is to populate the matrix and then select the top N pairs.
I am new to working with text strings and with Python in general, and I got hopelessly lost (there are also far too many loops in my "code").
This is what I have (after deleting stopwords and punctuation):
textNP = 'stopped traffic bklyn bqe 278 wb manhattan brtillary stx29 wb cadman pla hope oufootball makes safe manhattan kansas tomorrow boomersooner beatwildcats theyhateuscuztheyaintus hatersgonnahate rt bringonthecats bring cats exclusive live footage oklahoma trying get manhattan http colktsoyzvvz rt jonfmorse bring cats exclusive live footage oklahoma trying get manhattan'
Some code (incomplete and wrong):
txtU = set(textNP)
lntxt = len(textNP)
lntxtS = len(txtU)
matrixNP = {}
for b1, i1 in txtU:
    for b2, i2 in txtU:
        if i1< i2:
            bb1 = b1+b2
            bb2 = b2+b1
            freq = 0
            for k in textNP:
                for j in textNP:
                    if k < j:
                        kj = k+j
                        if kj == bb1 | kj == bb2:
                            freq +=1
            matrixNP[i1][i2] = freq
            matrixNP[i2][i1] = freq
        elif i1 == i2: matrixNP[i1][i1] = 1
One of the issues, I am certain, is that having so many loops is wrong. Also, I am not sure how to assign calculated keys (concatenations of words) to a dictionary (I think I got the values right).
The text string is not the final product: it will still be cleaned of numbers and a few other things with various regexes.
Your help will be very much appreciated!
Are you looking for all combinations of two words? If so, you can use itertools.combinations and a collections.Counter to do what you want:
>>> from itertools import combinations
>>> from collections import Counter
>>> N = 5
>>> c = Counter(tuple(sorted(a)) for a in combinations(textNP.split(), 2))
>>> c.most_common(N)
[(('manhattan', 'rt'), 8),
(('exclusive', 'manhattan'), 8),
(('footage', 'manhattan'), 8),
(('manhattan', 'oklahoma'), 8),
(('bring', 'manhattan'), 8)]
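For a really large string, enumerating every pair of word positions grows quadratically with the number of words. A minimal sketch of an alternative, assuming the goal is exactly the unordered all-combinations count shown above: count each word once, then derive the pair counts from those frequencies. The function name `pair_counts_from_frequencies` is made up for illustration and is not part of any library.

```python
from collections import Counter
from itertools import combinations

def pair_counts_from_frequencies(text):
    """Count unordered word pairs without enumerating every position pair."""
    freq = Counter(text.split())
    pairs = Counter()
    # Two different words a and b form freq[a] * freq[b] position pairs.
    for a, b in combinations(sorted(freq), 2):
        pairs[(a, b)] = freq[a] * freq[b]
    # A repeated word pairs with itself once per pair of its occurrences.
    for word, n in freq.items():
        if n > 1:
            pairs[(word, word)] = n * (n - 1) // 2
    return pairs
```

Calling `pair_counts_from_frequencies(textNP).most_common(N)` should give the same counts as the `Counter` built from `combinations(textNP.split(), 2)`, although the order of tied entries may differ.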
Or are you looking for all pairs of consecutive words? In that case you can create a pairwise function:
>>> from itertools import tee
>>> from collections import Counter
>>> def pairwise(iterable):
...     a, b = tee(iterable)
...     next(b, None)
...     return zip(a, b) # itertools.izip() in python2
>>> N = 5
>>> c = Counter(tuple(sorted(a)) for a in pairwise(textNP.split()))
>>> c.most_common(N)
[(('get', 'manhattan'), 2),
(('footage', 'live'), 2),
(('get', 'trying'), 2),
(('bring', 'cats'), 2),
(('exclusive', 'live'), 2)]
Either way, I don't see a bike/ride pair in the list, since neither word appears in the sample text.
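If the symmetric matrix[word1][word2] layout from the question is still wanted, it can be built from either Counter above. This is only a sketch: `to_symmetric_matrix` is a hypothetical helper name, and the input below is hand-made (it is not taken from textNP) so the result is easy to check by eye.

```python
from collections import Counter, defaultdict

def to_symmetric_matrix(pair_counts):
    """Expand {(w1, w2): n} into matrix[w1][w2] == matrix[w2][w1] == n."""
    matrix = defaultdict(dict)
    for (w1, w2), n in pair_counts.items():
        matrix[w1][w2] = n
        matrix[w2][w1] = n
    return matrix

# Hand-made counts, standing in for the Counter built from the real text.
pairs = Counter({("bike", "ride"): 4, ("bike", "park"): 1})
matrix = to_symmetric_matrix(pairs)
print(matrix["ride"]["bike"])  # 4
print(matrix["bike"]["ride"])  # 4
```

Keeping the Counter keyed by sorted tuples is usually enough for picking the top N pairs. The nested dictionary only helps when lookups in both word orders are needed.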