I'm trying to create bigrams from the values of a dictionary under a specific condition. Below is an example of the dictionary:
dict_example = {'keywords1': ['africa',
                              'basic service',
                              'class',
                              'develop country',
                              'disadvantage',
                              'economic resource',
                              'social protection system']}
The specific condition is that I want to create bigrams only from elements that contain more than one word. Below is the code that I have been working on so far:
import nltk
from nltk.tokenize import word_tokenize

keywords_bigram_temp = {}
keywords_bigram = {}
for k, v in dict_example.items():
    keywords_bigram_temp.update({k: [word_tokenize(w) for w in v]})
for k2, v2 in keywords_bigram_temp.items():
    keywords_bigram.update({k2: [list(nltk.bigrams(v3)) for v3 in v2 if len(v3) > 1]})
This code works, but instead of returning a flat list of tuples (which I think is what bigrams normally look like), it returns tuples inside nested lists. Below is an example of the result:
{'keywords1': [[('basic', 'service')],
               [('develop', 'country')],
               [('economic', 'resource')],
               [('social', 'protection'),
                ('protection', 'system'),
                ('system', 'social'),
                ('social', 'protection')]]}
What I need is a normal bigram structure, a tuple within a list like so:
{'keywords1': [('basic', 'service'),
               ('develop', 'country'),
               ('economic', 'resource'),
               ('social', 'protection'),
               ('protection', 'system'),
               ('system', 'social'),
               ('social', 'protection')]}
One simple approach is to do the following:
bigrams = []
for string in dict_example['keywords1']:
    chunks = string.split()
    if len(chunks) > 1:
        bigrams.extend(zip(chunks, chunks[1:]))
res = {'keywords1': bigrams}
print(res)
Output:
{'keywords1': [('basic', 'service'), ('develop', 'country'), ('economic', 'resource'), ('social', 'protection'), ('protection', 'system')]}
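The same idea extends to every key in the dictionary rather than only 'keywords1'. The following is a minimal sketch, assuming that splitting on whitespace is an acceptable stand-in for word_tokenize; the helper name build_bigrams is hypothetical and the sample dictionary is shortened for illustration:

```python
def build_bigrams(keyword_dict):
    """Map each key to a flat list of adjacent-word pairs (bigrams)."""
    result = {}
    for key, phrases in keyword_dict.items():
        pairs = []
        for phrase in phrases:
            # Assumption: whitespace split instead of word_tokenize
            words = phrase.split()
            # Zipping with an offset copy yields nothing for single-word phrases
            pairs.extend(zip(words, words[1:]))
        result[key] = pairs
    return result


dict_example = {'keywords1': ['africa', 'basic service', 'social protection system']}
print(build_bigrams(dict_example))
# {'keywords1': [('basic', 'service'), ('social', 'protection'), ('protection', 'system')]}
```

Because extend() appends the tuples directly to one list per key, no extra flattening step is needed.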
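Here's a way to do this using itertools.combinations(). Note that it returns every pair of words in an element, not only adjacent ones, so three-word elements also produce pairs such as ('social', 'system'):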
from itertools import combinations
keywords_bigram = {'keywords1': [x for elem in dict_example['keywords1'] if ' ' in elem for x in combinations(elem.split(), 2)]}
Output:
{'keywords1': [('basic', 'service'), ('develop', 'country'), ('economic', 'resource'), ('social', 'protection'), ('social', 'system'), ('protection', 'system')]}
Explanation:
for elem in dict_example['keywords1'] if ' ' in elem
iterates over all items in the list associated with keywords1 that contain a ' ' character, meaning the element has more than one word.
for x in combinations(elem.split(), 2)
includes every unique combination of 2 words within the multi-word item.
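To make the behavior of combinations() concrete, it can be compared with pairing consecutive words on a three-word phrase. This is an illustrative sketch only; the variable names are made up for the example:

```python
from itertools import combinations

words = 'social protection system'.split()

# Every pair of words, kept in input order, including non-adjacent ones
all_pairs = list(combinations(words, 2))

# Only pairs of consecutive words
adjacent_pairs = list(zip(words, words[1:]))

print(all_pairs)
# [('social', 'protection'), ('social', 'system'), ('protection', 'system')]
print(adjacent_pairs)
# [('social', 'protection'), ('protection', 'system')]
```

For two-word phrases both approaches give the same single pair; they diverge only at three or more words.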