I have a dictionary of key:value pairs where each value is a list of strings:
dictionarylst = {0:["example inside some sentence", "something else", "some blah"], 1:["testing", "some other word"], 2:["a new expression", "my cat is cute"]}
I also have a list of words that can be single tokens or bigrams:
wordslist = ["expression 1", "my expression", "other", "blah"]
I am trying to match every word in wordslist against every string in every value of my dictionary. When there is a match, I want to replace just that pattern with a whitespace (keeping the rest of the text) and store the output in a new dictionary with the same keys.
This is what I have tried so far:
dictionarycleaned = {}
for key,value in dictionarylst.items():
    for text in value :
        for word in wordslist :
            if word in value :
                pattern = re.compile(r'\b({})\b'.format(word))
                matches = re.findall(pattern, text)
                dictionarycleaned[key] = [re.sub(i,' ', text) for i in matches]
            else :
                dictionarycleaned[key] = value
This matches only a small portion of the patterns in wordslist. I tried different variations, such as matching the pattern against the whole list of strings in every value, or iterating over wordslist before dictionarylst, but nothing seems to clean all my data (which is very large).
Thank you for your suggestions.
Try this:
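import re
import pprint

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
wordslist = ["expression 1", "my expression", "other", "blah"]

# Copy each list as well: dict.copy() alone is shallow and would share
# (and therefore modify) the lists of the original dictionary.
dictionarycleaned = {key: list(value) for key, value in dictionarylst.items()}
for key, value in dictionarylst.items():
    for n, text in enumerate(value):
        for word in wordslist:
            if word in text:
                # re.escape protects against regex metacharacters in the word.
                pattern = r"\b{}\b".format(re.escape(word))
                # Substitute on the already-cleaned text so earlier replacements are kept.
                dictionarycleaned[key][n] = re.sub(pattern, " ", dictionarycleaned[key][n])
pprint.pprint(dictionarycleaned)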
Output:
pako@b00s:~/tests$ python dict.py
{0: ['example inside some sentence', 'something else', 'some  '],
 1: ['testing', 'some   word'],
 2: ['a new expression', 'my cat is cute']}
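One point worth checking in a loop of this shape: when a single string contains several entries from wordslist, each substitution has to run on the result of the previous one, otherwise only the last replacement survives. A minimal sketch, where the helper name `remove_words` and the sample sentence are made up for illustration and do not come from the question:

```python
import re

def remove_words(text, words):
    # Hypothetical helper: replace each whole-word match with a single space,
    # feeding the output of one substitution into the next.
    for word in words:
        text = re.sub(r"\b{}\b".format(re.escape(word)), " ", text)
    return text

wordslist = ["expression 1", "my expression", "other", "blah"]
cleaned = remove_words("some other blah", wordslist)
print(repr(cleaned))
```

Both "other" and "blah" are gone from the result, and only "some" plus whitespace remains.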
Since this is a plain string replacement, and provided the words in wordslist cannot contain a double quote ("), you can simply create a JSON string from the dict, do the replacement on it, and regenerate the dict from the modified JSON string.
A sample program is given below:
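import json

d = {0:["example inside some sentence", "something else", "some blah"], 1:["testing", "some other word"], 2:["a new expression", "my cat is cute"]}
words = ["expression 1", "my expression", "other", "blah"]
json_str = json.dumps(d)
for w in words:
    # Plain substring replacement on the serialized dictionary
    json_str = json_str.replace(w, " ")
# JSON object keys are always strings, so convert them back to int
req_dict = {int(k): v for k, v in json.loads(json_str).items()}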
This way you avoid the nested loops over the dictionary values.
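One caveat with the JSON round trip is that JSON object keys are always strings, so the integer keys of the original dictionary come back as "0", "1", and so on unless they are converted. A minimal sketch of that behaviour, using a shortened dictionary made up for illustration (the names `raw` and `restored` are not from the answer):

```python
import json

d = {0: ["some blah"], 1: ["some other word"]}
words = ["other", "blah"]

json_str = json.dumps(d)
for w in words:
    json_str = json_str.replace(w, " ")

raw = json.loads(json_str)
# The keys are now strings, not the original integers.
print(list(raw.keys()))

# Convert the keys back to int to match the original dictionary.
restored = {int(k): v for k, v in raw.items()}
print(restored)
```

The int conversion only makes sense when every original key was an integer; with mixed key types the original types cannot be recovered from the JSON string alone.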
replace() is a built-in string method in Python that returns a copy of the string in which all occurrences of a substring are replaced with another substring. Example:
dictionarylst = {0:["example inside some sentence", "something else", "some blah"], 1:["testing", "some other word"], 2:["a new expression", "my cat is cute"]}
wordslist = ["expression 1", "my expression", "other", "blah"]
dictionarycleaned = {}

def match_pattern(wordslist, value):
    new_list = []
    for text in value:
        # temp holds the latest updated text
        temp = text
        for word in wordslist:
            if word in temp:
                # remove the word; use " " instead of "" to leave a space behind
                temp = temp.replace(word, "")
        new_list.append(temp)
    return new_list

for k, v in dictionarylst.items():
    dictionarycleaned[k] = match_pattern(wordslist, v)
print(dictionarycleaned)
Output:
{0: ['example inside some sentence', 'something else', 'some '], 1: ['testing', 'some  word'], 2: ['a new expression', 'my cat is cute']}
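Keep in mind that str.replace matches raw substrings, so an entry from the word list is also removed when it occurs inside a longer word. If whole-word matching matters, a regular expression with \b boundaries behaves differently. A small comparison, using a sample sentence that is not from the question:

```python
import re

text = "another word, other word"
word = "other"

# Plain substring replacement: also cuts "other" out of "another".
plain = text.replace(word, "")
print(repr(plain))

# Whole-word replacement: only the standalone "other" is removed.
whole = re.sub(r"\b{}\b".format(re.escape(word)), "", text)
print(repr(whole))
```

Which behaviour is wanted depends on the data; the question's own attempt uses \b, which suggests whole-word matching.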
Pako's answer is good, but you can optimize it further:
- Use precompiled regular expressions to do the replacement
- No need to create a copy of the dictionary: just replace the values with the new list
Full code:
import re
import pprint

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
wordslist = ["expression 1", "my expression", "other", "blah"]
# Compile each pattern once; re.escape guards against regex metacharacters.
regexs = [re.compile(r"\b{}\b".format(re.escape(word))) for word in wordslist]

def clean(text):
    # Apply every pattern in turn to the same text.
    for regex in regexs:
        text = regex.sub(" ", text)  # Pattern.sub(repl, string)
    return text

for key, value in dictionarylst.items():
    # Replace the value in place with the cleaned list.
    dictionarylst[key] = [clean(text) for text in value]
pprint.pprint(dictionarylst)
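Since the data is described as very large, another option is to combine the whole word list into a single alternation so that each string is scanned only once. This is a sketch under that assumption, not part of the answer above; the names `pattern` and `cleaned` are illustrative. Longer entries are placed first so that a bigram is preferred over a shorter entry starting at the same position:

```python
import re

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
wordslist = ["expression 1", "my expression", "other", "blah"]

# Build one pattern of the form \b(?:word1|word2|...)\b, longest entries first.
alternatives = [re.escape(w) for w in sorted(wordslist, key=len, reverse=True)]
pattern = re.compile(r"\b(?:{})\b".format("|".join(alternatives)))

# Build a new dictionary with the same keys; the original is left untouched.
cleaned = {
    key: [pattern.sub(" ", text) for text in value]
    for key, value in dictionarylst.items()
}
print(cleaned)
```

A single pass can give a different result from applying the patterns one after another when entries overlap in the text, so it is worth comparing both on a sample of the real data.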