
Python: find and replace patterns in dictionary values that are lists of strings

I have a dictionary of key:value pairs where each value is a list of strings:

dictionarylst = {0:["example inside some sentence", "something else", "some blah"], 1:["testing", "some other word"], 2:["a new expression", "my cat is cute"]}

I also have a list of words, which can be single tokens or bigrams:

wordslist = ["expression 1", "my expression", "other", "blah"]

I am trying to match every word in my wordslist against every text in every value of my dictionary. When there is a match, I want to replace just that pattern with a space (keeping the rest of the text) and store the output in a new dictionary with the same keys.

This is what I have tried so far:

dictionarycleaned = {}
for key,value in dictionarylst.items():
    for text in value :
        for word in wordslist :
            if word in value :
                pattern = re.compile(r'\b({})\b'.format(word))
                matches = re.findall(pattern, text)
                dictionarycleaned[key] = [re.sub(i,' ', text) for i in matches]
            else :
                dictionarycleaned[key] = value

This matches only a small portion of the patterns in my wordslist. I tried different variations, such as matching the pattern against the whole list of strings in every value, or iterating over wordslist before dictionarylst, but nothing seems to clean all of my data (which is very large).

Thank you for your suggestions.

Try this:

import copy
import re
import pprint

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
wordslist = ["expression 1", "my expression", "other", "blah"]

# Deep copy, so the inner lists of dictionarylst are left untouched.
dictionarycleaned = copy.deepcopy(dictionarylst)
for key, value in dictionarycleaned.items():
    for n, text in enumerate(value):
        for word in wordslist:
            if word in text:
                # Substitute into the already-cleaned text so earlier replacements are kept.
                # re.escape protects against regex metacharacters in the word.
                text = re.sub(r"\b{}\b".format(re.escape(word)), " ", text)
        value[n] = text

pprint.pprint(dictionarycleaned)

Output:

pako@b00s:~/tests$ python dict.py 
{0: ['example inside some sentence', 'something else', 'some  '],
 1: ['testing', 'some   word'],
 2: ['a new expression', 'my cat is cute']}
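
The key change from the code in the question is the test `word in text` instead of `word in value`. The question's loop also reassigns `dictionarycleaned[key]` on every iteration, so earlier results for the same key are overwritten. A minimal illustration of the membership difference, using one list from the question (the variable names `in_list` and `in_any_text` are only for this demo):

```python
value = ["testing", "some other word"]
word = "other"

# Membership on a list compares whole elements, so a substring never matches.
in_list = word in value
# Membership on a string is a substring test.
in_any_text = any(word in text for text in value)

print(in_list, in_any_text)
```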

Since this is a plain string replacement, and provided the words in wordslist cannot contain a double quote ("), you can simply create a JSON string from the dict, do the replacement on that string, and regenerate the dict from the modified JSON string.

A sample program is given below:

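import json

d = {0:["example inside some sentence", "something else", "some blah"], 1:["testing", "some other word"], 2:["a new expression", "my cat is cute"]}
words = ["expression 1", "my expression", "other", "blah"]

# Serialize the whole dict once, then run the replacements on that single string.
json_str = json.dumps(d)
for w in words:
  json_str = json_str.replace(w, " ")

# JSON object keys are always strings, so convert them back to int.
req_dict = {int(k): v for k, v in json.loads(json_str).items()}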

This way you avoid the nested loops.
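
One caveat with this approach, offered as a hedged aside: `json.dumps` escapes non-ASCII characters by default (`ensure_ascii=True`), so a word containing such characters would not be found in the JSON string. The sketch below uses made-up data and a hypothetical helper `clean_via_json` to show the difference; passing `ensure_ascii=False` is one way around it.

```python
import json

d = {0: ["un café noir", "something else"]}  # hypothetical data, not from the question
words = ["café"]                             # hypothetical word list


def clean_via_json(data, words, ensure_ascii):
    """Replace each word with a space via a JSON round trip (illustrative helper)."""
    json_str = json.dumps(data, ensure_ascii=ensure_ascii)
    for w in words:
        json_str = json_str.replace(w, " ")
    # JSON object keys are strings, so convert them back to int.
    return {int(k): v for k, v in json.loads(json_str).items()}


# With the default escaping, "café" is stored as "caf\u00e9" and is not matched.
escaped = clean_via_json(d, words, ensure_ascii=True)
# Without escaping, the word appears literally and is replaced.
unescaped = clean_via_json(d, words, ensure_ascii=False)

print(escaped)
print(unescaped)
```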

  • replace() is a built-in string method in Python that returns a copy of the string in which all occurrences of a substring are replaced with another substring.

Example:

dictionarylst = {0:["example inside some sentence", "something else", "some blah"],
                 1:["testing", "some other word"],
                 2:["a new expression", "my cat is cute"]}

wordslist = ["expression 1", "my expression", "other", "blah"]
dictionarycleaned = {}

def match_pattern(wordslist,value):
    new_list = []
    for text in value:
        # temp holds the latest updated text
        temp = text
        for word in wordslist:
            if word in temp:
                # remove the word from the text (use " " instead of "" to leave a space)
                temp = temp.replace(word,"")
        new_list.append(temp)
    return new_list


for k,v in dictionarylst.items():
    dictionarycleaned[k] = match_pattern(wordslist, v)

print(dictionarycleaned)

Output:

{0: ['example inside some sentence', 'something else', 'some '], 1: ['testing', 'some  word'], 2: ['a new expression', 'my cat is cute']}
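
Note that `str.replace` matches a substring anywhere, including inside longer words, whereas the `\b` anchors used in the regex-based answers restrict matches to whole words. A small comparison, using an example sentence that is not from the question:

```python
import re

text = "another word, other word"  # example sentence, not from the question
word = "other"

# Plain replacement also hits the "other" inside "another".
plain = text.replace(word, " ")
# Word boundaries limit the match to the standalone word.
whole_word = re.sub(r"\b{}\b".format(re.escape(word)), " ", text)

print(repr(plain))
print(repr(whole_word))
```

Which behaviour is wanted depends on the data; for token and bigram lists, the whole-word version is usually the safer default.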

Pako's answer is good, but you can optimize it further:

  • Compile the regular expressions once, up front, and use them for the replacement.
  • There is no need to create a copy of the dictionary: just replace each value with the new list.

Full code:

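import re
import pprint

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
wordslist = ["expression 1", "my expression", "other", "blah"]

# Compile each pattern once; re.escape guards against regex metacharacters.
regexs = [re.compile(r"\b({})\b".format(re.escape(word))) for word in wordslist]

for key, value in dictionarylst.items():
    cleaned = []
    for text in value:
        # Apply every pattern in turn to the same string (sub takes repl first, then the string).
        for regex in regexs:
            text = regex.sub(" ", text)
        cleaned.append(text)
    # Replace the value with the new list; the dictionary itself is reused.
    dictionarylst[key] = cleaned

pprint.pprint(dictionarylst)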
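
As a further variation (a sketch, not part of the answer above), the separate patterns can be merged into a single alternation so that each string is scanned once. The helper name `build_pattern` is made up for this example. Sorting longer entries first is a precaution so that a bigram is preferred over a shorter word that shares its prefix, and `re.escape` guards against regex metacharacters in the word list.

```python
import re

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
wordslist = ["expression 1", "my expression", "other", "blah"]


def build_pattern(words):
    """Build one whole-word alternation pattern (hypothetical helper)."""
    # Alternation tries branches left to right, so put longer entries first.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile(r"\b(?:{})\b".format("|".join(re.escape(w) for w in ordered)))


pattern = build_pattern(wordslist)

# Build a new dictionary with the same keys; the original is left unchanged.
dictionarycleaned = {
    key: [pattern.sub(" ", text) for text in value]
    for key, value in dictionarylst.items()
}

print(dictionarycleaned)
```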

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 