
Pandas combine unexpected splits of string and match to replace

I have a text paragraph like the one below in my PowerPoint file:

para = "XX NOV 2021, Time: xx:xx – xx:xx hrs (90mins)"

This para (string type) is split by PowerPoint's built-in logic, which breaks it into unexpected chunks like the ones below (I don't control this split logic). Though it is out of scope for this question, if you wish to know more about my problem, you can refer to this post.

split_list = ["XX Nov", "2021," ,"Time:xx", ":xx - xx:xx", " hrs (90mins)"]

My objective is to replace the keyword Nov 2021 in para with Nov 2022 (like doing CTRL+F and replace).

So I tried the following:

for i, s in enumerate(split_list):
    print(type(s))   # str type is returned
    cur_text = s
    new_text = cur_text.replace("Nov 2021", "Nov 2022")
    split_list[i] = new_text   # lists have no update(); assign by index
new_para = ' '.join(split_list)

As expected, this doesn't do the replacement: the search term Nov 2021 never matches, because the text is stored in separate strings such as XX Nov and 2021, and no single string contains the whole term.

How can we combine the previous N keywords with the current keyword in split_list and then do the replacement? N can range from 1 to 3.

Is there a Python for-loop solution (where we can look at the previous and current keywords at the same time)?

Please note that I cannot do the replacement on the input para, because that would lose all text formatting such as bold and italic. So the replacement has to be done on the keyword list (split_list).

Basically, I expect my final output to look like this:

para = "XX NOV 2022, Time: xx:xx – xx:xx hrs (90mins)"

Simple bigram generation of sequence:

sentence = " ".join(split_list)
size_n_gram = 2
words = sentence.split(" ")
n_gram_words = []
for i, _ in enumerate(words):
    if i + size_n_gram > len(words):
        # No full n-gram left to take
        break
    else:
        n_gram_words.append(words[i:i+size_n_gram])

# Here we hold all two word sequences and can then search in n_gram_words
# for which words to be replaced. Then combine the correct string.
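
The "search, then combine" step described in the comments above could look roughly like the sketch below. It works on the original chunks rather than on single words: for each position it joins a growing window of adjacent chunks (the current one plus up to max_prev neighbours), and as soon as the search term appears, it replaces it and collapses that window into one chunk. The function name replace_across_chunks and the parameters max_prev and sep are made up for illustration, and joining with a single space is an assumption based on the ' '.join in the question. Note the trade-off: chunks that get merged lose their individual boundaries, so in a formatted document the merged text would have to take the formatting of one of them.

```python
def replace_across_chunks(chunks, old, new, max_prev=3, sep=" "):
    """Replace `old` with `new` even when `old` straddles adjacent chunks.

    Hypothetical helper: chunks touched by a match are merged into one.
    """
    if not old:
        raise ValueError("search term must not be empty")
    result = []
    i = 0
    while i < len(chunks):
        # Try the narrowest window first so as few chunks as possible merge.
        for width in range(1, max_prev + 2):
            window = chunks[i:i + width]
            if len(window) < width:
                # Ran past the end of the list without finding a match.
                result.append(chunks[i])
                i += 1
                break
            joined = sep.join(window)
            pos = joined.find(old)
            # Accept only matches that begin in the first chunk of the window
            # (or in the separator right after it); later matches are picked
            # up when the loop reaches their own starting chunk.
            if pos != -1 and pos < len(window[0]) + len(sep):
                result.append(joined.replace(old, new))
                i += width
                break
        else:
            # Every window width was tried and none contained the term.
            result.append(chunks[i])
            i += 1
    return result


split_list = ["XX Nov", "2021,", "Time:xx", ":xx - xx:xx", " hrs (90mins)"]
new_list = replace_across_chunks(split_list, "Nov 2021", "Nov 2022")
print(new_list)
# ['XX Nov 2022,', 'Time:xx', ':xx - xx:xx', ' hrs (90mins)']
```

Here the first two chunks are merged into 'XX Nov 2022,' and the rest are left untouched. If the number of chunks has to stay the same (for example, one chunk per formatted run), a variation would be to put the merged text into the first chunk of the window and leave empty strings in the others.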
