Pandas combine unexpected splits of string and match to replace

I have a text paragraph like the one below in my PowerPoint file:

para = "XX NOV 2021, Time: xx:xx – xx:xx hrs (90mins)"

This para (a string) is split by PowerPoint's built-in logic, which I don't control, and the result is an unexpected split of para into keywords like the ones below. Although it is out of scope for this question, you can refer to this post here if you wish to know more about my problem.

split_list = ["XX Nov", "2021," ,"Time:xx", ":xx - xx:xx", " hrs (90mins)"]

Now, my objective is to replace the keyword Nov 2021 in para with Nov 2022 (like doing CTRL+F and replace).

So, I tried the following:

for i, s in enumerate(split_list):
    print(type(s))   # str type is returned
    cur_text = s
    new_text = cur_text.replace("Nov 2021", "Nov 2022")
    split_list[i] = new_text   # write the text back into the list
new_para = ' '.join(split_list)

As expected, this doesn't do the replacement: my search term Nov 2021 never finds a match because the strings are stored separately as XX Nov, 2021, and so on.

How can we combine the previous N keywords with the current keyword in split_list and then do the replacement? N can range from 1 to 3.

Is there a Python for-loop solution (where we can look at the previous and current keywords at the same time) or something similar?

Please note that I cannot do the replacement on the input para because that would lose all text formatting such as bold and italic. So the replacement has to be done on the keyword list (split_list).

Basically, I expect my final output to be like the one below.

para = "XX NOV 2022, Time: xx:xx – xx:xx hrs (90mins)"

Simple bigram generation of sequence:

sentence = " ".join(split_list)
size_n_gram = 2
words = sentence.split()  # split on whitespace, dropping empty strings
n_gram_words = []
for i, _ in enumerate(words):
    if i + size_n_gram > len(words):
        # No full n-gram starts at or after this index
        break
    else:
        n_gram_words.append(words[i:i + size_n_gram])

# n_gram_words now holds every two-word sequence. Search it for the words
# to be replaced, then combine the result back into the correct string.
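The answer stops at generating the n-grams, so here is a minimal sketch of the remaining search-and-replace step, applied directly to the chunks of split_list rather than to individual words. The helper name replace_across_chunks and its parameters max_prev and sep are hypothetical, not part of any library. It assumes the chunks are compared as if joined with a single space, as in the question's own ' '.join, and it writes the merged text into the first chunk of a matching window while blanking the rest. That keeps the list length unchanged, but the merged text would take on the formatting of that first chunk.

```python
def replace_across_chunks(chunks, old, new, max_prev=3, sep=" "):
    """Replace `old` with `new` even when `old` straddles adjacent chunks.

    Hypothetical helper: looks at each chunk together with up to `max_prev`
    neighbouring chunks and returns a new list of the same length.
    """
    result = list(chunks)   # work on a copy
    touched = set()         # indices already rewritten; never revisited
    # Try the narrowest window first so as few chunks as possible get merged.
    for width in range(1, max_prev + 2):
        for start in range(len(result) - width + 1):
            span = range(start, start + width)
            if touched.intersection(span):
                continue
            joined = sep.join(result[i] for i in span)
            if old not in joined:
                continue
            # Put the replaced text in the first chunk, blank the others.
            result[start] = joined.replace(old, new)
            for i in span[1:]:
                result[i] = ""
            touched.update(span)
    return result


split_list = ["XX Nov", "2021,", "Time:xx", ":xx - xx:xx", " hrs (90mins)"]
new_list = replace_across_chunks(split_list, "Nov 2021", "Nov 2022")
# Skip the blanked chunks when rebuilding the paragraph text.
new_para = " ".join(chunk for chunk in new_list if chunk)
```

With the question's data, the match is found in the two-chunk window made of "XX Nov" and "2021,", so only those two chunks change. A window that overlaps an already rewritten chunk is skipped, which means a second occurrence straddling such a chunk would be missed; that trade-off keeps the sketch short.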
