
regex to prepend NOT between word and punctuation

I am trying to reproduce a classic tokenization trick with regular expressions, to handle sentences like this one:

"I didn't like that SO question, but I like pizza!"

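The solution proposed in the literature is actually quite simple: prepend NOT_ to every token between "didn't" and the next punctuation mark. In our example, the sentence becomes: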

"I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!"

How can we do this using Python or a regex?

Thanks!

import re

text = "I didn't like that SO question, but I like pizza!"

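# Capture everything after "didn't" up to (but not including) the next punctuation mark.
regex = re.compile(r"(?<=didn't)[^,.?:!;]+")

segment = regex.search(text).group(0)

# Prefix each word in the captured segment with NOT_.
result = text.replace(segment, segment.replace(' ', ' NOT_'))

print(result)
# I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!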

Tokenize with a regular expression, then split and join like this:

import re
sentence = "I didn't like that SO question, but I like pizza!"
words = re.split("([,.?:!;]|didn't)", sentence)
not_sentence = "".join([word if (idx == 0 or words[idx-1] != "didn't")
                        else re.sub(r"(\w+)", "NOT_\\1", word)
                        for idx, word in enumerate(words)])
print(not_sentence)
# I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!
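
Both answers above hard-code "didn't". As a rough sketch of how the same idea might generalize to other negation words, the function below tags every word between a negation and the next punctuation mark. The name `mark_negation` and the choice of negation words ("not", "no", "never", and anything ending in "n't") are assumptions made for illustration; they do not come from the question or the answers.

```python
import re

# Assumption: these count as negation words. Adjust the pattern for your data.
NEGATION = r"\b(?:not|no|never|\w+n't)\b"
# Same punctuation set as the split-based answer above.
PUNCT = ",.?:!;"

# Group 1: the negation word. Group 2: everything up to the next punctuation mark.
SCOPE = re.compile(rf"({NEGATION})([^{PUNCT}]*)", re.IGNORECASE)


def mark_negation(sentence: str) -> str:
    """Prefix NOT_ to each word between a negation and the next punctuation."""
    def tag(match: re.Match) -> str:
        negation, scope = match.group(1), match.group(2)
        # Treat apostrophes as part of a word so "he's" stays a single token.
        tagged = re.sub(r"\w[\w']*", lambda w: "NOT_" + w.group(0), scope)
        return negation + tagged

    return SCOPE.sub(tag, sentence)


print(mark_negation("I didn't like that SO question, but I like pizza!"))
# I didn't NOT_like NOT_that NOT_SO NOT_question, but I like pizza!
```

A second negation inside an already negated span is simply tagged like any other word; whether that is the desired behavior depends on the downstream task.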

