Find new inserted words in text file
I want to find the new words that were inserted into a text file, using Python. For example:
Old: He is a new employee here.
New: He was a new, employee there.
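I want this list of words as output: ['was', ',', 'there']
I tried difflib, but it reports the diff in an awkward format that uses '+', '-' and '?' markers, so I would have to parse its output to find the new words. Is there a simple way to do this in Python?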
You can do this with the re module.
import re

# create a regular expression object that matches a whole word or a comma
regex = re.compile(r'\b\w+\b|,')

# the inputs
old = "He is a new employee here."
new = "He was a new, employee there."

# create lists of the words (or commas) in each sentence
old_words = regex.findall(old)
new_words = regex.findall(new)

# collect every word from new_words that is not in old_words;
# removing each matched word from old_words means a word that already
# existed but was added again is also reported
word_differences = []
for word in new_words:
    if word in old_words:
        old_words.remove(word)
    else:
        word_differences.append(word)

# print it out to verify
print(word_differences)  # ['was', ',', 'there']
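Note that if you want to pick up other punctuation (such as an exclamation mark or a semicolon), you must add it to the regular expression definition. As written, it only matches words or commas.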
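As a rough sketch of that extension, the pattern below swaps the lone comma for a small character class of punctuation marks. The name `TOKEN_RE`, the helper `inserted_tokens`, and the particular set of marks are illustrative assumptions, not part of the answer above; adjust the class to whatever punctuation you need to track.

```python
import re

# Assumed extension: a word, or any single mark from a small punctuation set.
TOKEN_RE = re.compile(r"\w+|[,;:!?.]")


def inserted_tokens(old, new):
    """Return tokens of `new` that have no unmatched counterpart in `old`."""
    remaining = TOKEN_RE.findall(old)  # old tokens not yet matched
    added = []
    for token in TOKEN_RE.findall(new):
        if token in remaining:
            # Consume one occurrence so repeated words are counted correctly.
            remaining.remove(token)
        else:
            added.append(token)
    return added


print(inserted_tokens("He is a new employee here.",
                      "He was a new, employee there!"))
```

Because the full stop is now a token too, a sentence that keeps its final period reports nothing extra, while a changed ending such as "!" shows up in the result.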
I used Google's diff-match-patch. It works fine.
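For readers who would rather stay within the standard library, a diff-based approach can also be sketched with `difflib.SequenceMatcher` run over token lists instead of raw text, which avoids parsing the '+', '-' and '?' output mentioned in the question. This is not the diff-match-patch API; the helper name `inserted_by_diff` and the tokenizing pattern are assumptions made for illustration.

```python
import difflib
import re

# Same tokenization idea as the regex answer: a word or a comma.
TOKEN_PATTERN = r"\w+|,"


def inserted_by_diff(old, new):
    """Return tokens that a sequence diff marks as inserted or replaced in `new`."""
    old_tokens = re.findall(TOKEN_PATTERN, old)
    new_tokens = re.findall(TOKEN_PATTERN, new)
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    added = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'insert' adds new tokens; 'replace' swaps old tokens for new ones.
        if tag in ("insert", "replace"):
            added.extend(new_tokens[j1:j2])
    return added


print(inserted_by_diff("He is a new employee here.",
                       "He was a new, employee there."))
```

Unlike the membership check in the regex answer, this version is position-aware: a word that merely moved to a different place in the sentence may be reported as inserted.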