Finding newly inserted words in a text file

Question

I want to use Python to find the words that were newly inserted into a text file. For example:

Old: He is a new employee here.
New: He was a new, employee there.

I want this list of words as output: ['was', ',', 'there']

I tried difflib, but it reports the diff in an awkward format using '+', '-' and '?' markers, so I would have to parse its output to find the new words. Is there an easy way to get this done in Python?

Answer 1

You can accomplish this with the re module.

import re

# create a regular expression object
regex = re.compile(r'(?:\b\w{1,}\b)|,')

# the inputs
old = "He is a new employee here."
new = "He was a new, employee there."

# creating lists of the words (or commas) in each sentence
old_words = re.findall(regex, old)
new_words = re.findall(regex, new)

# generate a list of words from new_words if it isn't in the old words
# also checking for words that previously existed but are then added
word_differences = []
for word in new_words:
    if word in old_words:
        old_words.remove(word)
    else:
        word_differences.append(word)

# print it out to verify
print(word_differences)

Note that if you want to detect other punctuation, such as an exclamation mark or a semicolon, you must add it to the regular expression. As written, the pattern only matches words and commas.
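As a rough sketch of that extension, the variant below swaps in a broader pattern that treats any single non-word, non-whitespace character as its own token, and uses a Counter instead of mutating the old word list. The pattern, the function name added_tokens, and the use of Counter are assumptions made for illustration; they are not part of the answer above.

```python
import re
from collections import Counter

# Assumed pattern: a run of word characters, or any single character that is
# neither a word character nor whitespace (i.e. a punctuation mark).
TOKEN = re.compile(r"\w+|[^\w\s]")

def added_tokens(old, new):
    """Return the tokens of `new` that are not accounted for by `old`, in order."""
    # Count how many times each token is available in the old text.
    remaining = Counter(TOKEN.findall(old))
    added = []
    for token in TOKEN.findall(new):
        if remaining[token] > 0:
            # This occurrence already existed in the old text; use it up.
            remaining[token] -= 1
        else:
            added.append(token)
    return added

print(added_tokens("He is a new employee here.", "He was a new, employee there."))
# ['was', ',', 'there']
```

Like the original loop, this compares the texts as bags of tokens and ignores position, so a word that was merely moved is not reported as new.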

Answer 2

I used Google's diff-match-patch library. It works fine.
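The answer gives no code, and the following does not use the diff-match-patch API. As a standard-library stand-in for a position-aware diff, this sketch runs difflib.SequenceMatcher over word tokens and reads the inserted tokens from get_opcodes(), which avoids parsing the '+', '-' and '?' text output. The tokenizing pattern and the function name inserted_tokens are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumed tokenizer: words, or single punctuation characters.
TOKEN_PATTERN = r"\w+|[^\w\s]"

def inserted_tokens(old, new):
    """Return tokens that appear in `new` as insertions or replacements."""
    old_tokens = re.findall(TOKEN_PATTERN, old)
    new_tokens = re.findall(TOKEN_PATTERN, new)
    # Compare the two token lists; autojunk is disabled so that frequent
    # tokens in long inputs are not silently ignored.
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    added = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' both contribute new tokens from new_tokens[j1:j2].
        if tag in ("insert", "replace"):
            added.extend(new_tokens[j1:j2])
    return added

print(inserted_tokens("He is a new employee here.", "He was a new, employee there."))
# ['was', ',', 'there']
```

Because this approach respects token order, a word that moves to a different position may be reported as inserted, unlike a purely count-based comparison.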

Answer 1
0 2016-10-29 05:08:52

Answer 2
0 2016-10-29 05:57:22
