Find new inserted words in text file

I want to find the new words that were inserted into a text file, using Python. For example:

Old: He is a new employee here.
New: He was a new, employee there.

I want this list of words as output: ['was', ',', 'there']

I used difflib, but it gives me the diff in an awkward format using '+', '-' and '?'. I would have to parse the output to find the new words. Is there an easy way to get this done in Python?

You can accomplish this with the re module.

import re

# compile a pattern that matches either a whole word or a comma
regex = re.compile(r'\b\w+\b|,')

# the inputs
old = "He is a new employee here."
new = "He was a new, employee there."

# creating lists of the words (or commas) in each sentence
old_words = re.findall(regex, old)
new_words = re.findall(regex, new)

# collect each token in new_words that has no remaining match in old_words;
# matched tokens are removed from old_words, so a repeated word counts as
# new once its earlier occurrences have been used up
word_differences = []
for word in new_words:
    if word in old_words:
        old_words.remove(word)
    else:
        word_differences.append(word)

# print it out to verify
print(word_differences)  # ['was', ',', 'there']

Note that if you want to catch other punctuation, such as an exclamation mark or semicolon, you must add it to the regular expression. As written, it only matches words and commas.
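As a rough sketch of that extension, the pattern below treats any single non-space, non-word character as its own token, so punctuation does not have to be listed one by one. The names TOKEN_RE and inserted_tokens are made up for this example, and a Counter stands in for the list removal used above; it is one possible variant, not the only way to do it.

```python
import re
from collections import Counter

# Hypothetical pattern: a run of word characters, or any single
# character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def inserted_tokens(old, new):
    """Return the tokens of `new` that are not matched by a token of `old`.

    Each token in `old` can cancel out only one identical token in `new`,
    so a repeated word is reported once the old occurrences run out.
    """
    remaining = Counter(TOKEN_RE.findall(old))
    added = []
    for token in TOKEN_RE.findall(new):
        if remaining[token] > 0:
            remaining[token] -= 1   # matched an old token, not new
        else:
            added.append(token)     # no old token left to match
    return added

print(inserted_tokens("He is a new employee here.",
                      "He was a new, employee there."))
```

Like the loop above, this ignores word order: it only compares how many times each token occurs in the two texts.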

I used Google's diff-match-patch library. It works fine.
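For a diff-based approach that needs no third-party package, a similar idea can be sketched with the standard library's difflib.SequenceMatcher run on token lists instead of raw text, which avoids parsing the '+', '-' and '?' output. This is a stand-in, not the API of the library mentioned above; the function name inserted_by_diff and the tokenizing pattern are assumptions made for the example.

```python
import difflib
import re

def inserted_by_diff(old, new):
    """Return tokens that a sequence diff marks as inserted or replaced in `new`."""
    # Assumed tokenizer: words, or single punctuation characters.
    tokenize = re.compile(r"\w+|[^\w\s]").findall
    old_tokens, new_tokens = tokenize(old), tokenize(new)

    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' are the opcodes that introduce tokens from `new`.
        if tag in ("insert", "replace"):
            added.extend(new_tokens[j1:j2])
    return added

print(inserted_by_diff("He is a new employee here.",
                       "He was a new, employee there."))
```

Unlike the counting approach, a sequence diff is sensitive to position, so a word that merely moved may be reported as inserted.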
