How do I remove specific words from sentences in a text file?

Question

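I have two text files. The first contains English sentences, and the second contains a large list of English words (a vocabulary). I want to remove from the sentences in the first file every word that is not in the vocabulary, then save the processed text back to the first file.

I have written code that finds the sentences containing words that are not in the second file (the vocabulary).

Here is my code: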

s = open('eng.txt').readlines()

for i in s:
    print(i)
    for word in i.split(' '):
        print(word)
        if word in open("vocab30000.txt").read():
            print("Word exist in vocab")
        else:
            #print("I:", i)
            print("Word does not exist")
            #search_in_file_func(i)
            print("I:", i)
            file1 = open("MyFile.txt","a+")
            if i in file1:
                print("Sentence already exist")
            else:
                file1.write(i)

However, I can't work out how to remove those words.

Answer 1

This should work:

with open('vocab30000.txt') as f:
    vocabulary = set(word.strip() for word in f.readlines())

with open('eng.txt', 'r+') as f:
    data = [line.strip().split(' ') for line in f.readlines()]
    removed = [[word for word in line if word in vocabulary] for line in data]
    result = '\n'.join(' '.join(word for word in line) for line in removed)
    f.seek(0)
    f.write(result)
    f.truncate()
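
The same filtering step can be sketched as a pure function, which makes it easy to try on a few in-memory lines before overwriting the file. The name `filter_lines` and the sample data are hypothetical, chosen only for illustration:

```python
def filter_lines(lines, vocabulary):
    """Keep only the words of each line that appear in vocabulary."""
    vocab = set(vocabulary)  # set membership is a fast whole-word test
    # split() with no argument also discards the trailing newline
    return [' '.join(w for w in line.split() if w in vocab) for line in lines]


sample = ["the cat sat on the zzyzx mat\n", "qwerty dog\n"]
known = ["the", "cat", "sat", "on", "mat", "dog"]
print(filter_lines(sample, known))
```

Note that this comparison is case-sensitive and treats attached punctuation as part of the word, so "mat." would not match "mat".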

Answer 2

# Read the two files
with open('vocab30000.txt') as f:
    vocabulary = set(f.read().split())

with open('eng.txt') as f:
    eng = f.readlines()

cleaned_sentences = []
# Loop over the sentences and keep only the words found in the vocabulary
for sent in eng:
    cleaned_sentences.append(" ".join(w for w in sent.split() if w in vocabulary) + "\n")

# Write the cleaned sentences back to the first file
with open('eng.txt', 'w') as f:
    f.writelines(cleaned_sentences)
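
A small sketch of why a set of whole words is preferable to testing against the raw file text, as the code in the question does with `word in open("vocab30000.txt").read()`. On a string, `in` is a substring test, so a word can appear to exist merely because it is part of a longer entry. The variable names and sample vocabulary here are made up for illustration:

```python
vocab_text = "there\ncatalog\n"       # stand-in for the contents of vocab30000.txt
vocab_set = set(vocab_text.split())   # {"there", "catalog"}

print("the" in vocab_text)   # True: "the" is a substring of "there"
print("the" in vocab_set)    # False: "the" is not one of the listed words
```

Building the set once, outside the loop, also avoids re-reading the vocabulary file for every word.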

Answer 3

You can try this code. I tried to avoid explicit loops to save running time in case your files are large.

import re

with open('eng.txt', 'r') as f:
    s = f.read()

with open('vocab30000.txt', 'r') as f:
    check_words = set(f.read().split())

# Every distinct word in the text, ignoring punctuation
s_words = set(re.findall(r"\w+", s))

# Words that appear in the text but not in the vocabulary
remove_words = s_words - check_words

if remove_words:
    # Escape each word and match it together with any spaces or tabs after it
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, remove_words)) + r")\b[ \t]*")
    s = pattern.sub("", s)

# Save the processed text back to the first file
with open('eng.txt', 'w') as f:
    f.write(s)
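
A possible variation on the regex approach, assuming punctuation should be kept and matching should ignore case: pass a replacement function to `re.sub` so each word is looked up in a set, instead of building one large alternation. The name `keep_known_words` and the sample sentence are hypothetical:

```python
import re


def keep_known_words(text, vocabulary):
    """Delete every word of text that is not in vocabulary (case-insensitive)."""
    vocab = {w.lower() for w in vocabulary}

    def replace(match):
        word = match.group(1)
        # Keep the word and its trailing spaces if known, otherwise drop both
        return match.group(0) if word.lower() in vocab else ""

    # A word (letters, digits, apostrophes) plus any spaces or tabs after it
    return re.sub(r"\b([\w']+)\b[ \t]*", replace, text)


print(keep_known_words("The quick zzyzx fox.", ["the", "quick", "fox"]))
```

Because newlines are never matched, the line structure of the file is preserved.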
