
How to delete specific words from sentences in a text file?

I have two text files. The 1st file contains English sentences and the 2nd file contains a list of English words (a vocabulary). I want to remove from the sentences in the 1st file any words that are not present in the vocabulary, and then save the processed text back into the 1st file.

I wrote code that can find the sentences containing words that are not in the 2nd file (the vocabulary).

Here is my code:

s = open('eng.txt').readlines()

for i in s:
    print(i)
    for word in i.split(' '):
        print(word)
        if word in open("vocab30000.txt").read():
            print("Word exist in vocab")
        else:
            #print("I:", i)
            print("Word does not exist")
            #search_in_file_func(i)
            print("I:", i)
            file1 = open("MyFile.txt","a+")
            if i in file1:
                print("Sentence already exist")
            else:
                file1.write(i)

However, I am not able to remove those words.
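
One thing worth noting about the check in the code above (a small illustrative snippet of my own, not from the original post): word in open("vocab30000.txt").read() is a substring test against the whole file, so a word can appear to "exist in vocab" merely because it occurs inside a longer vocabulary word:

vocab_text = "category nation"              # hypothetical file contents
print("cat" in vocab_text)                  # True: substring match inside "category"
print("cat" in set(vocab_text.split()))     # False: "cat" is not an actual vocabulary word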

This should work:

with open('vocab30000.txt') as f:
    vocabulary = set(word.strip() for word in f.readlines())

with open('eng.txt', 'r+') as f:
    data = [line.strip().split(' ') for line in f.readlines()]
    removed = [[word for word in line if word in vocabulary] for line in data]
    result = '\n'.join(' '.join(word for word in line) for line in removed)
    f.seek(0)
    f.write(result)
    f.truncate()
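
As a quick sanity check of the filtering step (the vocabulary and sentences below are made up for illustration):

vocabulary = {"the", "cat", "sat"}
data = [["the", "cat", "sat", "down"], ["a", "black", "cat"]]
kept = [[word for word in line if word in vocabulary] for line in data]
print(kept)   # [['the', 'cat', 'sat'], ['cat']]

Note that the membership test is exact and case-sensitive, so a word with punctuation attached (for example "cat,") will not match the bare vocabulary entry.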
#Read the two files

with open('vocab30000.txt') as f:
    vocabulary = set(f.read().split())

with open('eng.txt') as f:
    eng = [line.strip() for line in f.readlines()]

sentences = [line.split(" ") for line in eng]

cleaned_sentences = []
# loop over the sentences and keep only the words that are in the vocabulary
for sent in sentences:
    cleaned_sentences.append(" ".join([word for word in sent if word in vocabulary]))

#write the cleaned sentences back to the first file
with open('eng.txt', 'w') as f:
    f.write("\n".join(cleaned_sentences))
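
A side note on the design (my own addition, not part of the answer): loading the vocabulary into a set is what keeps this fast, because a set membership test is a hash lookup rather than a scan of all 30,000 entries:

vocabulary_list = ["the", "cat", "sat"]   # hypothetical vocabulary
vocabulary_set = set(vocabulary_list)

print("sat" in vocabulary_list)   # True, but checks the entries one by one
print("sat" in vocabulary_set)    # True, single hash lookup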

You can try this code. I tried to avoid explicit loops to keep the runtime down in case you have larger files.

import re

with open('eng.txt', 'r') as f:
    s = f.read()
s_copy = s

punctuation = [".", "\"", ",", "-", "(", ")", "[", "]"]

# strip punctuation from the copy so the words can be split cleanly
pattern = re.compile("|".join(re.escape(p) for p in punctuation))
s_copy = pattern.sub(" ", s_copy)
s_words = set(s_copy.split())

with open('vocab30000.txt', 'r') as f:
    check_words = set(f.read().split())

# words that appear in the sentences but not in the vocabulary
remove_words = list(s_words - check_words)

if remove_words:
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in remove_words) + r")\b", re.I)
    s = pattern.sub("", s)
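
The substitution above only updates the string in memory; to save the processed text back into eng.txt, as the question asks, it still needs to be written out:

with open('eng.txt', 'w') as f:
    f.write(s)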
