How to remove English words from a file containing Dari words?
How can I find English words and remove them from a file that contains Dari words? I tried the code below, but I don't know how to complete it.
inp = open('Dari.pos', 'r')
out = open('DariNER.txt', 'w')
for line in iter(inp):
    # ------------?
    out.write(word)
inp.close()
out.close()
You can install and use the nltk library. It gives you a list of English words and a way to split each line into words:
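from nltk.tokenize import word_tokenize
from nltk.corpus import words

# Build a lowercased set once so each membership test is a fast hash lookup
english = set(word.lower() for word in words.words())

# encoding='utf-8' is an assumption about how Dari.pos is stored
with open('Dari.pos', encoding='utf-8') as f_input, open('DariNER.txt', 'w', encoding='utf-8') as f_output:
    for line in f_input:
        # Keep only the tokens that are not in the English word list
        kept = [word for word in word_tokenize(line) if word.lower() not in english]
        f_output.write(' '.join(kept) + '\n')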
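After installing nltk, you should run:
import nltk
nltk.download()
and use the downloader it opens to fetch the words corpus. word_tokenize also needs NLTK's Punkt tokenizer data, which can be downloaded the same way.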
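If installing nltk is not an option, a script-based filter may be enough. The following is a minimal sketch, not the answer's method: it assumes that every English word is written in ASCII Latin letters and that Dari text uses the Arabic script, so any whitespace-separated token containing a Latin letter is dropped. The names strip_latin_tokens and clean_file are hypothetical, and the UTF-8 encoding is an assumption about the file.

```python
import re

# Assumption: a token is "English" if it contains at least one ASCII letter.
LATIN_LETTER = re.compile(r"[A-Za-z]")


def strip_latin_tokens(line):
    """Return the line without any whitespace-separated token that contains a Latin letter."""
    kept = [token for token in line.split() if not LATIN_LETTER.search(token)]
    return " ".join(kept)


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path line by line, dropping Latin-script tokens."""
    # Assumption: both files are UTF-8 encoded.
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_latin_tokens(line) + "\n")
```

Calling clean_file('Dari.pos', 'DariNER.txt') would process the files from the question. Unlike the dictionary lookup, this also removes Latin-script tokens that are not real English words, such as tag names or abbreviations, which may or may not be what you want.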
infile = "Dari.pos"
outfile = "Cleaned_English_Tags.txt"
delete_list = ['NOUN', 'ADJ', 'PUNCT', 'INTJ', 'ADV', 'VERB', 'X', 'CCONJ', 'ADP', 'AUX', 'SCONJ', 'PRON', 'DET', 'NUM', 'AU']

with open(infile) as fin, open(outfile, 'w') as fout:
    for line in fin:
        # str.replace matches substrings, so 'X' is removed before 'AUX' is tried;
        # the trailing 'AU' entry cleans up what is left of 'AUX'
        for word in delete_list:
            line = line.replace(word, " ")
        fout.write(line)
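Because the loop above replaces substrings, it would also alter any longer token that happens to contain one of the tags. A whole-token variant can be sketched with a regular expression. This is an assumption-laden alternative, not the answer's code: it assumes each tag is delimited by whitespace or punctuation other than an underscore, and the names TAGS, TAG_PATTERN, and remove_tags are hypothetical.

```python
import re

# Same tags as in delete_list, without the 'AU' workaround entry.
TAGS = ['NOUN', 'ADJ', 'PUNCT', 'INTJ', 'ADV', 'VERB', 'X', 'CCONJ',
        'ADP', 'AUX', 'SCONJ', 'PRON', 'DET', 'NUM']

# \b matches only at word boundaries, so 'X' cannot match inside 'AUX'.
TAG_PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(tag) for tag in TAGS) + r")\b")


def remove_tags(line):
    """Replace every standalone tag in the line with a single space."""
    return TAG_PATTERN.sub(" ", line)
```

A tag glued directly to a Dari word or joined to it with an underscore would not be removed by this pattern, since no word boundary separates them.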