Read text file and look for certain words from key word list
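I am new to Python and I am trying to build a script that imports text_file_1, which contains a body of text. I want the script to read the text and look for certain words that I have defined in a list called key_words; the list contains words that start with an uppercase letter (Nation) and with a lowercase letter (nation). After Python has done the search, it should write the words vertically into a new text file called "List of words", each with the number of times it appears in the text. If I then read text_file_2, which contains another body of text, it should do the same thing but add its results to the list of words from the first file.
Example:
List of words
File 1: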
God: 5
Nation: 4
creater: 8
USA: 3
File 2:
God: 10
Nation: 14
creater: 2
USA: 1
This is what I have so far:
from sys import argv
from string import punctuation
script = argv[0]
all_filenames = argv[1:]
print "Text file to import and read: " + all_filenames
print "\nReading file...\n"
text_file = open(all_filenames, 'r')
all_lines = text_file.readlines()
#print all_lines
text_file.close()
for all_filenames in argv[1:]:
    print "I get: " + all_filenames
print "\nFile read finished!"
#print "\nYour file contains the following text information:"
#print "\n" + text_file.read()
#~ for word, count in word_freq.items():
#~     print word, count
keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
            'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence',
            'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
            'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
            'constitution', 'Government', 'Citizens', 'citizens']
for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )
output_file = open("List_of_words.txt", "w")
for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )
output_file.close()
Maybe this code could be used somehow?
import fileinput
for line in fileinput.input('List_of_words.txt', inplace = True):
    if line.startswith('Existing file that was read'):
        #if line starts Existing file that was read then do something here
        print "Existing file that was read"
    elif line.startswith('New file that was read'):
        #if line starts with New file that was read then do something here
        print "New file that was read"
    else:
        print line.strip()
This way you get the results on screen:
from sys import argv
from collections import Counter
from string import punctuation
script, filename = argv
text_file = open(filename, 'r')
# count every word in the file, with surrounding punctuation stripped
word_freq = Counter([word.strip(punctuation) for line in text_file for word in line.split()])
text_file.close()
#~ for word, count in word_freq.items():
#~     print word, count
key_words = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater',
             'Country', 'country', 'People', 'people', 'Liberty', 'liberty',
             'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage']
# print only the key words that actually occur in the text
for word in key_words:
    if word in word_freq:
        print word, word_freq[word]
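To see what the Counter expression above produces, here is a minimal sketch written for Python 3 (the code in this thread is Python 2). The helper name count_key_words and the sample sentence are made up for illustration; only Counter, split() and strip(punctuation) come from the answer.

```python
from collections import Counter
from string import punctuation


def count_key_words(text, key_words):
    """Return {key_word: count} for the key words that occur in text.

    Hypothetical helper: words are split on whitespace and surrounding
    punctuation is stripped, as in the Counter expression of the answer.
    """
    word_freq = Counter(
        word.strip(punctuation)
        for line in text.splitlines()
        for word in line.split()
    )
    return {word: word_freq[word] for word in key_words if word in word_freq}


sample = "One nation, under God. A Nation of people; the people's nation!"
result = count_key_words(sample, ["God", "Nation", "nation", "people", "USA"])
print(result)
```

Note that strip(punctuation) only removes punctuation at the ends of a word, so "people's" is counted as its own word and not as "people", and the match is case-sensitive, which is why the key word list carries both "Nation" and "nation".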
Now you have to save it in a file.
For more files use
for filename in argv[1:]:
    # do your job
EDIT:
With this code (my_script.py)
from sys import argv
for filename in argv[1:]:
    print("I get " + filename)
you can run the script
python my_script.py file1.txt file2.txt file3.txt
and get
I get file1.txt
I get file2.txt
I get file3.txt
You can use it to count words in many files.
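As a rough illustration of combining that loop with the word count, the following Python 3 sketch builds one Counter per file. The function name count_words_per_file and the two throwaway demo files are assumptions made for this example; in a real script the list of names would come from argv[1:].

```python
import os
import tempfile
from collections import Counter
from string import punctuation


def count_words_per_file(filenames):
    """Return {filename: Counter} with one word count per input file."""
    counts = {}
    for filename in filenames:
        with open(filename, "r") as text_file:
            counts[filename] = Counter(
                word.strip(punctuation)
                for line in text_file
                for word in line.split()
            )
    return counts


# Demo with two throwaway files standing in for file1.txt and file2.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    path_1 = os.path.join(tmp_dir, "file1.txt")
    path_2 = os.path.join(tmp_dir, "file2.txt")
    with open(path_1, "w") as f:
        f.write("God bless the USA. God bless the people.\n")
    with open(path_2, "w") as f:
        f.write("The people of the USA.\n")
    per_file = count_words_per_file([path_1, path_2])
    god_counts = [per_file[path]["God"] for path in (path_1, path_2)]

print(god_counts)
```

A Counter returns 0 for a word it has never seen, so looking up "God" in the second file does not raise an error.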
——
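Using readlines() reads all lines into memory, so it needs more memory; for very, very large files that can be a problem.
In the current version, Counter() counts all words in all lines (test it) but uses less memory, because it iterates over the file object line by line.
So with readlines() you get the same word_freq, but you use more memory.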
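The claim that both approaches give the same word_freq can be checked with a small Python 3 sketch. The two function names and the temporary demo file are hypothetical; the counting expression is the one used in the answer.

```python
import os
import tempfile
from collections import Counter
from string import punctuation


def count_streaming(filename):
    """Count words while iterating over the file object line by line."""
    with open(filename, "r") as text_file:
        return Counter(
            word.strip(punctuation) for line in text_file for word in line.split()
        )


def count_with_readlines(filename):
    """Count words after loading every line into a list with readlines()."""
    with open(filename, "r") as text_file:
        all_lines = text_file.readlines()  # the whole file is held in memory
    return Counter(
        word.strip(punctuation) for line in all_lines for word in line.split()
    )


# Write a small throwaway file, count it both ways, then clean up.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("Liberty and freedom.\nFreedom, liberty, freedom!\n")
try:
    streaming = count_streaming(demo_path)
    loaded = count_with_readlines(demo_path)
finally:
    os.remove(demo_path)

same_result = streaming == loaded
print(same_result)
```

The generator expression in count_streaming also avoids building an intermediate list of all words, which the list comprehension in the answer still does.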
——
writelines(list_of_result)
does not add "\n" after each line, and it does not add the ":" in "God: 3" either.
It is better to use something like this:
output_file = open("List_of_words.txt", "w")
for word in key_words:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )
output_file.close()
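If you still prefer writelines(), it works as long as every string already contains the colon and the trailing newline. The Python 3 sketch below assumes a hypothetical helper format_counts and uses io.StringIO in place of the real output file so that it can be run without touching the disk.

```python
import io


def format_counts(key_words, word_freq):
    """Build one 'word: count' line, newline included, per key word found."""
    return [
        "%s: %d\n" % (word, word_freq[word])
        for word in key_words
        if word in word_freq
    ]


# Made-up counts for the demo.
word_freq = {"God": 3, "nation": 2, "the": 10}
lines = format_counts(["God", "Nation", "nation"], word_freq)

# io.StringIO stands in for the real output file in this sketch.
output_file = io.StringIO()
output_file.writelines(lines)  # writelines() writes the strings unchanged
report = output_file.getvalue()
print(report)
```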
EDIT: new version - appends the results to the end of List_of_words.txt
from sys import argv
from string import punctuation
from collections import Counter
# note: 'United States' can never match, because the text is split into single words
keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
            'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence',
            'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
            'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
            'constitution', 'Government', 'Citizens', 'citizens']
for one_filename in argv[1:]:
    print "Text file to import and read:", one_filename
    print "\nReading file...\n"
    text_file = open(one_filename, 'r')
    all_lines = text_file.readlines()
    text_file.close()
    print "\nFile read finished!"
    # a fresh Counter for every file
    word_freq = Counter([word.strip(punctuation) for line in all_lines for word in line.split()])
    print "Append result to the end of file: List_of_words.txt"
    # mode "a" appends, so results from earlier files are kept
    output_file = open("List_of_words.txt", "a")
    for word in keyWords:
        if word in word_freq:
            output_file.write( "%s: %d\n" % (word, word_freq[word]) )
    output_file.close()
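The version above appends the counts without saying which file they belong to. To get the "File 1:" / "File 2:" layout from the question, one possible approach is to write a label before each block. This is only a Python 3 sketch; the function name append_counts, the labels and the demo counts are assumptions, and the output goes to a temporary directory here.

```python
import os
import tempfile


def append_counts(output_path, label, key_words, word_freq):
    """Append a labelled block of 'word: count' lines to output_path."""
    with open(output_path, "a") as output_file:  # "a" keeps earlier content
        output_file.write("%s:\n" % label)
        for word in key_words:
            if word in word_freq:
                output_file.write("%s: %d\n" % (word, word_freq[word]))


# Demo: two appends to the same file, then read the result back.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "List_of_words.txt")
    append_counts(out_path, "File 1", ["God", "USA"], {"God": 5, "USA": 3})
    append_counts(out_path, "File 2", ["God", "USA"], {"God": 10, "USA": 1})
    with open(out_path) as f:
        appended_report = f.read()

print(appended_report)
```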
EDIT: write the sum of the results into one file
from sys import argv
from string import punctuation
from collections import Counter
keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
            'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence',
            'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
            'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
            'constitution', 'Government', 'Citizens', 'citizens']
# one Counter shared by all files
word_freq = Counter()
for one_filename in argv[1:]:
    print "Text file to import and read:", one_filename
    print "\nReading file...\n"
    text_file = open(one_filename, 'r')
    all_lines = text_file.readlines()
    text_file.close()
    print "\nFile read finished!"
    # update() adds this file's counts to the running totals
    word_freq.update( [word.strip(punctuation) for line in all_lines for word in line.split()] )
# after the loop: write the totals once, overwriting any old file
print "Write sum of results: List_of_words.txt"
output_file = open("List_of_words.txt", "w")
for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )
output_file.close()
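The summing relies on Counter.update() adding to the existing counts instead of replacing them, unlike dict.update(). A tiny Python 3 sketch with made-up word lists standing in for two files:

```python
from collections import Counter

word_freq = Counter()
word_freq.update(["God", "nation", "God"])  # words from a first file
word_freq.update(["God", "USA"])            # words from a second file

total_god = word_freq["God"]
print(total_god)
```

Adding two Counter objects with + gives the same totals, if you would rather keep one Counter per file and combine them at the end.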