
Read text file and look for certain words from key word list

I am new to Python, and I am trying to build a script that imports text_file_1, which contains a body of text. I want the script to read the text and look for certain words that I have defined in a list called key_words, which contains words both capitalized (Nation) and in lowercase (nation). After the search, the script should write the words vertically to a new text file called "List of Words", along with the number of times each word occurs in the body. If I then read text_file_2, which has its own body of text, the script should do the same, but ADD the results to the List of Words from the original file.


List of Words

File 1:

God: 5
Nation: 4
creater: 8
USA: 3 

File 2:

God: 10
Nation: 14
creater: 2
USA: 1

Here is what I have so far:

from sys import argv
from string import punctuation
from collections import Counter

script = argv[0]
all_filenames = argv[1:]

keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
             'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence', 
             'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
             'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
             'constitution', 'Government', 'Citizens', 'citizens']

# running totals across all files given on the command line
word_freq = Counter()

for one_filename in all_filenames:
    print "Text file to import and read: " + one_filename
    print "\nReading file...\n"
    text_file = open(one_filename, 'r')
    all_lines = text_file.readlines()
    text_file.close()
    #print all_lines

    # count each word, stripped of surrounding punctuation
    word_freq.update([word.strip(punctuation) for line in all_lines for word in line.split()])

print "\nFile read finished!"

#~ for word, count in word_freq.items():
    #~ print word, count

output_file = open("List_of_words.txt", "w")

for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )

output_file.close()


Maybe use this code somehow?

import fileinput
for line in fileinput.input('List_of_words.txt', inplace = True):
    if line.startswith('Existing file that was read'):
        #if line starts Existing file that was read then do something here
        print "Existing file that was read"
    elif line.startswith('New file that was read'):
        #if line starts with New file that was read then do something here
        print "New file that was read"
        print line.strip()

This version prints the result on the screen.

from sys import argv
from collections import Counter
from string import punctuation

script, filename = argv

text_file = open(filename, 'r')

word_freq = Counter([word.strip(punctuation) for line in text_file for word in line.split()])

#~ for word, count in word_freq.items():
    #~ print word, count

key_words = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater',
             'Country', 'country', 'People', 'people', 'Liberty', 'liberty',
             'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage']

for word in key_words:
    if word in word_freq:
        print word, word_freq[word]
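
To see what the Counter line does, here is a minimal sketch that runs the same split-and-strip counting on a few in-memory lines instead of a file. The sample text and the count_words helper are made up for illustration, and the sketch uses print() with a single argument so that it also runs on Python 3.

```python
from collections import Counter
from string import punctuation


def count_words(lines):
    """Count whitespace-separated words after stripping surrounding punctuation."""
    return Counter(word.strip(punctuation) for line in lines for word in line.split())


# Hypothetical sample text, standing in for the lines of a file.
sample_lines = [
    "One Nation, under God.",
    "A nation of people; a Nation of laws!",
]

word_freq = count_words(sample_lines)

# Only the key words that actually occur are printed.
for word in ["God", "Nation", "nation", "USA"]:
    if word in word_freq:
        print("%s: %d" % (word, word_freq[word]))
```

Note that strip(punctuation) only removes punctuation at the start and end of a word, and that "Nation" and "nation" are counted separately because the comparison is case-sensitive.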

Now you have to save the result to a file.

To process more than one file, loop over the arguments:

for filename in argv[1:]:
   # do your job


With this code (my_script.py)

from sys import argv

for filename in argv[1:]:
   print( "I get " + filename )

You can run the script

python my_script.py file1.txt file2.txt file3.txt 

and get

I get file1.txt 
I get file2.txt 
I get file3.txt 

You can use it to count words in many files.


readlines() reads all lines into memory at once, so it needs more memory; for a very, very big file that can be a problem.

The version above iterates over the file object instead, which reads one line at a time. Counter() still counts all the words in all the lines (test it), but uses less memory.
So with readlines() you get the same word_freq, but you use more memory.
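
One caveat: the list comprehension inside Counter([...]) still builds a list of every word before counting. A possible refinement, sketched below, is to pass a generator expression so that words are counted as they are read. The count_words_in_file helper and the demo file are assumptions made for this example, not part of the original script.

```python
import os
import tempfile
from collections import Counter
from string import punctuation


def count_words_in_file(filename):
    """Count words line by line without building an intermediate list."""
    with open(filename, "r") as text_file:
        # Generator expression: Counter consumes the words one at a time,
        # and the file is closed automatically when the block ends.
        return Counter(
            word.strip(punctuation) for line in text_file for word in line.split()
        )


# Hypothetical demo file so the sketch is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as demo_file:
    demo_file.write("God bless the Nation.\nThe Nation is brave.\n")

word_freq = count_words_in_file(demo_path)
print("Nation: %d" % word_freq["Nation"])
```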


writelines(list_of_result) will not add "\n" after every line, and it will not add the ':' in "God: 3".

It is better to use something similar to

output_file = open("List_of_words.txt", "w")

for word in key_words:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )

output_file.close()
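
The difference between writelines() and a formatted write can be checked with a small experiment. This is only a sketch: the results list and the temporary output path are hypothetical stand-ins for the real counts and for List_of_words.txt.

```python
import os
import tempfile

# Hypothetical results, standing in for the counted key words.
results = [("God", 3), ("Nation", 2)]

out_path = os.path.join(tempfile.mkdtemp(), "List_of_words.txt")

# writelines() writes the strings exactly as given: no separator, no newline.
with open(out_path, "w") as output_file:
    output_file.writelines([word for word, count in results])

with open(out_path) as check_file:
    joined = check_file.read()
print(repr(joined))

# Formatting each line yourself gives the "word: count" layout.
with open(out_path, "w") as output_file:
    output_file.writelines("%s: %d\n" % (word, count) for word, count in results)

with open(out_path) as check_file:
    formatted = check_file.read()
print(formatted)
```

The first read shows the words run together on one line; the second shows one "word: count" entry per line.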


EDIT: new version - it appends the result to the end of List_of_words.txt

from sys import argv
from string import punctuation
from collections import Counter

keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
             'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence', 
             'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
             'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
             'constitution', 'Government', 'Citizens', 'citizens']

for one_filename in argv[1:]:

    print "Text file to import and read:", one_filename
    print "\nReading file...\n"

    text_file = open(one_filename, 'r')
    all_lines = text_file.readlines()
    text_file.close()

    print "\nFile read finished!"

    word_freq = Counter([word.strip(punctuation) for line in all_lines for word in line.split()])

    print "Append result to the end of file: List_of_words.txt"

    output_file = open("List_of_words.txt", "a")

    for word in keyWords:
        if word in word_freq:
            output_file.write( "%s: %d\n" % (word, word_freq[word]) )

    output_file.close()
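
The question's sample output also labels each block ("File 1:", "File 2:"), which the appending version does not write. A minimal sketch of one way to add those labels follows. The append_counts helper, the shortened key word list, and the demo files are assumptions made for this example, and it is written so that it also runs on Python 3.

```python
import os
import tempfile
from collections import Counter
from string import punctuation

key_words = ["God", "Nation", "nation", "USA"]  # shortened list for the demo


def append_counts(filenames, output_path):
    """Append one labelled block of key-word counts per input file."""
    for number, filename in enumerate(filenames, 1):
        with open(filename, "r") as text_file:
            word_freq = Counter(
                word.strip(punctuation) for line in text_file for word in line.split()
            )
        with open(output_path, "a") as output_file:
            output_file.write("File %d:\n\n" % number)
            for word in key_words:
                if word in word_freq:
                    output_file.write("%s: %d\n" % (word, word_freq[word]))
            output_file.write("\n")


# Hypothetical input files so the sketch runs on its own.
work_dir = tempfile.mkdtemp()
paths = []
for name, text in [("file1.txt", "God and Nation. Nation!"), ("file2.txt", "USA, God.")]:
    path = os.path.join(work_dir, name)
    with open(path, "w") as handle:
        handle.write(text)
    paths.append(path)

output_path = os.path.join(work_dir, "List_of_words.txt")
append_counts(paths, output_path)

with open(output_path) as handle:
    report = handle.read()
print(report)
```

The numbering here restarts at 1 on every run, so if the script is run once per file, the label would have to come from somewhere else, for example the file name.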


EDIT: write the sum of the results to one file

from sys import argv
from string import punctuation
from collections import Counter

keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
             'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence', 
             'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
             'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
             'constitution', 'Government', 'Citizens', 'citizens']

word_freq = Counter()

for one_filename in argv[1:]:

    print "Text file to import and read:", one_filename
    print "\nReading file...\n"

    text_file = open(one_filename, 'r')
    all_lines = text_file.readlines()
    text_file.close()

    print "\nFile read finished!"

    word_freq.update( [word.strip(punctuation) for line in all_lines for word in line.split()] )

print "Write sum of results: List_of_words.txt"

output_file = open("List_of_words.txt", "w")

for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )

output_file.close()
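
One limitation of counting with line.split(): a key word that contains a space, such as 'United States', never appears as a single token, so it is never counted. A possible workaround, sketched here under the assumption that plain substring counting is acceptable for such phrases, is to count them in the whole text with str.count(). The sample text and the shortened key word list are made up for the demo.

```python
from collections import Counter
from string import punctuation

key_words = ["Nation", "United States", "USA"]  # shortened list for the demo

# Hypothetical text, standing in for the contents of one file.
text = "The United States is a Nation. The United States, or USA."

# Single words: same split-and-strip counting as above.
word_freq = Counter(word.strip(punctuation) for word in text.split())

# Phrases containing a space never appear as a single split() token,
# so count them as substrings of the whole text instead.
for phrase in key_words:
    if " " in phrase:
        word_freq[phrase] = text.count(phrase)

for word in key_words:
    if word_freq[word]:
        print("%s: %d" % (word, word_freq[word]))
```

Substring counting would also match inside longer words, so it is only used for the phrases here, and a phrase split across a line break would be missed if the text were processed line by line.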


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.
