I have written a program that reads the words of a text file (in my case "elquijote.txt") and uses a dictionary {key: value} to show each word that appears together with its number of occurrences.
For example, for a file "test1.txt" with the following words:
hello hello hello good bye bye
The output of my program is:
hello 3
good 1
bye 2
The program can also show only the words that appear more often than a number passed as an argument.
If we run the command "python readingwords.py test1.txt 2" in the shell, it shows the words in the file "test1.txt" that appear more times than the number we entered, in this case 2.
Output:
hello 3
We can also pass a third argument: a file of common words, such as determiners and conjunctions, which are so generic that we do not want them shown or added to our dictionary.
My code works correctly; the problem is that with huge files, such as "elquijote.txt", it takes a long time to complete.
I have been thinking about it, and I believe the cause is the way I use my auxiliary lists to eliminate words.
As a solution, I have thought of not adding to my lists the words that appear in the txt file passed as an argument, which contains the words to discard.
Here is my code:
# -*- coding: utf-8 -*-
import sys

def contar(aux):
    counts = {}
    for palabra in aux:
        palabra = palabra.lower()
        if palabra not in counts:
            counts[palabra] = 0
        counts[palabra] += 1
    return counts

def main():
    characters = '!?¿-.:;-,><=*»¡'
    aux = []
    counts = {}

    with open(sys.argv[1],'r') as f:
        aux = ''.join(c for c in f.read() if c not in characters)
        aux = aux.split()

    if (len(sys.argv)>3):
        with open(sys.argv[3], 'r') as f:
            remove = "".join(c for c in f.read())
            remove = remove.split()

        # Delete from the file
        for word in aux:
            if word in remove:
                aux.remove(word)

    counts = contar(aux)

    for word, count in counts.items():
        if count > int(sys.argv[2]):
            print word, count

if __name__ == '__main__':
    main()
The contar function adds the words to the dictionary.
The main function fills an "aux" list with the words after stripping the symbol characters, and then deletes from that same list the "forbidden" words loaded from another .txt file.
I think the correct solution would be to discard the forbidden words at the same point where I discard the symbols that are not accepted, but after trying several ways I have not managed to do it correctly.
Here you can test my code online: https://repl.it/Nf3S/54 Thanks.
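Here are a couple of optimisations: count with collections.Counter, delete the unwanted characters with a single translate() call, load the words to discard into a set, and pop them from the counts instead of removing them from the list.
This speeds things up a little, but not by an order of magnitude.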
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import os
import collections

def contar(aux):
    return collections.Counter(aux)

def main():
    characters = '!?¿-.:;-,><=*»¡'
    aux = []
    counts = {}

    with open(sys.argv[1],'r') as f:
        text = f.read().lower().translate(None, characters)
        aux = text.split()

    if (len(sys.argv)>3):
        with open(sys.argv[3], 'r') as f:
            remove = set(f.read().strip().split())
    else:
        remove = []

    counts = contar(aux)
    for r in remove:
        counts.pop(r, None)

    for word, count in counts.items():
        if count > int(sys.argv[2]):
            print word, count

if __name__ == '__main__':
    main()
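Note that translate(None, characters) is the Python 2 byte-string form of the call. A minimal Python 3 sketch of the same idea follows, assuming the text has already been decoded to str. The helper name count_words and its ignore parameter are hypothetical, introduced only for illustration:

```python
from collections import Counter

CHARACTERS = '!?¿-.:;-,><=*»¡'
# Python 3 counterpart of translate(None, chars): a table that deletes chars
DELETE_TABLE = str.maketrans('', '', CHARACTERS)


def count_words(text, ignore=()):
    """Count lowercased words, dropping punctuation and ignored words.

    Hypothetical helper, not part of the answer's code.
    """
    words = text.lower().translate(DELETE_TABLE).split()
    counts = Counter(words)
    for word in ignore:
        counts.pop(word, None)  # no error if the word never appeared
    return counts


if __name__ == '__main__':
    counts = count_words('Hello, hello! hello good bye; bye', ignore={'good'})
    for word, count in counts.items():
        if count > 1:
            print(word, count)
```

Popping the ignored words from the finished counts costs one dictionary operation per ignored word, regardless of how long the input text is.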
There are a few inefficiencies here. I've rewritten your code to take advantage of some of these optimizations. The reasoning for each change is in the comments / docstrings:
# -*- coding: utf-8 -*-
import sys
from collections import Counter

def contar(aux):
    """Here I replaced your hand made solution with the
    built-in Counter which is quite a bit faster.
    There's no real reason to keep this function, I left it to keep your code
    interface intact.
    """
    return Counter(aux)

def replace_special_chars(string, chars, replace_char=" "):
    """Replaces a set of characters by another character, a space by default
    """
    for c in chars:
        string = string.replace(c, replace_char)
    return string

def main():
    characters = '!?¿-.:;-,><=*»¡'
    aux = []
    counts = {}

    with open(sys.argv[1], "r") as f:
        # You were calling lower() once for every `word`. Now we only
        # call it once for the whole file:
        contents = f.read().strip().lower()
        contents = replace_special_chars(contents, characters)
        aux = contents.split()

    # Delete from the file
    if len(sys.argv) > 3:
        with open(sys.argv[3], "r") as f:
            # what you had here was very inefficient:
            # remove = "".join(c for c in f.read())
            # that would create an array of characters then join them together as a string.
            # this is a bit silly because it's identical to f.read():
            # "".join(c for c in f.read()) === f.read()
            ignore_words = set(f.read().strip().split())
            """ignore_words is a `set` to allow for very fast inclusion/exclusion checks"""
        aux = (word for word in aux if word not in ignore_words)

    counts = contar(aux)

    for word, count in counts.items():
        if count > int(sys.argv[2]):
            print word, count

if __name__ == '__main__':
    main()
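To make the set-versus-list point concrete, here is a small Python 3 sketch; both function names are hypothetical. It also shows a side effect of calling aux.remove(word) while looping over aux: after each removal the loop skips the next element, so consecutive forbidden words can survive.

```python
def remove_in_place(words, forbidden):
    """Mimics the original loop: mutates the list while iterating over it."""
    words = list(words)  # work on a copy so the caller's list is untouched
    for word in words:
        if word in forbidden:
            words.remove(word)  # O(n) scan, and shifts later items left
    return words


def filter_with_set(words, forbidden):
    """Single pass with O(1) average-time membership tests."""
    forbidden = set(forbidden)
    return [word for word in words if word not in forbidden]


if __name__ == '__main__':
    sample = ['a', 'the', 'the', 'b']
    print(remove_in_place(sample, ['the']))  # one 'the' is skipped and kept
    print(filter_with_set(sample, ['the']))
```

The in-place version does a linear scan for every membership test and every removal, which is where the time goes on a large file; the set-based filter touches each word once.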
A few changes and reasoning:
Moved the command line arguments under __name__ == '__main__': By doing this you enforce modularity of your code because it only asks for command line arguments when you run this script itself as opposed to importing the function from another script.
Used a regex to filter out words that do not match [aA-zZ0-9]+.
Used try/except blocks to attempt to define the minimum count as sys.argv[2] and catch the exception of an IndexError to default the minimum count to 0.
Python script:
# sys
import sys
# regex
import re

def main(text_file, min_count):
    word_count = {}

    with open(text_file, 'r') as words:
        # Clean words of linebreaks and split
        # by ' ' to get list of words
        words = words.read().strip().split(' ')

        # Filter words that are not alphanum
        pattern = re.compile(r'^[aA-zZ0-9]+$')
        words = filter(pattern.search,words)

        # Iterate through words and collect
        # count
        for word in words:
            if word in word_count:
                word_count[word] = word_count[word] + 1
            else:
                word_count[word] = 1

    # Iterate for output
    for word, count in word_count.items():
        if count > min_count:
            print('%s %s' % (word, count))

if __name__ == '__main__':
    # Get text file name
    text_file = sys.argv[1]

    # Attempt to get minimum count
    # from command line.
    # Default to 0
    try:
        min_count = int(sys.argv[2])
    except IndexError:
        min_count = 0

    main(text_file, min_count)
Text file:
hello hello hello good bye goodbye !bye bye¶ b?e goodbye
Command:
python script.py text.txt
Output:
bye 1
good 1
hello 3
goodbye 2
With minimum count command:
python script.py text.txt 2
Output:
hello 3
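Note that the pattern in this script discards a whole token as soon as it contains a non-alphanumeric character, which is why "!bye" and "bye¶" are not counted as "bye" in the output above. If stripping the punctuation is preferred instead, one alternative (not what the script above does) is to extract alphanumeric runs with re.findall. A minimal Python 3 sketch, with a hypothetical function name:

```python
import re
from collections import Counter

WORD = re.compile(r'[A-Za-z0-9]+')  # ASCII letters and digits only


def count_alnum_runs(text):
    """Count every maximal run of ASCII alphanumeric characters."""
    return Counter(WORD.findall(text))


if __name__ == '__main__':
    text = 'hello hello hello good bye goodbye !bye bye¶ b?e goodbye'
    for word, count in count_alnum_runs(text).items():
        print(word, count)
```

With this approach "b?e" is split into "b" and "e", so it trades one kind of noise for another. The character class is ASCII-only and would also split accented words; for Spanish text such as "elquijote.txt", the pattern \w+ (which matches Unicode word characters on str in Python 3) may be a better starting point.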