Reading words from txt file - Python

Question

I have written a program that reads the words of a text file (in my case "elquijote.txt") and uses a dictionary {key: value} to show each word that appears and its number of occurrences.

For example, for a file "test1.txt" with the following words:

hello hello hello good bye bye

The output of my program is:

 hello 3
 good  1
 bye   2

Another option of the program is to show only the words that appear more times than a number passed as an argument.

If we run the command "python readingwords.py test1.txt 2" in the shell, it shows the words in the file "test1.txt" that appear more times than the number we entered, in this case 2.

Output:

hello 3

We can also pass a third argument: a file of common words, such as determiners and conjunctions, which are so generic that we do not want them shown or added to our dictionary.

My code works correctly; the problem is that with huge files, such as "elquijote.txt", it takes a long time to finish.

I have been thinking about it, and I believe the cause is the way I use my auxiliary lists to eliminate words.

As a solution, I have thought of not adding to my lists the words that appear in the txt file passed as an argument, which contains the words to discard.

Here is my code:

# -*- coding: utf-8 -*-
import sys

def contar(aux):
  counts = {}
  for palabra in aux:
    palabra = palabra.lower()
    if palabra not in counts:
      counts[palabra] = 0
    counts[palabra] += 1
  return counts

def main():

  characters = '!?¿-.:;-,><=*»¡'
  aux = []
  counts = {}

  with open(sys.argv[1],'r') as f:
    aux = ''.join(c for c in f.read() if c not in characters)
    aux = aux.split()

  if (len(sys.argv)>3):
    with open(sys.argv[3], 'r') as f:
      remove = "".join(c for c in f.read())
      remove = remove.split()

    #Borrar del archivo  
    for word in aux:  
      if word in remove:
        aux.remove(word) 

  counts = contar(aux)

  for word, count in counts.items():
    if count > int(sys.argv[2]):
      print word, count

if __name__ == '__main__':
    main()

The contar function adds the words to the dictionary.

The main function fills an "aux" list with the words stripped of symbol characters, and then deletes from that same list the "forbidden" words loaded from another .txt file.

I think the correct solution would be to discard the forbidden words at the same point where I discard the unaccepted symbols, but after trying several ways I have not managed to do it correctly.

Here you can test my code online: https://repl.it/Nf3S/54 Thanks.

Answer 1

Here are a few optimisations:

Use collections.Counter() to count items in contar().
Use str.translate() to remove unwanted chars.
Pop items from the ignore word list after the count, rather than stripping them from the original text.

This speeds things up a little, but not by an order of magnitude.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import collections

def contar(aux):
    return collections.Counter(aux)

def main():

  characters = '!?¿-.:;-,><=*»¡'
  aux = []
  counts = {}

  with open(sys.argv[1],'r') as f:
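    # Python 2 only: str.translate(None, chars) deletes the listed bytes.
    # Caveat: it works byte by byte, so the multi-byte UTF-8 characters in
    # `characters` (¿ » ¡) can also strip bytes from accented letters.
    text = f.read().lower().translate(None, characters)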
    aux = text.split()

  if (len(sys.argv)>3):
    with open(sys.argv[3], 'r') as f:
      remove = set(f.read().strip().split())
  else:
    remove = []

  counts = contar(aux)
  for r in remove:
    counts.pop(r, None)

  for word, count in counts.items():
    if count > int(sys.argv[2]):
      print word, count

if __name__ == '__main__':
    main()
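
For readers on Python 3, where str.translate() no longer accepts None plus a string of characters to delete, the same three ideas can be sketched as follows. This is only an illustration under that assumption, not part of the answer above: the names count_words and CHARACTERS are hypothetical, and the text is assumed to be already decoded (for example read with open(path, encoding='utf-8')).

```python
from collections import Counter

# Same punctuation set as in the answer above (hypothetical constant name).
CHARACTERS = '!?¿-.:;-,><=*»¡'


def count_words(text, ignore_words=()):
    """Count lower-cased words in `text`, skipping punctuation and ignored words."""
    # str.maketrans('', '', chars) builds a table that deletes every
    # character in `chars`; on Python 3 this works on code points, not bytes.
    table = str.maketrans('', '', CHARACTERS)
    counts = Counter(text.lower().translate(table).split())
    # Drop the ignored words after counting, as the answer suggests.
    for word in ignore_words:
        counts.pop(word, None)
    return counts
```

Calling count_words("hello hello hello good bye bye") gives the counts from the question's example (hello 3, good 1, bye 2). Popping after the count touches each ignored word once, instead of testing every word of the text against the ignore list.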

Answer 2

There are a few inefficiencies here. I've rewritten your code to take advantage of some optimizations. The reasoning for each change is in the comments and docstrings:

# -*- coding: utf-8 -*-
import sys
from collections import Counter


def contar(aux):
    """Here I replaced your hand made solution with the
    built-in Counter which is quite a bit faster.
    There's no real reason to keep this function, I left it to keep your code
    interface intact.
    """
    return Counter(aux)

def replace_special_chars(string, chars, replace_char=" "):
    """Replaces a set of characters by another character, a space by default
    """
    for c in chars:
        string = string.replace(c, replace_char)
    return string

def main():
    characters = '!?¿-.:;-,><=*»¡'
    aux = []
    counts = {}

    with open(sys.argv[1], "r") as f:
        # You were calling lower() once for every `word`. Now we only
        # call it once for the whole file:
        contents = f.read().strip().lower()
        contents = replace_special_chars(contents, characters)
        aux = contents.split()

    #Borrar del archivo
    if len(sys.argv) > 3:
        with open(sys.argv[3], "r") as f:
            # what you had here was very inefficient:
            # remove = "".join(c for c in f.read())
            # that iterates over every character and joins them back into a string.
            # this is a bit silly because it's identical to f.read():
            # "".join(c for c in f.read()) == f.read()
            # ignore_words is a `set` to allow for very fast membership checks
            ignore_words = set(f.read().strip().split())
            aux = (word for word in aux if word not in ignore_words)

    counts = contar(aux)

    for word, count in counts.items():
        if count > int(sys.argv[2]):
            print word, count


if __name__ == '__main__':
    main()
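
There is one more reason to build a filtered sequence, as the generator expression above does, rather than calling aux.remove(word) inside the loop as in the question: removing from a list while iterating over it shifts the remaining items, so the loop can skip words, and each remove() is itself a linear scan. The sketch below is only an illustration with hypothetical helper names (remove_in_place, filter_words) and a made-up sample sentence.

```python
def remove_in_place(words, ignore_words):
    """The question's approach: mutate the list while iterating over it."""
    words = list(words)  # work on a copy so the caller's list is untouched
    for word in words:
        if word in ignore_words:
            # remove() deletes the first match and shifts later items left,
            # so the loop steps over the item that slid into this position.
            words.remove(word)
    return words


def filter_words(words, ignore_words):
    """Build a new list instead of mutating the one being iterated."""
    ignore_words = set(ignore_words)  # constant-time membership on average
    return [word for word in words if word not in ignore_words]


sample = "the the cat the the dog".split()
print(remove_in_place(sample, ["the"]))  # some "the" entries survive
print(filter_words(sample, ["the"]))     # only "cat" and "dog" remain
```

With this sample, the in-place version leaves two of the four "the" entries behind, while the filtering version removes them all in a single pass.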

Answer 3

A few changes and reasoning:

Parse command line arguments under __name__ == '__main__': This keeps the code modular, because the script only reads command line arguments when it is run directly, not when main is imported from another script.
Use regex to filter out words with characters you don't want: A regex lets you state either which characters you DO want or which characters you DON'T want, whichever is shorter. Here, hardcoding every special character you don't want is tedious compared to declaring the characters you do want in a simple pattern. In the following script, I filter out words that are not alphanumeric using the pattern [a-zA-Z0-9]+ .
Ask for forgiveness rather than permission: Since the minimum count argument is optional, it will not always be present. The pythonic approach is a try/except block that attempts to read the minimum count from sys.argv[2] and catches the IndexError to default the minimum count to 0 .

Python script:

# sys
import sys
# regex
import re

def main(text_file, min_count):
    word_count = {}

    with open(text_file, 'r') as words:
        # Split on any whitespace (spaces and
        # line breaks) to get list of words
        words = words.read().split()

        # Filter words that are not alphanum
        # (ASCII letters and digits only)
        pattern = re.compile(r'^[a-zA-Z0-9]+$')
        words = filter(pattern.search,words)

        # Iterate through words and collect
        # count
        for word in words:
            if word in word_count:
                word_count[word] = word_count[word] + 1
            else:
                word_count[word] = 1

    # Iterate for output
    for word, count in word_count.items():
        if count > min_count:
            print('%s %s' % (word, count))

if __name__ == '__main__':
    # Get text file name
    text_file = sys.argv[1]

    # Attempt to get minimum count
    # from command line.
    # Default to 0
    try:
        min_count = int(sys.argv[2])
    except IndexError:
        min_count = 0

    main(text_file, min_count)

Text file:

hello hello hello good bye goodbye !bye bye¶ b?e goodbye

Command:

python script.py text.txt

Output:

bye 1
good 1
hello 3
goodbye 2

With minimum count command:

python script.py text.txt 2

Output:

hello 3
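
Note that the script discards a whole token as soon as it contains one unwanted character, which is why "!bye" and "bye¶" are not counted as "bye". A possible variant, sketched here as an assumption rather than as this answer's method, extracts the alphanumeric runs with re.findall instead. The names WORD_PATTERN and count_alnum_runs are hypothetical.

```python
import re
from collections import Counter

# Hypothetical pattern name: one or more ASCII letters or digits.
WORD_PATTERN = re.compile(r'[a-zA-Z0-9]+')


def count_alnum_runs(text, min_count=0):
    """Count alphanumeric runs in `text`; keep those seen more than `min_count` times."""
    counts = Counter(WORD_PATTERN.findall(text))
    return {word: n for word, n in counts.items() if n > min_count}


sample = "hello hello hello good bye goodbye !bye bye¶ b?e goodbye"
print(count_alnum_runs(sample))     # all runs with their counts
print(count_alnum_runs(sample, 2))  # only runs seen more than twice
```

With the sample text, "bye" is now counted three times, and "b?e" contributes the separate runs "b" and "e", so the choice between the two behaviours depends on how punctuation inside a token should be treated. The pattern is ASCII-only: accented letters, as in a Spanish text, would split words, and on Python 3 a pattern such as \w+ (which also matches underscores) may be a better fit.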

3 answers

solution1
2 2017-11-03 15:01:40

solution2
1 ACCEPTED 2017-11-03 14:59:39

solution3
1 2017-11-03 15:39:11
