How to count words (including those with accents) in a text file in Python?

Question

I would like to write a Python script that takes a file.txt as input and returns a list of words ordered by frequency. My problem is that my text is in French, so words like "préchauffer" are split at the accented letters and miscounted by my script (see below).

from collections import Counter
import re
from re import split
import io

def format_print(counter):
    lst = counter.items()
    lst.sort(key=lambda (a, b): (b, a))
    for word, count in lst:
        print "%-16s | %16d" % (word, count)

def count_words(filename):
    stop_words = frozenset(['le', 'la', 'des', 'et', 'des', 'dans', 'les', 'de', 'une', 'un',
     'se', 'sa'])
    text = io.open(filename, 'r', encoding='utf8').read()
    words = re.findall(r'\w+', text)
    cap_words = [word.upper() for word in words if word not in stop_words and len(word) > 1]
    word_counts = Counter(cap_words)
    return word_counts

format_print(count_words("extract.txt"))

Removing all the accents from my file.txt would be fine, but I haven't found a way to do that. Thanks a lot for the help.

Example text

étourdi, etourdi, étourdi, préchauffer

Results for the above text:

CHAUFFER         |                1
ETOURDI          |                1
PR               |                1
TOURDI           |                2

My expected results (not formatted here for brevity) would be:

the best one: ÉTOURDI 2, ETOURDI 1, PRÉCHAUFFER 1 (as Burhan Khalid's comment points out, "salé" and "sale" have different meanings, so it would be useful to distinguish them)
the "ok" one: ETOURDI 3, PRECHAUFFER 1

Answer 1

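If you want to normalize the accented strings (for example, étourdi becomes etourdi), you can use the very good unidecode module. Also pass the re.U flag to re.findall: without it, \w in Python 2 matches only ASCII letters, which is why your words were split at each accent.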

Example:

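import re
import unidecode  # third-party package, installed separately

text = u'étourdi, etourdi, étourdi, préchauffer'
words = re.findall(r'\w+', text, re.U)  # re.U lets \w match accented letters
cap_words = [unidecode.unidecode(word).upper() for word in words]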
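
For completeness, here is a rough sketch that produces both outcomes listed in the question using only the standard library. It swaps unidecode for unicodedata, which is a different technique: it removes combining accent marks and does not transliterate other characters the way unidecode does. The names strip_accents and keep_accents are made up for this illustration, and the code is written for Python 3, where \w already matches accented letters in str patterns.

```python
import re
import unicodedata
from collections import Counter


def strip_accents(word):
    # Hypothetical helper: decompose each character (é -> e + combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize('NFD', word)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_words(text, keep_accents=True):
    # Compose characters first so that an accented letter is a single code point
    # that \w can match.
    text = unicodedata.normalize('NFC', text)
    # re.UNICODE is the default for str patterns in Python 3; it is spelled out
    # here because it is the flag the original Python 2 script was missing.
    words = re.findall(r'\w+', text, re.UNICODE)
    if not keep_accents:
        words = [strip_accents(w) for w in words]
    return Counter(w.upper() for w in words)


text = 'étourdi, etourdi, étourdi, préchauffer'
with_accents = count_words(text)                         # the "best" result
without_accents = count_words(text, keep_accents=False)  # the "ok" result

for word, count in without_accents.most_common():
    print("%-16s | %16d" % (word, count))
```

With keep_accents=True the accented and unaccented spellings are counted separately, which keeps pairs like "salé" and "sale" apart. With keep_accents=False they are merged into one entry.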
