
How to count words (including those with accents) in a text file in Python?

I would like to write a Python script that takes a file.txt as input and returns a list of words ordered by frequency. My problem is that my text is in French, so words like "préchauffer" are split and counted incorrectly by the script below.

from collections import Counter
import re
from re import split
import io

def format_print(counter):
    lst = counter.items()
    lst.sort(key=lambda (a, b): (b, a))
    for word, count in lst:
        print "%-16s | %16d" % (word, count)

def count_words(filename):
    stop_words = frozenset(['le', 'la', 'des', 'et', 'des', 'dans', 'les', 'de', 'une', 'un',
     'se', 'sa'])
    text = io.open(filename, 'r', encoding='utf8').read()
    words = re.findall(r'\w+', text)
    cap_words = [word.upper() for word in words if word not in stop_words and len(word) > 1]
    word_counts = Counter(cap_words)
    return word_counts

format_print(count_words("extract.txt"))

It would be no problem to remove all the accents from my file.txt, but I haven't found a way to do this. Thanks a lot for the help.

Example text

étourdi, etourdi, étourdi, préchauffer

Results for the above text:

CHAUFFER         |                1
ETOURDI          |                1
PR               |                1
TOURDI           |                2

My expected results (not formatted here for brevity) would be

  • the best one: ÉTOURDI 2, ETOURDI 1, PRÉCHAUFFER 1 (as Burhan Khalid's comment points out, "salé" and "sale" have different meanings, so it would be useful to differentiate them)
  • the "ok" one: ETOURDI 3, PRECHAUFFER 1

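If you want to normalize the accented strings (for example, étourdi becomes etourdi), you can use the unidecode module. Note the re.U flag in the example: without it, Python 2's \w matches only ASCII characters, which is why the accented words get split.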

Example:

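# -*- coding: utf-8 -*-
import re
import unidecode  # third-party package: pip install unidecode
text = u'étourdi, etourdi, étourdi, préchauffer'
words = re.findall(r'\w+', text, re.U)  # re.U lets \w match accented letters
cap_words = [unidecode.unidecode(word).upper() for word in words]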
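For readers on Python 3, here is a minimal sketch of both outcomes the question asks for, using only the standard library. It swaps unidecode for unicodedata-based accent stripping, which is a different technique from the one above. The names strip_accents, count_words and keep_accents are hypothetical helpers introduced for illustration, and the stop-word list is copied from the question.

```python
import re
import unicodedata
from collections import Counter

# Stop words taken from the question (duplicate 'des' removed).
STOP_WORDS = frozenset(['le', 'la', 'des', 'et', 'dans', 'les',
                        'de', 'une', 'un', 'se', 'sa'])

def strip_accents(word):
    # Hypothetical helper: decompose accented letters into base letter
    # plus combining mark (NFD), then drop the combining marks.
    decomposed = unicodedata.normalize('NFD', word)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def count_words(text, keep_accents=True):
    # Compose characters first (NFC) so that "e" followed by a combining
    # accent becomes a single letter that \w can match.
    text = unicodedata.normalize('NFC', text)
    # In Python 3, \w matches Unicode letters by default for str patterns.
    words = re.findall(r'\w+', text)
    counts = Counter()
    for word in words:
        # Assumption: stop words are compared case-insensitively.
        if word.lower() in STOP_WORDS or len(word) <= 1:
            continue
        if not keep_accents:
            word = strip_accents(word)
        counts[word.upper()] += 1
    return counts

text = 'étourdi, etourdi, étourdi, préchauffer'
print(count_words(text))                      # accents preserved
print(count_words(text, keep_accents=False))  # accents removed
```

With keep_accents=True this gives the "best" result from the question (ÉTOURDI 2, ETOURDI 1, PRÉCHAUFFER 1); with keep_accents=False it gives the "ok" one (ETOURDI 3, PRECHAUFFER 1). One limitation of this sketch: ligatures such as œ have no Unicode decomposition, so strip_accents leaves them unchanged.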
