
Create dictionary from file

I am currently counting every word in the file.

I want the counts for the same word to be added together, which means I need to remove the punctuation before and after each word.

Can someone help please?

You could use a regex and simplify the whole thing:

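import re

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # Split on runs of non-word characters; drop the empty strings left at the edges.
    words = [word for word in re.split(r'\W+', txt) if word]
    # Count each distinct word (simple, but quadratic in the number of words).
    return {word: words.count(word) for word in set(words)}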

From the documentation:

\W Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_]. If the LOCALE flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore.

+ Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

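So \W+ splits on every run of non-word characters. With the ASCII flag, that means anything other than a to z, A to Z, 0 to 9 and _; without it, Python 3 also treats accented letters and letters from other alphabets as word characters. As suggested in the comments, the result can therefore be "language" sensitive. If you want to control exactly which characters count as part of a word, you can adapt this code to your language by setting: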

words = re.split(r'[^a-zA-Z0-9_àéèêùç]+', txt)
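To make the splitting behaviour concrete, here is a small sketch on a made-up sample string (the string and variable names are only for illustration). It shows the empty string that re.split leaves when the text ends with punctuation, and how the ASCII flag changes what counts as a word character.

```python
import re

sample = "Hello, world! Ça va? Hello."

# The text ends with punctuation, so re.split leaves an empty string at the end.
split_words = re.split(r'\W+', sample)

# re.findall(r'\w+') collects the words directly and never yields empty strings.
# In Python 3, \w is Unicode-aware for str patterns, so 'Ça' stays whole.
unicode_words = re.findall(r'\w+', sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so 'Ç' acts as a separator.
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)

print(split_words)
print(unicode_words)
print(ascii_words)
```

Using re.findall(r'\w+', txt) instead of re.split is one way to avoid filtering out empty strings afterwards.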

EDIT: Using Stef's suggestion, which is indeed faster:

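import re
from collections import Counter

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # Split on runs of non-word characters; drop the empty strings left at the edges.
    words = [word for word in re.split(r'\W+', txt) if word]
    # Counter tallies all words in a single pass.
    return Counter(words)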
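A Counter is a dict subclass, so the returned object can be used like the dictionary from the first version, with a few extras. A small sketch with a made-up word list:

```python
from collections import Counter

counts = Counter(['a', 'b', 'a', 'c', 'a', 'b'])

# The n most frequent items, as (word, count) pairs.
top_two = counts.most_common(2)

# Looking up a word that never appeared returns 0 instead of raising KeyError.
missing = counts['zzz']

# Convert to a plain dict if a plain dict is really required.
as_dict = dict(counts)

print(top_two, missing, as_dict)
```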

EDIT 2: Without any regex or other libraries, although this is not efficient:

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # Replace each separator with a space so that split() breaks on it.
    split_on = {"'", ","}
    for separator in split_on:
        txt = txt.replace(separator, ' ')
    words = txt.split()
    dict_words = dict()
    # Iterate over every occurrence (not set(words)) so repeats are counted.
    for word in words:
        if word in dict_words:
            dict_words[word] += 1
        else:
            dict_words[word] = 1

    return dict_words
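The question asks specifically about punctuation before and after each word. One possible way to remove only that, still without regex, is str.strip(string.punctuation), which leaves punctuation inside a word (such as an apostrophe) untouched. This is a minimal sketch; count_words is a hypothetical helper name and the sample sentence is made up.

```python
import string

def count_words(text):
    """Count words, stripping punctuation only at the edges of each word."""
    counts = {}
    for token in text.lower().split():
        word = token.strip(string.punctuation)
        if not word:
            # The token was pure punctuation, e.g. "--".
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts

result = count_words("Don't stop -- don't, (really) stop!")
print(result)
```

Unlike replacing every apostrophe with a space, this keeps "don't" as a single word.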

Here are a few suggestions:

  • Use collections.Counter which was designed specifically for this;
  • Use .strip() instead of .strip(' ') to strip all whitespace, including tabs and newlines, rather than just spaces;
  • Remove punctuation according to this answer.

With all this in mind, the code is only two lines long:

import collections
import string

with open('loremipsum.txt', 'r') as f:
  wordcounts = collections.Counter(f.read().lower().translate(str.maketrans('', '', string.punctuation)).split())

print(wordcounts)
# Counter({'eget': 11, 'vitae': 9, 'ut': 9, 'pellentesque': 9, 'sed': 8, 'pretium': 8, 'eu': 8, 'ipsum': 7, 'donec': 7, 'venenatis': 7, 'in': 7, 'lorem': 6, 'sit': 6, 'amet': 6, 'non': 6, 'a': 6, 'enim': 6, 'vestibulum': 6, 'at': 6, 'id': 5, 'et': 5, 'blandit': 5, 'risus': 5, 'tincidunt': 5, 'nibh': 5, 'vulputate': 5, 'ligula': 5, 'quam': 5, 'porttitor': 5, 'lacus': 5, 'vel': 5, 'dolor': 4, 'consectetur': 4, 'elit': 4, 'tortor': 4, 'malesuada': 4, 'mollis': 4, 'sapien': 4, 'est': 4, 'faucibus': 4, 'integer': 4, 'justo': 4, 'tellus': 4, 'quis': 4, 'purus': 4, 'aliquet': 4, 'posuere': 4, 'nisi': 4, 'euismod': 4, 'tempor': 4, 'cras': 4, 'curabitur': 4, 'placerat': 4, 'vehicula': 4, 'nec': 4, 'suscipit': 4, 'augue': 4, 'dapibus': 4, 'finibus': 3, 'efficitur': 3, 'facilisis': 3, 'eros': 3, 'nulla': 3, 'ullamcorper': 3, 'dui': 3, 'nisl': 3, 'eleifend': 3, 'magna': 3, 'consequat': 3, 'arcu': 3, 'sagittis': 3, 'aliquam': 3, 'sem': 3, 'felis': 3, 'condimentum': 3, 'metus': 3, 'phasellus': 3, 'velit': 3, 'mi': 3, 'congue': 3, 'maecenas': 3, 'gravida': 3, 'viverra': 3, 'cursus': 3, 'nullam': 3, 'molestie': 3, 'odio': 3, 'interdum': 3, 'massa': 3, 'libero': 3, 'etiam': 3, 'accumsan': 3, 'porta': 3, 'adipiscing': 2, 'proin': 2, 'lectus': 2, 'rutrum': 2, 'mauris': 2, 'rhoncus': 2, 'feugiat': 2, 'dictum': 2, 'nunc': 2, 'semper': 2, 'per': 2, 'sollicitudin': 2, 'volutpat': 2, 'leo': 2, 'suspendisse': 2, 'nam': 2, 'hendrerit': 2, 'erat': 2, 'ex': 2, 'laoreet': 2, 'ac': 2, 'imperdiet': 2, 'ante': 2, 'lacinia': 2, 'fringilla': 2, 'morbi': 2, 'varius': 1, 'lobortis': 1, 'pulvinar': 1, 'mattis': 1, 'class': 1, 'aptent': 1, 'taciti': 1, 'sociosqu': 1, 'ad': 1, 'litora': 1, 'torquent': 1, 'conubia': 1, 'nostra': 1, 'inceptos': 1, 'himenaeos': 1, 'iaculis': 1, 'luctus': 1, 'dignissim': 1, 'potenti': 1, 'egestas': 1, 'fusce': 1, 'turpis': 1, 'tempus': 1, 'praesent': 1, 'pharetra': 1, 'vivamus': 1, 'ultrices': 1, 'maximus': 1, 'commodo': 1, 'ultricies': 1, 'elementum': 1, 'fames': 1, 
# 'primis': 1, 'tristique': 1, 'diam': 1, 'scelerisque': 1})
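To see what the translate step does on its own, here is a short sketch on a made-up sentence. str.maketrans('', '', string.punctuation) builds a table that deletes every ASCII punctuation character, including apostrophes inside words.

```python
import string

# Translation table that maps every ASCII punctuation character to "deleted".
table = str.maketrans('', '', string.punctuation)

cleaned = "Hello, world! Don't panic.".lower().translate(table)
words = cleaned.split()

print(cleaned)
print(words)
```

Note that "don't" becomes "dont" with this approach, which may or may not be what you want for your text.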

If you don't like one-liners, you can split the previous code into several lines:

import collections
import string

with open('loremipsum.txt', 'r') as f:
  text = f.read()
  lowertext = text.lower()
  without_punctuation = lowertext.translate(str.maketrans('', '', string.punctuation))
  words = without_punctuation.split()
  wordcounts = collections.Counter(words)

Finally, an alternative: read the file line by line instead of all at once:

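import collections
import string

# Build the punctuation-removal table once rather than on every line.
table = str.maketrans('', '', string.punctuation)
wordcounts = collections.Counter()
with open('loremipsum.txt', 'r') as f:
  for line in f:
    # split() with no argument also discards the trailing newline and empty strings.
    words = line.lower().translate(table).split()
    wordcounts.update(words)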
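One detail matters when splitting line by line: each line read from a file still ends with a newline. A quick check of how split() and split(' ') differ on such a line (the sample string is made up for illustration):

```python
line = "lorem  ipsum dolor\n"

# Splitting on a single space keeps empty strings and the trailing newline.
by_space = line.split(' ')

# Splitting with no argument treats any run of whitespace as one separator.
by_whitespace = line.split()

print(by_space)
print(by_whitespace)
```

With split(' '), 'dolor\n' and 'dolor' would be counted as two different words, so the no-argument form is the safer choice here.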


 