
Create dictionary from file

I am counting every word in a file.

I want to add up the counts for all of them, which means I need to remove the punctuation before and after each word.

Can someone help, please?

You could use a regex and simplify the whole lot:

import re

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # Split the text on runs of non-word characters.
    words = re.split(r'\W+', txt)
    words = {word: words.count(word) for word in set(words)}
    return words

From the docs:

\W  Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_]. If the LOCALE flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore.

+  Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

So \W+ will split on every run of characters other than a to z, A to Z, 0 to 9 and _. As suggested in the comments, this can be "language" sensitive (accented characters, for example). In that case, you can adapt the character class to your language by setting

words = re.split(r"[^a-zA-Z0-9_àéèêùç]+", txt)
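As a quick illustration of how splitting on \W+ behaves (a made-up sample string, not from the question):

```python
import re

# A hypothetical sample sentence, just for illustration.
sample = "Hello, world! It's a test -- really."
words = re.split(r"\W+", sample)
# re.split can yield empty strings at the edges when the text
# starts or ends with punctuation, so filter those out.
words = [w for w in words if w]
print(words)
# ['Hello', 'world', 'It', 's', 'a', 'test', 'really']
```

Note that contractions such as "It's" get split into two tokens, which may or may not be what you want.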

EDIT: To use Stef's suggestion, which is indeed faster:

import re
from collections import Counter

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # Counter builds the word -> count mapping in a single pass.
    words = re.split(r'\W+', txt)
    return Counter(words)
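For reference, Counter consumes any iterable directly and can report the most frequent entries, which is handy for word counts (a small, made-up example):

```python
from collections import Counter

# Counter counts hashable items from any iterable in one pass.
words = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]
counts = Counter(words)
# most_common(n) returns the n highest counts, sorted descending.
print(counts.most_common(2))
# [('the', 3), ('and', 2)]
```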

EDIT 2: Without any regex or other libraries, although this is not efficient:

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # Replace each separator with a space before splitting.
    split_on = {"'", ","}
    for separator in split_on:
        txt = txt.replace(separator, ' ')
    words = txt.split()
    dict_words = dict()
    for word in words:
        if word in dict_words:
            dict_words[word] += 1
        else:
            dict_words[word] = 1
    return dict_words
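The same loop can be written more compactly with dict.get, which supplies a default of 0 for unseen words (a sketch, not part of the original answer):

```python
def count_words(words):
    """Count word occurrences with dict.get, avoiding the if/else."""
    counts = {}
    for word in words:
        # get() returns 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts

print(count_words(["a", "b", "a", "c", "a"]))
# {'a': 3, 'b': 1, 'c': 1}
```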

Here are a few suggestions:

  • Use collections.Counter, which was designed specifically for this;
  • Use .strip() instead of .strip(' ') to strip all whitespace, including tabs and newlines, rather than just spaces;
  • Remove punctuation as shown in this answer.
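The punctuation-removal step works by building a translation table that deletes every ASCII punctuation character (a minimal example):

```python
import string

# str.maketrans('', '', chars) builds a table whose third argument
# lists characters to delete; string.punctuation is the ASCII set.
table = str.maketrans('', '', string.punctuation)
cleaned = "Hello, world! (Really.)".translate(table)
print(cleaned)
# Hello world Really
```

Note that string.punctuation only covers ASCII; typographic quotes or dashes would need to be added to the table by hand.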

With all this in mind, the code is only two lines long:

import collections
import string

with open('loremipsum.txt', 'r') as f:
  wordcounts = collections.Counter(f.read().lower().translate(str.maketrans('', '', string.punctuation)).split())

print(wordcounts)
# Counter({'eget': 11, 'vitae': 9, 'ut': 9, 'pellentesque': 9, 'sed': 8, 'pretium': 8, 'eu': 8, 'ipsum': 7, 'donec': 7, 'venenatis': 7, 'in': 7, 'lorem': 6, ..., 'scelerisque': 1})  (output truncated)

If you don't like one-liners, you can split the previous code into several lines:

import collections
import string

with open('loremipsum.txt', 'r') as f:
  text = f.read()
  lowertext = text.lower()
  without_punctuation = lowertext.translate(str.maketrans('', '', string.punctuation))
  words = without_punctuation.split()
  wordcounts = collections.Counter(words)

Finally, an alternative: read the file line by line instead of all at once:

import collections
import string

wordcounts = collections.Counter()
with open('loremipsum.txt', 'r') as f:
  for line in f:
    # split() with no argument avoids keeping the trailing newline.
    words = line.lower().translate(str.maketrans('', '', string.punctuation)).split()
    wordcounts.update(words)
