
Create dictionary from file

I am counting every word in a file.

I want to add up the counts for all of them, which means I need to remove the punctuation before and after each word.

Can someone help, please?

You could use a regex and simplify the whole lot:

import re

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # Split the text on runs of non-word characters.
    words = re.split(r'\W+', txt)
    words = {word: words.count(word) for word in set(words)}
    return words

From the docs:

\W  Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_]. If the LOCALE flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore.

+  Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

So \W+ will split on every run of characters other than a to z, A to Z, 0 to 9 and _. As suggested in the comments, this can be "language" sensitive (accented characters, for example). In that case, you can adapt the character class to your language by setting

words = re.split(r"[^a-zA-Z0-9_àéèêùç]+", txt)
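As a quick illustration of how splitting on \W+ behaves (a made-up sample string, not from the question):

```python
import re

# A hypothetical sample sentence, just for illustration.
sample = "Hello, world! It's a test -- really."
words = re.split(r"\W+", sample)
# re.split can yield empty strings at the edges when the text
# starts or ends with punctuation, so filter those out.
words = [w for w in words if w]
print(words)
# ['Hello', 'world', 'It', 's', 'a', 'test', 'really']
```

Note that contractions such as "It's" get split into two tokens, which may or may not be what you want.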

EDIT: To use Stef's suggestion, which is indeed faster:

import re
from collections import Counter

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # Counter builds the word -> count mapping in a single pass.
    words = re.split(r'\W+', txt)
    return Counter(words)
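For reference, Counter consumes any iterable directly and can report the most frequent entries, which is handy for word counts (a small, made-up example):

```python
from collections import Counter

# Counter counts hashable items from any iterable in one pass.
words = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]
counts = Counter(words)
# most_common(n) returns the n highest counts, sorted descending.
print(counts.most_common(2))
# [('the', 3), ('and', 2)]
```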

EDIT 2: Without any regex or other libraries, although this is not efficient:

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # Replace each separator with a space before splitting.
    split_on = {"'", ","}
    for separator in split_on:
        txt = txt.replace(separator, ' ')
    words = txt.split()
    dict_words = dict()
    for word in words:
        if word in dict_words:
            dict_words[word] += 1
        else:
            dict_words[word] = 1
    return dict_words
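The same loop can be written more compactly with dict.get, which supplies a default of 0 for unseen words (a sketch, not part of the original answer):

```python
def count_words(words):
    """Count word occurrences with dict.get, avoiding the if/else."""
    counts = {}
    for word in words:
        # get() returns 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts

print(count_words(["a", "b", "a", "c", "a"]))
# {'a': 3, 'b': 1, 'c': 1}
```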

Here are a few suggestions:

  • Use collections.Counter, which was designed specifically for this;
  • Use .strip() instead of .strip(' ') to strip all whitespace, including tabs and newlines, rather than just spaces;
  • Remove punctuation as shown in this answer.
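The punctuation-removal step works by building a translation table that deletes every ASCII punctuation character (a minimal example):

```python
import string

# str.maketrans('', '', chars) builds a table whose third argument
# lists characters to delete; string.punctuation is the ASCII set.
table = str.maketrans('', '', string.punctuation)
cleaned = "Hello, world! (Really.)".translate(table)
print(cleaned)
# Hello world Really
```

Note that string.punctuation only covers ASCII; typographic quotes or dashes would need to be added to the table by hand.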

With all this in mind, the code is only two lines long:

import collections
import string

with open('loremipsum.txt', 'r') as f:
  wordcounts = collections.Counter(f.read().lower().translate(str.maketrans('', '', string.punctuation)).split())

print(wordcounts)
# Counter({'eget': 11, 'vitae': 9, 'ut': 9, 'pellentesque': 9, 'sed': 8, 'pretium': 8, 'eu': 8, 'ipsum': 7, 'donec': 7, 'venenatis': 7, 'in': 7, 'lorem': 6, ..., 'scelerisque': 1})  (output truncated)

If you don't like one-liners, you can split the previous code into several lines:

import collections
import string

with open('loremipsum.txt', 'r') as f:
  text = f.read()
  lowertext = text.lower()
  without_punctuation = lowertext.translate(str.maketrans('', '', string.punctuation))
  words = without_punctuation.split()
  wordcounts = collections.Counter(words)

Finally, an alternative: read the file line by line instead of all at once:

import collections
import string

wordcounts = collections.Counter()
with open('loremipsum.txt', 'r') as f:
  for line in f:
    # split() with no argument avoids keeping the trailing newline.
    words = line.lower().translate(str.maketrans('', '', string.punctuation)).split()
    wordcounts.update(words)
