Dictionary comprehension counter and refactoring Python code

I'm teaching myself Python, and I've started refactoring my code to learn new and more efficient ways to write it.

I tried to build word_dict with a dictionary comprehension, but I couldn't find a way to do it. I ran into two problems:

  • I tried to express word_dict[word] += 1 inside the comprehension using word_dict[word]:=word_dict[word]+1
  • I wanted to check whether the element was already in the dictionary being built, using if word not in word_dict, and it didn't work.

The dictionary comprehension I tried is:

word_dict = {word_dict[word]:= 0 if word not in word_dict else word_dict[word]:= word_dict[word] + 1 for word in text_split}

Here is the code. It reads a text and counts the different words in it. If you know a better way to do it, please let me know.

import re

text = "hello Hello, water! WATER:HELLO. water , HELLO"

# clean the text: replace the punctuation characters with spaces
text_cleaned = re.sub(r':|!|,|\.', " ", text)
# Output 'hello Hello  water  WATER HELLO  water   HELLO'

# creates list without spaces elements
text_split = [element for element in text_cleaned.split(' ') if element != '']
# Output ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']

word_dict = {}

for word in text_split:
    if word not in word_dict:
        word_dict[word] = 0 
    word_dict[word] += 1

word_dict
# Output {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}
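On the dictionary comprehension itself: the attempt above cannot work as written, because word_dict is not bound until the whole comprehension has been evaluated, and the := operator can only assign to a plain name, not to a subscript such as word_dict[word]. Below is a minimal sketch of two alternatives, assuming the same text_split list as above. The names counts_comprehension and counts_loop are made up for illustration. The comprehension recounts each distinct word with list.count, which is simple but can be quadratic on long inputs; the loop is the original one shortened with dict.get.

```python
text_split = ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']

# Comprehension: one list.count() call per distinct word.
# dict.fromkeys() drops duplicates while keeping first-seen order.
counts_comprehension = {
    word: text_split.count(word) for word in dict.fromkeys(text_split)
}

# Loop: dict.get() supplies 0 for words that have not been seen yet,
# so the explicit "if word not in word_dict" check is no longer needed.
counts_loop = {}
for word in text_split:
    counts_loop[word] = counts_loop.get(word, 0) + 1

print(counts_comprehension)
# {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}
print(counts_loop)
# {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}
```

For anything beyond a short list, the single-pass loop is probably the safer choice of the two.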

Welcome to Python. The collections module ( https://docs.python.org/3/library/collections.html ) has a class called Counter, which looks like a good fit for your code. Would that work for you?

from collections import Counter
...
word_dict = Counter(text_split)
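To make the suggestion concrete, here is a small self-contained sketch, assuming text_split holds the list from the question. Counter is a dict subclass, so the result can be used wherever the hand-built word_dict was; the 'missing' key is only a hypothetical lookup to show the default behaviour.

```python
from collections import Counter

text_split = ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']
word_dict = Counter(text_split)

print(word_dict['water'])        # 2
print(word_dict['missing'])      # 0: absent keys count as zero, no KeyError
print(word_dict.most_common(2))  # [('water', 2), ('HELLO', 2)]
print(dict(word_dict))           # convert to a plain dict if one is needed
# {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}
```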

Right now you're using a regex to remove some undesirable characters, and then you split on whitespace to get a list of words. Why not use a regex to get the words right away? You can also take advantage of collections.Counter to create a dictionary, where the keys are words, and the associated values are counts/occurrences:

import re
from collections import Counter

text = "hello Hello, water! WATER:HELLO. water , HELLO"

pattern = r"\b\w+\b"

print(Counter(re.findall(pattern, text)))

Output:

Counter({'water': 2, 'HELLO': 2, 'hello': 1, 'Hello': 1, 'WATER': 1})

Here's what the regex pattern is composed of:

  • \b - represents a word boundary (will not be included in the match)
  • \w+ - one or more word characters: [a-zA-Z0-9_] for ASCII text (for str patterns, Python also matches Unicode letters and digits unless re.ASCII is set).
  • \b - another word boundary (will also not be included in the match)
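The pieces listed above can be tried out in isolation. The sketch below runs the same pattern on the question's text, then on a hypothetical accented sample (not from the question) to show how the re.ASCII flag changes what \w matches.

```python
import re

text = "hello Hello, water! WATER:HELLO. water , HELLO"
pattern = r"\b\w+\b"

words = re.findall(pattern, text)
print(words)
# ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']

# Hypothetical extra input, used only to illustrate the \w behaviour.
sample = "café naïve"
unicode_words = re.findall(pattern, sample)          # \w is Unicode-aware
ascii_words = re.findall(pattern, sample, re.ASCII)  # \w is [a-zA-Z0-9_] only
print(unicode_words)  # ['café', 'naïve']
print(ascii_words)    # ['caf', 'na', 've']
```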
