Convert file to dictionary to count occurrences

I have a file where the content looks as follows:

eng word1
eng word2
eng word3
ita word1
ita word2
fra word1
...

I want to count the number of occurrences of each word in every language. To do that, I want to read the file into a dict. This is my attempt:

data = open('file', 'r', encoding='utf8')
for line in data:
    lang = line[:3]
    ipa_string = line[3:]
    lang_and_string_dict[lang] = []
    lang_and_string_dict[lang].append(ipa_string)
print(lang_and_string_dict)

This gives me a dict with the right keys, but each key holds only the last of the words. For example, for English:

{'eng':[word1]}

Well, on each iteration you assign an empty list as the value:

data = open('file', 'r', encoding='utf8')
for line in data:
    lang = line[:3]
    ipa_string = line[3:]
    lang_and_string_dict[lang] = []  # replaces any existing list with an empty one
    lang_and_string_dict[lang].append(ipa_string)
print(lang_and_string_dict)

As a result, the old list containing the previous occurrence is lost. You should only create a list if no such element exists already, like:

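lang_and_string_dict = {}
data = open('file', 'r', encoding='utf8')
for line in data:
    lang = line[:3]
    ipa_string = line[3:]
    # only create the list the first time this language is seen
    if lang not in lang_and_string_dict:
        lang_and_string_dict[lang] = []
    lang_and_string_dict[lang].append(ipa_string)
print(lang_and_string_dict)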

Since this pattern is rather common, you can use a defaultdict as well:

from collections import defaultdict

lang_and_string_dict = defaultdict(list)
with open('file', 'r', encoding='utf8') as data:
    for line in data:
        lang = line[:3]
        ipa_string = line[3:]
        lang_and_string_dict[lang].append(ipa_string)
print(lang_and_string_dict)

A defaultdict is a subclass of dict that uses a factory (here list) when a key is missing. So each time a key that is not in the dictionary is queried, a new list is constructed.

You can later convert such a defaultdict to a dict with dict(lang_and_string_dict).
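As a small illustration of the behaviour described above, here is a sketch that uses an in-memory list of lines instead of the real file (the sample data and the names words_by_lang, plain and counts are made up for the example). It also shows one possible way to get from the lists to per-language totals:

```python
from collections import defaultdict

# Sample lines standing in for the file contents (assumed data).
lines = ["eng word1", "eng word2", "ita word1"]

words_by_lang = defaultdict(list)
for line in lines:
    lang, word = line.split()
    # A missing key triggers list(), so append works the first time a language is seen.
    words_by_lang[lang].append(word)

# Convert to an ordinary dict; missing keys raise KeyError again from here on.
plain = dict(words_by_lang)

# Number of words collected per language.
counts = {lang: len(words) for lang, words in plain.items()}

print(plain)   # {'eng': ['word1', 'word2'], 'ita': ['word1']}
print(counts)  # {'eng': 2, 'ita': 1}
```

Note that merely reading a missing key from a defaultdict also inserts it, which is one reason converting to a plain dict afterwards can be useful.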

Furthermore, when you open(..) files, it is better to do so in a with block: if an exception is raised, for example, the file is still properly closed.
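To see the point about with blocks in action, here is a minimal sketch. It writes to a throwaway temporary file (not the file from the question) and raises a simulated error while the file is open; the variable closed_after_error is only there for the demonstration:

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained.
fd, path = tempfile.mkstemp(text=True)
os.close(fd)

try:
    with open(path, 'w', encoding='utf8') as f:
        f.write('eng word1\n')
        # Simulate something going wrong while the file is still open.
        raise ValueError('simulated failure')
except ValueError:
    pass

# The with block closed the file even though the body raised.
closed_after_error = f.closed
print(closed_after_error)  # True

os.remove(path)
```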

A simple approach is a dict whose keys are languages and whose values are Counters of word occurrences:

from collections import Counter, defaultdict

lang_and_string_dict = defaultdict(Counter)
with open('file', 'r', encoding='utf8') as f:
    for line in f:
        lang, word = line.split()
        lang_and_string_dict[lang].update([word])


print(lang_and_string_dict)

Output:

defaultdict(<class 'collections.Counter'>, {'eng': Counter({'word1': 1, 'word2': 1, 'word3': 1}), 'ita': Counter({'word1': 1, 'word2': 1}), 'fra': Counter({'word1': 1})})

Keep in mind that the line lang, word = line.split() raises a ValueError if a line in the file doesn't consist of exactly two whitespace-separated fields (lang word), for instance a blank line or a line with extra words. Exception handling or a format check is suggested.
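One possible way to add such a check is sketched below. The function name count_words and the choice to collect malformed lines in a list (rather than raise or log) are assumptions made for the example, and the sample lines stand in for the file:

```python
from collections import Counter, defaultdict

def count_words(lines):
    """Count words per language, skipping lines that are not 'lang word'."""
    counts = defaultdict(Counter)
    skipped = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            # Blank line, missing word, or extra fields: remember it and move on.
            skipped.append(line)
            continue
        lang, word = parts
        counts[lang][word] += 1
    return counts, skipped

# Sample input with three malformed lines (assumed data).
sample = ["eng word1", "eng word1", "", "ita", "fra word1 extra", "ita word2"]
counts, skipped = count_words(sample)

print(dict(counts))  # {'eng': Counter({'word1': 2}), 'ita': Counter({'word2': 1})}
print(skipped)       # ['', 'ita', 'fra word1 extra']
```

With a real file you could pass the open file object directly, since iterating over it yields lines.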

Another workaround would be using collections.Counter. It returns the number of words under each category:

from collections import Counter

words = []
with open('file') as f:
    for line in f:
        words.append(line.split()[0])

print(Counter(words))
# Counter({'eng': 3, 'ita': 2, 'fra': 1})

To get the count of each word under each category (this assumes the lines for each language are grouped together, as in the example file):

from collections import Counter

words = []
with open('file.txt') as f:
    lines = f.readlines()
    prev = lines[0].split()[0]
    for line in lines:
        splitted = line.split()
        if splitted[0] != prev:
            print('{} -> {}'.format(prev, Counter(words)))
            prev = splitted[0]
            words = []
        words.append(line.split()[1])

print('{} -> {}'.format(prev, Counter(words)))

# eng -> Counter({'word1': 1, 'word2': 1, 'word3': 1})
# ita -> Counter({'word1': 1, 'word2': 1})                         
# fra -> Counter({'word1': 1})                            
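The loop above tracks the previous language by hand, so it relies on each language's lines being contiguous. As an alternative sketch (not part of the original answer), itertools.groupby expresses the same grouping, and sorting first removes the ordering assumption. The sample lines and the name per_lang are made up for the example:

```python
from collections import Counter
from itertools import groupby

# Sample lines standing in for the file contents (assumed data).
lines = ["eng word1", "ita word1", "eng word2", "eng word1", "fra word1"]

pairs = [line.split() for line in lines]
# groupby only merges adjacent items, so sort by language first.
pairs.sort(key=lambda pair: pair[0])

per_lang = {
    lang: Counter(word for _, word in group)
    for lang, group in groupby(pairs, key=lambda pair: pair[0])
}

for lang, counter in per_lang.items():
    print('{} -> {}'.format(lang, counter))
```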

A similar solution to @shahaf's, but using defaultdict(int) instead of Counter.

I also use csv.DictReader to make the logic clearer.

from collections import defaultdict
import csv
from io import StringIO

mystr = StringIO("""eng word1
eng word2
eng word3
eng word1
ita word1
ita word2
ita word2
fra word1""")

d = defaultdict(lambda: defaultdict(int))

# replace mystr with open('file.csv', 'r')
with mystr as fin:
    reader = csv.DictReader(fin, delimiter=' ', fieldnames=['language', 'word'])
    for line in reader:
        d[line['language']][line['word']] += 1

print(d)

defaultdict({'eng': defaultdict(int, {'word1': 2, 'word2': 1, 'word3': 1}),
             'ita': defaultdict(int, {'word1': 1, 'word2': 2}),
             'fra': defaultdict(int, {'word1': 1})})
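If you would rather end up with ordinary nested dicts (for cleaner printing, for example), one way is a dict comprehension over the result. This is a sketch with a small hand-built input; the name plain is an assumption for the example:

```python
from collections import defaultdict

# Build a small nested defaultdict the same way as above (assumed sample data).
d = defaultdict(lambda: defaultdict(int))
for lang, word in [("eng", "word1"), ("eng", "word1"), ("ita", "word2")]:
    d[lang][word] += 1

# Convert both the outer and the inner defaultdicts to plain dicts.
plain = {lang: dict(words) for lang, words in d.items()}

print(plain)  # {'eng': {'word1': 2}, 'ita': {'word2': 1}}
```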


 