
How to create a dictionary of lists with unique mentions from a tab-delimited csv file using boolean values

I have a big tab-delimited CSV file: the first column holds emotion words, the second holds one of eight basic emotions plus the values positive and negative, and the last column is a boolean flag that says whether the second column's value fits the first.

A snippet from the file:

snarl   anger   1
snarl   anticipation    0
snarl   disgust 1
snarl   fear    0
snarl   joy 0
snarl   negative    1
snarl   positive    0
snarl   sadness 0
snarl   surprise    0
snarl   trust   0
snarling    anger   1
snarling    anticipation    0
snarling    disgust 0
snarling    fear    0
snarling    joy 0
snarling    negative    1
snarling    positive    0
snarling    sadness 0
snarling    surprise    0
snarling    trust   0

My code so far to do this:

import csv
from pprint import pprint
from itertools import groupby

l = list(csv.reader(open('NRC-Emotion-Lexicon-Wordlevel-v0.92.txt')))
f = lambda x: x[-1] #manipulate number to see different results
{k:[tuple(x[0:1]) for x in v] for k,v in groupby(sorted(l[1:], key=f), f)}

pprint(l)

My current output is not that good looking:

['asylum\tanger\t0'],
 ['asylum\tanticipation\t0'],
 ['asylum\tdisgust\t0'],
 ['asylum\tfear\t1'],
 ['asylum\tjoy\t0'],
 ['asylum\tnegative\t1'],
 ['asylum\tpositive\t0'],
 ['asylum\tsadness\t0'],
 ['asylum\tsurprise\t0'],
 ['asylum\ttrust\t0'],

My question is: How do I create a dictionary of lists with one unique key for each of the repeated emotion words (reducing 10 repetitions to 1, each) and only include the second tab elements in the list of that dictionary key, when they have the boolean value of 1?

Any kind of help would be appreciated!

EDIT: as one of the replies stated, an example of the desired output would look like this:

{'snarl': ['anger', 'disgust'], #included in list due to having '1', ignoring 'positive' and 'negative'
 'snarling': ['anger'], #etc...
}

EDIT 2:

The first and the last lines of the file are empty, as I mentioned in the comments on the answers.

This is one approach, using defaultdict.

Example:

import csv
from collections import defaultdict

d = defaultdict(list)
with open(filename) as infile:
    reader = csv.reader(infile, delimiter="\t")
    for row in reader:
        if row[2] == '1':
            d[row[0]].append(row[1])
print(d)

Edit as per comment

from collections import defaultdict

d = defaultdict(list)
with open(filename) as infile:
    for row in infile:
        if row.strip():
            val = row.split()
            if val[2] == '1':
                d[val[0]].append(val[1])
print(d)
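
The blank-line handling above can be checked without the real lexicon file. The following is a minimal sketch, assuming an in-memory sample with empty first and last lines; the names sample and load_emotions are hypothetical and not part of the original answer. It splits on tabs only and skips any line that does not have exactly three fields.

```python
from collections import defaultdict
from io import StringIO

# Hypothetical stand-in for the lexicon file, with blank first and last lines.
sample = StringIO("\nsnarl\tanger\t1\nsnarl\tjoy\t0\nsnarling\tanger\t1\n\n")


def load_emotions(lines):
    """Map each word to the emotions flagged '1', skipping blank or short lines."""
    result = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # blank or malformed line
        word, emotion, flag = fields
        if flag == "1":
            result[word].append(emotion)
    return dict(result)


emotions = load_emotions(sample)
print(emotions)
```

Passing an open file object instead of sample should work the same way, since both yield one line per iteration.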

You can use collections.defaultdict and update a dictionary of lists while iterating a csv.reader object.

Your criterion goes in an if statement, taking care to convert the number to an integer via int.

import csv
from collections import defaultdict
from io import StringIO

x = StringIO("""snarl   anger   1
snarl   anticipation    0
...
snarling    surprise    0
snarling    trust   0""")

d = defaultdict(list)

# replace x with open('file.csv', 'r')
with x as fin:
    # filter(None, ...) drops the empty rows produced by blank lines
    reader = filter(None, csv.reader(fin, delimiter=' ', skipinitialspace=True))
    # or, reader = filter(None, csv.reader(fin, delimiter='\t'))
    for word, emotion, num in reader:
        if int(num):
            d[word].append(emotion)

Result:

print(d)

defaultdict(list,
            {'snarl': ['anger', 'disgust', 'negative'],
             'snarling': ['anger', 'negative']})

You were close to the answer, but when you invoked csv.reader you didn't specify a delimiter, so it defaulted to a comma.

>>> from itertools import groupby
>>> l = map(str.split, open('NRC-Emotion-Lexicon-Wordlevel-v0.92.txt').readlines())
>>> f = lambda x: x[1]
>>> {k:set(e[0] for e in v) for k,v in groupby(sorted(filter(bool, l), key=f), f)}
{'anger': {'snarling', 'snarl'}, 'anticipation': {'snarling', 'snarl'}, 'disgust': {'snarling', 'snarl'}, 'fear': {'snarling', 'snarl'}, 'joy': {'snarling', 'snarl'}, 'negative': {'snarling', 'snarl'}, 'positive': {'snarling', 'snarl'}, 'sadness': {'snarling', 'snarl'}, 'surprise': {'snarling', 'snarl'}, 'trust': {'snarling', 'snarl'}}
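
Note that the dictionary above is keyed by emotion and never looks at the flag column, so it does not yet match what the question asks for. A sketch that keeps groupby but keys on the word and keeps only rows flagged '1' might look like this; the in-memory lines list is an assumption standing in for the file, and the names rows, flagged and by_word are illustrative.

```python
from itertools import groupby
from operator import itemgetter

# Assumed sample standing in for the file contents, including blank edge lines.
lines = [
    "",
    "snarl\tanger\t1",
    "snarl\tanticipation\t0",
    "snarl\tdisgust\t1",
    "snarling\tanger\t1",
    "snarling\tjoy\t0",
    "",
]

# Drop blank lines, then split each remaining line into its three fields.
rows = [line.split("\t") for line in lines if line.strip()]

# Keep only flagged rows; groupby needs its input sorted by the grouping key.
flagged = sorted((r for r in rows if r[2] == "1"), key=itemgetter(0))

by_word = {
    word: [r[1] for r in group]
    for word, group in groupby(flagged, key=itemgetter(0))
}
print(by_word)
```

Because sorted is stable, the emotions for each word stay in file order.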

Here's how I would do it. You could also use collections.defaultdict if you wished (instead of setdefault):

import csv

with open('NRC-Emotion-Lexicon-Wordlevel-v0.92.txt', newline='') as file:
    l = [row[:-1] for row in csv.reader(file, delimiter='\t')
            if row and row[-1] == '1']  # Not empty and last elem is true.

d = {}
for e_word, basic in l:
    d.setdefault(e_word, []).append(basic)

print('dictionary d:\n', d)

Output:

dictionary d:
 {'snarl': ['anger', 'disgust', 'negative'], 'snarling': ['anger', 'negative']}
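
The desired output in the question leaves out 'positive' and 'negative', whereas the result above keeps them. One possible way to drop them, sketched here on an assumed in-memory sample (the names SENTIMENTS, sample and basic_only are hypothetical), is to add a membership test to the same condition:

```python
import csv
from io import StringIO

# Labels to leave out, per the desired output shown in the question.
SENTIMENTS = {"positive", "negative"}

# Assumed sample standing in for the lexicon file.
sample = StringIO(
    "snarl\tanger\t1\n"
    "snarl\tdisgust\t1\n"
    "snarl\tnegative\t1\n"
    "snarl\tjoy\t0\n"
    "snarling\tanger\t1\n"
    "snarling\tnegative\t1\n"
)

basic_only = {}
for row in csv.reader(sample, delimiter="\t"):
    # Require three fields, a true flag, and a label that is not a sentiment.
    if len(row) == 3 and row[2] == "1" and row[1] not in SENTIMENTS:
        basic_only.setdefault(row[0], []).append(row[1])

print(basic_only)
```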

