
Python Dictionary with a list of values for each key

I have two different text files. One has words and their frequencies and looks like:

word1<space>frequency

The second file has a word in the first position, followed by its associated features. It looks like:

word1<tab>feature1<tab>feature2................

Every word in the second file may have any number of features (ranging from 0-7 in my case).

For every word in file 1, I want all the features associated with it from file 2. I want to create a dictionary where the key is the word from file 1 and its corresponding value is a list of features obtained from file 2.
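So, as a made-up example of the structure I am after (the words and features here are just placeholders):

mydict = {'word1': ['feature1', 'feature2'], 'word2': ['feature3'], 'word3': []}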

Also, I want unique features and want to eliminate duplicates from file 2 (I have not implemented it yet).

I have the following code, but it gives the desired output only for the first word in file 1. mydict does contain all the other words from file 1, but they don't have any values associated with them.

mydict = dict()

with open('sample_word_freq_sorted.txt', 'r') as f1:
        data = f1.readlines()

with open('sample_features.txt', 'r') as f2:
        for item in data:
                root = item.split()[0]
                mylist = []
                for line in f2:
                        words = line.split()
                        if words[0] == root:
                                mylist.append(words[1:])
                mydict[root] = mylist

Also, the values for each key come out as several separate lists rather than one flat list, which is not what I want. Can someone please help me find the bug in my code?

mydict = dict()

with open('sample_word_freq_sorted.txt', 'r') as f1:
    data = {line.split()[0] for line in f1}

with open('sample_features.txt', 'r') as f2:
    for line in f2:
        parts = line.split()                 # split on any whitespace; the features file is tab-delimited
        word = parts[0]
        if word in data:
            mydict[word] = mydict.get(word, []) + parts[1:]
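Since the question also asks for unique features, one optional follow-up (just a sketch) is to dedupe each list afterwards while keeping the original order:

# dict.fromkeys keeps insertion order (Python 3.7+), so this drops duplicates without reordering
mydict = {word: list(dict.fromkeys(feats)) for word, feats in mydict.items()}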

I think the most robust way would be to use pandas and merge.

import pandas as pd

df1 = pd.read_csv('sample_word_freq_sorted.txt', delim_whitespace=True)
df2 = pd.read_csv('sample_features.txt', delimiter='\t')
df2 = df2.drop_duplicates()

df = df1.merge(df2, how='left', on='word')

Obviously that needs to be customized for the bits of your data not posted, but this would be much less prone to problems than trying to customize everything in a loop. It also handles your duplicate problem easily.

Whether this is the right solution also depends on what you want to do with the result - it may be that getting the dictionary version to work would be better in some situations.
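If the dictionary shape is what you ultimately need, here is one untested sketch of collapsing the merge result back into it. It assumes the merged frame has a 'word' and a 'frequency' column and one column per feature slot, padded with NaN where a word has fewer features:

feature_cols = [c for c in df.columns if c not in ('word', 'frequency')]

# Build word -> list of non-missing features from the merged frame.
mydict = {
    row['word']: [v for v in row[feature_cols] if pd.notna(v)]
    for _, row in df.iterrows()
}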

Edit: When your data has no column headers, you can let pandas just give them names, which will be integers starting with 0:

pd.read_csv(path, header=None)

Then you can use the integers (e.g. df[0] will reference the first column, named 0), change the headers later, for example by assigning directly to df.columns = ['foo', 'bar', 'baz'], or specify the headers when loading:

pd.read_csv(path, names=['foo', 'bar', 'baz'])
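For example, for the two files in the question that might look like the following; the column names here are just my own guesses, since neither file has a header row:

import pandas as pd

df1 = pd.read_csv('sample_word_freq_sorted.txt', delim_whitespace=True,
                  header=None, names=['word', 'frequency'])
# Up to 7 features per word; rows with fewer fields are padded with NaN.
df2 = pd.read_csv('sample_features.txt', sep='\t', header=None,
                  names=['word'] + ['feature%d' % i for i in range(1, 8)])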

A file is an iterator, meaning you can only iterate over it once:

>>> x = (i for i in range(3)) #example iterator
>>> for line in x:
    print(line)

0
1
2
>>> for line in x: #second time produces no results.
    print(line)

>>> 

So the inner loop for line in f2: only produces values the first time it is used (the first iteration of for item in data:). To fix this you can either do f2 = f2.readlines(), so you have a list that can be traversed more than once, or find a way to construct your dictionary with only one pass over f2.

Then you get a list of sublists because you .append() each list of words to mylist instead of .extend-ing it with the additional words, so just changing:

mylist.append(words[1:])

to

mylist.extend(words[1:])

That should fix the other issue you are having.
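Putting both fixes together, a minimal corrected sketch of the original code (using readlines() so the features file can be scanned more than once) would be:

mydict = dict()

with open('sample_word_freq_sorted.txt', 'r') as f1:
    data = f1.readlines()

with open('sample_features.txt', 'r') as f2:
    feature_lines = f2.readlines()   # a list can be traversed repeatedly, unlike the file object

for item in data:
    root = item.split()[0]
    mylist = []
    for line in feature_lines:
        words = line.split()
        if words[0] == root:
            mylist.extend(words[1:])  # extend keeps one flat list of features
    mydict[root] = mylist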


This seems like a case where collections.defaultdict would come in handy: instead of going over the file many times adding items for each specific word, the dict will automatically make an empty list for each new word. This would let you write your code something like this:

import collections
mydict = collections.defaultdict(list)

with open('sample_features.txt', 'r') as f2:
    for line in f2:
        tmp = line.split()
        root = tmp[0]
        words = tmp[1:]
        #in python 3+ we can use this notation instead of the above three lines:
        #root, *words = line.split()
        mydict[root].extend(words)

Although, since you want to keep only unique features, it would make more sense to use sets instead of lists, since they by definition only contain unique elements; then instead of .extend you would use .update:

import collections
mydict = collections.defaultdict(set)
   ....
        mydict[root].update(words)
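Spelled out in full (just a sketch combining the loop above with a set), that would be:

import collections

mydict = collections.defaultdict(set)

with open('sample_features.txt', 'r') as f2:
    for line in f2:
        root, *words = line.split()
        mydict[root].update(words)    # a set keeps each feature only once

# If plain lists are needed afterwards:
# mydict = {word: sorted(features) for word, features in mydict.items()}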
