Python KeyError when parsing a dictionary

How can I join these two text documents?

document 1:

1000001 10:0.471669 250:0.127552 30:0.218773 64:0.249413
1000002 130:0.0839656 107:0.185613 30:0.446355 110:0.38011
1000003 1:0.0835855 1117:0.0647112 302:0.0851354 46:0.0601825 48:0.098907 516:0.167713

document 2:

1000001 161:0.115664 207:0.136537 294:0.0974809 301:0.199868
1000002
1000003 555:0.0585849 91:0.0164101

result:

1000001 10:0.471669 250:0.127552 30:0.218773 64:0.249413 161:0.115664 207:0.136537 294:0.0974809 301:0.199868
1000002 130:0.0839656 107:0.185613 30:0.446355 110:0.38011
1000003 1:0.0835855 1117:0.0647112 302:0.0851354 46:0.0601825 48:0.098907 516:0.167713 555:0.0585849 91:0.0164101

explanation:
Document 1 and document 2 have the same structure and the same number of lines.
Each line starts with a number (the same number in both documents), followed by several items, each made up of a number, a colon, and a decimal number:
example: 10:0.471669
These item combinations are unique. What I want to do is merge them: take the items from each line of the second document and append them to the corresponding line of the first document.
note:
The initial number and the items are separated from one another by a single space.

update

Here is my attempt:

dat1 = {}
with open('doc1') as f:
    for line in f.readlines():
        dat1[line.split(' ')[0]] = line.strip().split(' ')[1:]

dat2 = {}
with open('doc2') as f:
    for line in f.readlines():
        key = line.split(' ')[0]
        dat2[key] = line.split(' ')[1:]

for key in dat1.keys():
    print("%s %s %s" % (key, str.join(' ', dat1[key]), str.join(' ', dat2[key])))

I get a KeyError traceback on lines of the second document that have nothing to add to the first document. This is the case for the second line of the second document in the example above.
How can I avoid this exception, i.e. skip the lines that have only the key and nothing else to add?

An easier way might be to use a defaultdict of lists:

from collections import defaultdict

data = defaultdict(list)

for filename in 'doc1', 'doc2':
    with open(filename) as f:
        for line in f:
            key, _, value = line.partition(' ')
            data[key.strip()].append(value.strip())

for key in sorted(data):
    print key, ' '.join(data[key])    # Python 2
#    print(key, *data[key])            # Python 3

To print the result, you could add:

from __future__ import print_function

to the top of your file. The Python 3 print() function will then be available in Python 2, i.e. you can use the Python 3 print above.


You asked in a comment how to print to a file (Python 3, or Python 2 after importing print_function):

with open('outfile.txt', 'w') as f:
    for key in sorted(data):
        print(key, *data[key], file=f)
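
One side effect of the defaultdict approach above: a line that holds only a key contributes an empty string, so its output line ends with a trailing space. A minimal sketch that skips empty values, assuming in-memory io.StringIO objects as stand-ins for doc1 and doc2 and shortened sample data:

```python
import io
from collections import defaultdict

# In-memory stand-ins for doc1 and doc2 (assumption: shortened sample data).
doc1 = io.StringIO("1000001 10:0.471669\n1000002 130:0.0839656\n")
doc2 = io.StringIO("1000001 161:0.115664\n1000002\n")

data = defaultdict(list)
for f in (doc1, doc2):
    for line in f:
        key, _, value = line.partition(' ')
        items = data[key.strip()]  # creates the entry even for a key-only line
        value = value.strip()
        if value:                  # nothing to append for a key-only line
            items.append(value)

out = io.StringIO()                # stand-in for the output file
for key in sorted(data):
    print(key, *data[key], file=out)

print(out.getvalue(), end='')
```

With real files, the StringIO objects would simply be replaced by open(...) calls as in the code above.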

The problem is with newline characters.

At the end of each line in the file there is a newline character, which ends up in the last entry of each line. The exception occurs because dat1 will have a key "1000002" and dat2 will have a key "1000002\n".

If you add line = line.strip() before parsing, the code should work as expected.

for line in f.readlines():
    line = line.strip()
    key = line.split(' ')[0]
    dat2[key] = line.split(' ')[1:]
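
The key mismatch described above can be reproduced without any files. A small sketch, using two hard-coded lines (shortened from the sample data) in place of what the file iteration would yield:

```python
# Lines as they would come out of the files, newline included.
doc1_line = "1000002 130:0.0839656 107:0.185613\n"
doc2_line = "1000002\n"  # key only, nothing after it

# doc1: the separating space means the first field is a clean key.
key1 = doc1_line.split(' ')[0]

# doc2: there is no space to split on, so the newline stays glued to the key.
key2_raw = doc2_line.split(' ')[0]
key2_clean = doc2_line.strip().split(' ')[0]

print(repr(key1))        # '1000002'
print(repr(key2_raw))    # '1000002\n'
print(repr(key2_clean))  # '1000002'
print(key1 == key2_raw, key1 == key2_clean)  # False True
```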

You can use the pop operation to get the first item of a list, like this:

def read_stem(f):
    res = {}
    for line in f.readlines():
        # split() with no argument also drops the trailing newline
        items = line.strip().split()
        # pop(0) removes the key; what remains is the list of items
        res[items.pop(0)] = items
    return res

with open('stem.data') as f:
    dat1 = read_stem(f)

with open('stem.info') as f:
    dat2 = read_stem(f)

with open('myfile', 'w') as f:
    for key in dat1.keys():
        f.write("%s %s\n" % (key, ' '.join(dat1[key] + dat2[key])))
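
To see what pop(0) leaves behind, here is a short sketch on two hard-coded lines (shortened from the sample data, not read from the files above):

```python
fields = "1000001 10:0.471669 250:0.127552\n".strip().split()
key = fields.pop(0)   # removes and returns the first element
print(key, fields)    # 1000001 ['10:0.471669', '250:0.127552']

# A key-only line leaves an empty list behind, which concatenates harmlessly.
only_key = "1000002\n".strip().split()
key2 = only_key.pop(0)
print(key2, only_key)  # 1000002 []
```

Note that a completely blank line would produce an empty list, and pop(0) on an empty list raises IndexError, so blank lines would need to be skipped first.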

In your code, the key for the key-only row of the second file was '1000002\n', not '1000002'; that is the likely cause. This works:

with open('doc1', 'r') as f:
    file1_lines = f.readlines()
with open('doc2', 'r') as f:
    file2_lines = f.readlines()

dat1 = {}
for line in file1_lines:
    # strip first so the trailing newline never ends up in a key
    dat1[line.strip().split(' ')[0]] = line.strip().split(' ')[1:]

dat2 = {}
for line in file2_lines:
    dat2[line.strip().split(' ')[0]] = line.strip().split(' ')[1:]

print(dat1)
print(dat2)

with open('res.txt', 'w') as resfile:
    for key in dat1.keys():
        # join key and items from both files with single spaces
        merged = ' '.join([key] + dat1[key] + dat2[key])
        print(merged)
        resfile.write(merged + '\n')
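
If the two files cannot be guaranteed to contain the same keys, a lookup with a default avoids the exception altogether. A minimal sketch of that idea, assuming in-memory lists of lines instead of files; parse and merge_lines are hypothetical helper names, not part of the code above:

```python
def parse(lines):
    """Map each line's first field to the list of its remaining fields."""
    table = {}
    for line in lines:
        fields = line.split()  # no-argument split() also discards the newline
        if fields:             # skip blank lines
            table[fields[0]] = fields[1:]
    return table


def merge_lines(lines1, lines2):
    """Append the items of lines2 to the matching line of lines1."""
    dat1, dat2 = parse(lines1), parse(lines2)
    merged = []
    for key, items in dat1.items():
        # .get() returns [] when the key is absent from the second document
        merged.append(' '.join([key] + items + dat2.get(key, [])))
    return merged


doc1 = ["1000001 10:0.471669 250:0.127552\n", "1000002 130:0.0839656\n"]
doc2 = ["1000001 161:0.115664\n", "1000002\n"]
for row in merge_lines(doc1, doc2):
    print(row)
```

The output order follows the order of the first document, since dictionaries preserve insertion order in Python 3.7 and later.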

You can use:

doc1_name = 'doc1'
doc2_name = 'doc2'

def get_key_and_value(key_value_list):
    if len(key_value_list) == 2:
        # list has key and values
        key, value = key_value_list
    elif len(key_value_list) == 1:
        # list only has key
        key, value = key_value_list[0], ''
    else:
        # should not happen!
        key, value = '', ''
    return key,value

def join_dict(key, value, _dict, sep=' '):
    if key in _dict.keys():
        _dict[key] = sep.join((_dict[key], value))
    else:
        _dict[key] = value

result = {}
with open(doc1_name, 'r') as doc1, \
     open(doc2_name, 'r') as doc2:
         doc1_lines = doc1.readlines()
         doc2_lines = doc2.readlines()

for list_of_lines in (doc1_lines, doc2_lines):
    for line in list_of_lines:
        # The .strip('\n') removes the \n at the end
        # and the .split(' ', 1) split only once
        key_value = line.strip('\n').split(' ', 1)
        # split the lines only once to get the keys:
        key, value = get_key_and_value(key_value)
        # this can be ignored if it is known that the keys will be the same
        join_dict(key, value, result)

# order the keys (sorted() works on both Python 2 and Python 3,
# whereas result.keys().sort() fails on Python 3)
ordered_keys = sorted(result)
# and write them to a file
with open('+'.join((doc1_name,doc2_name)),'w') as output:
    for key in ordered_keys:
        output.write(' '.join((key, result[key]))+'\n')

doc1

1000001 10:0.471669 250:0.127552 30:0.218773 64:0.249413
1000002 130:0.0839656 107:0.185613 30:0.446355 110:0.38011
1000003 1:0.0835855 1117:0.0647112 302:0.0851354 46:0.0601825 48:0.098907 516:0.167713

doc2

1000001 161:0.115664 207:0.136537 294:0.0974809 301:0.199868
1000002
1000003 555:0.0585849 91:0.0164101

doc1+doc2

1000001 10:0.471669 250:0.127552 30:0.218773 64:0.249413 161:0.115664 207:0.136537 294:0.0974809 301:0.199868
1000002 130:0.0839656 107:0.185613 30:0.446355 110:0.38011 
1000003 1:0.0835855 1117:0.0647112 302:0.0851354 46:0.0601825 48:0.098907 516:0.167713 555:0.0585849 91:0.0164101
