Complex data processing in Python

Question

I have three files containing real data, pseudo data, and values associated with the real data.

file_one has two columns: the first holds the real data and the second holds its translation, i.e. each real value is assigned a pseudo value.

col[0] col[1]
123     0
234     1
345     2
456     3
567     4
678     5

file_two contains pairs of pseudo values. For example, 0 is used in place of 123, so the pseudo pair [0, 1] stands for the real pair [123, 234].

col[0]  col[1]
0        2
0        3
0        5
2        4
5        1

So col[0] and col[1] of file_two are the keys, and the corresponding values are in col[0] of file_one.

Now I have to translate the pseudo value pairs in file_two into the real data from col[0] of file_one and save the result to a new file, which we name file_four. Each pair occurs only once.

col[0]  col[1]
123     345
123     456
123     678
345     567
678     234

Now file_three comes into the picture. file_three has three columns.

Its col[0] and col[1] contain the same pairs as file_four, along with many other pairs that are not present in file_four.

file_three

col[0]  col[1]  col[2]
123     345       54
345     262       65
123     456       54
2456    2467      98
123     678       46
7845    2458      631
345     567       153
3456    3673      94
678     234       5

Finally, I need to look up each pair from file_four (col[0], col[1]) in file_three, pull the matching value from col[2], and generate a new output_file that has the file_four pairs as keys and the col[2] values of file_three as values.

In the following code I am only trying to handle the first two files:

from collections import defaultdict

d1 = dict()
d2 = dict()

with open('input1.txt', 'r') as file1:
    for row in file1:
        c0, c1 = row.split()[:2]
        d1[c1] = c0
with open('input2.txt', 'r') as file2:
    for row in file2:
        c0, c1 = row.split()[:2]
        d2[(c0, c1)] = [d1[c1], d1[c1]]

#for k, v in sorted(d2.items()):
    #print '\t'.join(v)
print d2

Error:

Key Error: 'key'

It's the same error even if the for loop is uncommented and the last print is commented out.

Answer 1

You don't have matching keys because d1 contains pairs as keys, while d2 contains single values.

This line looks wrong:

    key =  col[0], col[1]

Answer 2

For d1, use file1 column 1 for the keys and column 0 for the values, creating a lookup table:

f1 = [(123,0),(234,1),(345,2),(456,3),(567,4),(678,5)]
f2 = [(0,2),(0,3),(0,5),(2,4),(5,1)]

d1 = {c1:c0 for c0,c1 in f1}

That allows you to use the file2 column values to look up the values in d1:

d2 = {(c0, c1):[d1[c0], d1[c1]] for c0, c1 in f2}
print d2

>>>
{(5, 1): [678, 234], (0, 3): [123, 456], (0, 5): [123, 678], (0, 2): [123, 345], (2, 4): [345, 567]}
>>>
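
If a pseudo value in file 2 has no entry in d1 (a header line or a stray token would be enough), d1[...] raises KeyError. Here is a minimal sketch in Python 3 syntax that collects such pairs instead of failing. The helper name translate_pairs and the pair (9, 1) are made up for illustration and are not part of your data.

```python
# Python 3 sketch; f1 mirrors file_one from the question.
f1 = [(123, 0), (234, 1), (345, 2), (456, 3), (567, 4), (678, 5)]
# (9, 1) is a hypothetical pair whose pseudo value 9 has no entry in f1.
f2 = [(0, 2), (0, 3), (9, 1)]

# Lookup table: pseudo value -> real value.
lookup = {pseudo: real for real, pseudo in f1}


def translate_pairs(pairs, table):
    """Split pairs into translated ones and ones with an unknown pseudo value."""
    translated, missing = [], []
    for a, b in pairs:
        if a in table and b in table:
            translated.append((table[a], table[b]))
        else:
            missing.append((a, b))
    return translated, missing


translated, missing = translate_pairs(f2, lookup)
print(translated)
print(missing)
```

Printing the missing pairs usually shows quickly which line of the input triggers the KeyError.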

Your code for file 1 and file 2, refactored:

d1, d2 = dict(), dict()
with open('inputfile1.txt', 'r') as file1:
    for row in file1:
        c0, c1 = row.strip().split()[:2]
        d1[c1] = c0

with open('inputfile2.txt', 'r') as file2:
    for row in file2:
        c0, c1 = row.strip().split()[:2]
        d2[(c0, c1)] = [d1[c0], d1[c1]]

>>> for k, v in sorted(d2.items()):
    print '\t'.join(v)


123 345
123 456
123 678
345 567
678 234
>>>
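
The remaining step from the question, matching those pairs against file_three and pulling col[2], could follow the same pattern. This is only a sketch in Python 3 syntax, under a few assumptions: the columns are whitespace-separated, there is no header line, and a pair matches only in the same order (a, b), not reversed. The names file_four, file_three_lines, values and output are placeholders. In your script file_four would come from the values of d2 and the lines from the open file.

```python
# Python 3 sketch; the data mirrors file_four and file_three from the question.
file_four = [("123", "345"), ("123", "456"), ("123", "678"),
             ("345", "567"), ("678", "234")]

file_three_lines = """\
123     345       54
345     262       65
123     456       54
2456    2467      98
123     678       46
7845    2458      631
345     567       153
3456    3673      94
678     234       5
""".splitlines()

# Index file_three by its (col[0], col[1]) pair.
values = {}
for line in file_three_lines:
    c0, c1, c2 = line.split()[:3]
    values[(c0, c1)] = c2

# Keep only the pairs that also occur in file_four, in file_four order.
output = [(a, b, values[(a, b)]) for a, b in file_four if (a, b) in values]

for row in output:
    print("\t".join(row))
```

Building the dictionary from file_three first keeps each lookup constant-time, so the extra pairs in file_three cost nothing beyond reading them once.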

Unpacking values/items during assignment:

>>> 
>>> x, y, z = [1, 2, 3]
>>> print x, y, z
1 2 3
>>> x, y = [1, 2, 3]

Traceback (most recent call last):
  File "<pyshell#259>", line 1, in <module>
    x, y = [1, 2, 3]
ValueError: too many values to unpack
>>> 
>>> a, b, _, _, _, _ = '1 2 3 4 5 6'.split()
>>> print a, b, _
1 2 6
>>>
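
For comparison, Python 3 also allows a starred name to absorb the leftover items, which avoids counting underscores. A short sketch, assuming Python 3 (the session above is Python 2, where this syntax is not available):

```python
# Python 3 only: a starred name collects whatever is left over.
line = "1 2 3 4 5 6"
a, b, *rest = line.split()
print(a, b, rest)

# Slicing before unpacking works in both Python 2 and Python 3.
c0, c1 = line.split()[:2]
print(c0, c1)
```

The slice form is what the refactored file-reading code uses, so extra columns in a row do not cause a ValueError.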
