Complex Data Manipulation in Python

I have three files containing real data, pseudo data, and values associated with the real data.

File_one has two columns: the first holds the real data and the second holds the translated (pseudo) data. That is, each real value is assigned a pseudo value.

col[0] col[1]
123     0
234     1
345     2
456     3
567     4
678     5

File_two holds pairs of pseudo values. For example, 0 is used in place of 123, so the pseudo pair [0, 1] stands for the real pair [123, 234].

col[0]  col[1]
0        2
0        3
0        5
2        4
5        1

So col[0] and col[1] of file_two are the keys, and the corresponding values are in col[0] of file_one.

Now I have to match the pseudo value pairs from file_two with the real data in col[0] of file_one and save the output to a new file, which we name file_four. Each pair occurs only once.

col[0]  col[1]
123     345
123     456
123     678
345     567
678     234

Now file_three comes into the picture. File_three has three columns.

col[0] and col[1] contain the same pairs as file_four, but they also contain many other pairs that are not present in file_four.

File_three:

col[0]  col[1]  col[2]
123     345       54
345     262       65
123     456       54
2456    2467      98
123     678       46
7845    2458      631
345     567       153
3456    3673      94
678     234       5

Finally, I need to match the pairs of file_four (col[0], col[1]) against file_three, pull the value from col[2] of file_three, and generate a new output_file with the pairs of file_four as keys and col[2] of file_three as values.

In the following code I am only trying to handle the first two files:

from collections import defaultdict

d1 = dict()
d2 = dict()

with open('input1.txt', 'r') as file1:
    for row in file1:
        c0, c1 = row.split()[:2]
        d1[c1] = c0
with open('input2.txt', 'r') as file2:
    for row in file2:
        c0, c1 = row.split()[:2]
        d2[(c0, c1)] = [d1[c1], d1[c1]]

#for k, v in sorted(d2.items()):
    #print '\t'.join(v)
print d2

Error:

Key Error: 'key' 

It's the same error even if the for loop is uncommented and the last print is commented out.

You don't have matching keys because d1 contains pairs as keys, while d2 contains single values.

This line looks wrong:

    key =  col[0], col[1]
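
To illustrate the mismatch described above, here is a minimal sketch with made-up tables. The names by_pair and by_pseudo are hypothetical and do not come from the original code, and the example uses Python 3 syntax rather than the Python 2 seen elsewhere in this thread. A dictionary keyed by tuples cannot be queried with a single value, so the lookup raises KeyError:

```python
# Hypothetical tables: one keyed by a (real, pseudo) tuple, one by pseudo value.
by_pair = {('123', '0'): '123'}
by_pseudo = {'0': '123'}

# A single pseudo value is never equal to a tuple key.
found_in_pair_table = '0' in by_pair
found_in_pseudo_table = '0' in by_pseudo

try:
    by_pair['0']            # same failure mode as in the question
    raised = False
except KeyError:
    raised = True

print(found_in_pair_table, found_in_pseudo_table, raised)
```

Checking membership with `in` before indexing, as done here, is one way to find out which key is missing.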

For d1, use file1 column 1 for the keys and column 0 for the values, creating a lookup table:

f1 = [(123,0),(234,1),(345,2),(456,3),(567,4),(678,5)]
f2 = [(0,2),(0,3),(0,5),(2,4),(5,1)]

d1 = {c1:c0 for c0,c1 in f1}

That allows you to use the file2 column values to look up the values in d1:

d2 = {(c0, c1):[d1[c0], d1[c1]] for c0, c1 in f2}
print d2

>>>
{(5, 1): [678, 234], (0, 3): [123, 456], (0, 5): [123, 678], (0, 2): [123, 345], (2, 4): [345, 567]}
>>>

Your code for file 1 and file 2, refactored:

d1, d2 = dict(), dict()
with open('inputfile1.txt', 'r') as file1:
    for row in file1:
        c0, c1 = row.strip().split()[:2]
        d1[c1] = c0

with open('inputfile2.txt', 'r') as file2:
    for row in file2:
        c0, c1 = row.strip().split()[:2]
        d2[(c0, c1)] = [d1[c0], d1[c1]]

>>> for k, v in sorted(d2.items()):
    print '\t'.join(v)


123 345
123 456
123 678
345 567
678 234
>>> 
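
The code above stops at file_four. The remaining step from the question, pulling col[2] from file_three for each file_four pair, could be sketched as follows. This is only a sketch under some assumptions: the rows are the sample data from the question held in lists rather than read from files, the helper names translate_pairs and attach_values are made up for illustration, pairs are treated as ordered, pairs missing from file_three are skipped, and the syntax is Python 3.

```python
# Sample rows copied from the question; real code would read them from the files.
file_one = [(123, 0), (234, 1), (345, 2), (456, 3), (567, 4), (678, 5)]
file_two = [(0, 2), (0, 3), (0, 5), (2, 4), (5, 1)]
file_three = [
    (123, 345, 54), (345, 262, 65), (123, 456, 54), (2456, 2467, 98),
    (123, 678, 46), (7845, 2458, 631), (345, 567, 153), (3456, 3673, 94),
    (678, 234, 5),
]

def translate_pairs(mapping_rows, pair_rows):
    """Replace each pseudo value in pair_rows with its real value."""
    lookup = {pseudo: real for real, pseudo in mapping_rows}
    return [(lookup[a], lookup[b]) for a, b in pair_rows]

def attach_values(pairs, value_rows):
    """Add the third column of value_rows to every pair found there."""
    values = {(a, b): value for a, b, value in value_rows}
    return [(a, b, values[(a, b)]) for a, b in pairs if (a, b) in values]

file_four = translate_pairs(file_one, file_two)
output_rows = attach_values(file_four, file_three)

# Tab-separated lines, ready to be written to an output file.
for row in output_rows:
    print('\t'.join(str(item) for item in row))
```

Keying the file_three dictionary on the (col[0], col[1]) tuple keeps each lookup to a single dictionary access, so the extra pairs in file_three cost nothing beyond building the table.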

Unpacking values/items during assignment:

>>> 
>>> x, y, z = [1, 2, 3]
>>> print x, y, z
1 2 3
>>> x, y = [1, 2, 3]

Traceback (most recent call last):
  File "<pyshell#259>", line 1, in <module>
    x, y = [1, 2, 3]
ValueError: too many values to unpack
>>> 
>>> a, b, _, _, _, _ = '1 2 3 4 5 6'.split()
>>> print a, b, _
1 2 6
>>> 
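
The session above is Python 2. In Python 3, a starred target can collect the leftover items, which avoids both the ValueError and the row of underscores. A small sketch (the variable names are only illustrative):

```python
fields = '1 2 3 4 5 6'.split()

# The starred name absorbs whatever is left after the first two items.
a, b, *rest = fields

# Slicing, as in the refactored code above, works in both Python 2 and 3.
c0, c1 = fields[:2]

print(a, b, rest)
```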
