
How to convert an output file into an array

This might be a trivial question, but I can't seem to find a good solution.

I have the output of a program in the format "output.file". It looks like this:

3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00 1.9154e+02 5.2270e+01 1.7820e+02 -9.6401e+01 -3.8095e+01 1.5210e+02 -5.4532e+01 2.6628e+01 -1.0989e+01 -8.1933e+01 -6.6642e-01 1.8158e+01 2.2515e+01 -5.9261e+00 6.8567e+00 7.2896e+00 1.2575e+01 -1.1400e+01 1.7467e+01 4.1609e+00 -6.0523e+00 -1.8691e+01 3.5305e+01 4.0516e+00 2.9715e+00 1.0701e+01 -1.3146e+01 -1.1101e+00
1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01 -9.3580e+01 1.5947e+02 4.8274e+01 1.3447e+02 -1.5060e+02 -7.6796e+01 1.3185e+02 -5.3275e+01 2.5539e+01 -6.5738e+01 -6.6355e+01 4.8942e+01 -1.3249e+01 6.7675e+01 -1.2348e+01 -4.3005e+01 2.1516e+02 -2.3099e+01 -8.0767e+00 2.2402e+01 -5.9237e+01 4.4889e+00 -1.2909e+02 4.5721e+01 -9.9285e+01 5.9332e+01 -5.7431e+01 -3.6852e+01 -1.7430e+02
3c18FH_A.pdb A 5 285 1.2576e+02 6.3883e+00 1.3145e+01 8.2794e+01 -5.0494e+01 5.9305e+01 1.4713e+01 6.8420e+00 6.6720e+01 5.1087e+00 -1.7846e+01 7.4458e+00 -1.9514e+00 7.8637e+00 -2.9961e+00 -7.0192e+00 9.0216e-02 -7.2202e+00 1.4839e+01 -4.0826e+00 1.3694e+01 -2.8499e+00 4.2015e+00 -6.8598e-01 5.8514e+00 -7.3843e+00 5.2737e-02 -4.9425e-03 2.9360e+00 4.7973e+00 6.2879e+00
.
.
.

The output has over 6000 rows (one row for each pdb file) and I am trying to convert this into an array of shape [6000, 35], so that every row contains the data of one file (in the example those would be the three files "3cp0FH_A.pdb", "1xhdFH_A.pdb" and "3c18FH_A.pdb") and every column holds one data point of that file (except the first 4 columns). Taking the example above, the first row of the array would look like this:

[3cp0FH_A.pdb, A, 1, 62, 7.5635e+01, 8.9632e+01, 1.9255e+00, 1.9154e+02, 5.2270e+01, 1.7820e+02, -9.6401e+01, -3.8095e+01, 1.5210e+02, etc.]

I already figured out how to read output.file into a list where every entry is one row of the file. I was even able to separate the values by commas. So if I type in:

>>> list[0]

I'd get:

'3cp0FH_A.pdb,A,1,62,7.5635e+01,8.9632e+01,1.9255e+00,1.9154e+02,5.2270e+01,1.7820e+02,-9.6401e+01,-3.8095e+01,1.5210e+02,-5.4532e+01,2.6628e+01,-1.0989e+01,-8.1933e+01,-6.6642e-01,1.8158e+01,2.2515e+01,-5.9261e+00,6.8567e+00,7.2896e+00,1.2575e+01,-1.1400e+01,1.7467e+01,4.1609e+00,-6.0523e+00,-1.8691e+01,3.5305e+01,4.0516e+00,2.9715e+00,1.0701e+01,-1.3146e+01,-1.1101e+00\n'

What I can't figure out is how to convert this list into an array so that each string/value that is separated by a comma is in its own column.

So right now your list elements are strings, and what you actually want is for them to be lists containing all your data points. To do that you can do the following:

for i in range(len(input_list)):
    new_row = input_list[i].split(',')
    # Optionally, translate the numbers from column 4 on to floats
    new_row[4:] = [float(v) for v in new_row[4:]]
    input_list[i] = new_row

This would modify your list in place so that it replaces whatever was in it before. This is also a pure Python solution, not involving numpy (though this should give you some ideas on how to get to a numpy solution if desired).
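If the comma-joined strings are only an intermediate step, the same idea can be applied directly to the whitespace-separated lines of the file. The following is a minimal sketch, assuming columns 2 and 3 always hold integers and every later column holds a float; `parse_rows` and the shortened `sample` lines are made up for illustration.

```python
def parse_rows(lines):
    """Turn whitespace-separated lines into [name, chain, int, int, float, ...] rows."""
    rows = []
    for line in lines:
        fields = line.split()      # no argument: splits on any whitespace, drops the newline
        if not fields:             # skip blank lines
            continue
        name, chain = fields[0], fields[1]
        ints = [int(v) for v in fields[2:4]]      # assumption: columns 2 and 3 are integers
        floats = [float(v) for v in fields[4:]]   # assumption: the rest are floats
        rows.append([name, chain, *ints, *floats])
    return rows


# Shortened sample rows (only the first three float columns are kept).
sample = [
    "3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00\n",
    "1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01 -9.3580e+01\n",
]
rows = parse_rows(sample)
print(rows[0])
```

For the real file, the same function could be fed an open file object, for example `parse_rows(open("output.file"))`, since iterating over a file yields its lines.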

Copy-n-paste your sample:

In [26]: txt = """3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00 1.9154e+0
 ...
    ...: """

Simplest load:

In [27]: np.genfromtxt(txt.splitlines())                                        
Out[27]: 
array([[        nan,         nan,  1.0000e+00,  6.2000e+01,  7.5635e+01,
         8.9632e+01,  1.9255e+00,  1.9154e+02,  5.2270e+01,  1.7820e+02,
        -9.6401e+01, -3.8095e+01,  1.5210e+02, -5.4532e+01,  2.6628e+01,
        -1.0989e+01, -8.1933e+01, -6.6642e-01,  1.8158e+01,  2.2515e+01,
        -5.9261e+00,  6.8567e+00,  7.2896e+00,  1.2575e+01, -1.1400e+01,
         1.7467e+01,  4.1609e+00, -6.0523e+00, -1.8691e+01,  3.5305e+01,
         4.0516e+00,  2.9715e+00,  1.0701e+01, -1.3146e+01, -1.1101e+00],
...])
In [28]: _.shape                                                                
Out[28]: (3, 35)

The default load format is float, so the initial 2 columns are rendered as nan. loadtxt would throw an error for those entries.

You could separate out the integer column with:

In [32]: Out[27][:,2]                                                           
Out[32]: array([1., 3., 5.])

and the numeric columns (the two integer columns followed by the float data) with:

In [33]: Out[27][:,2:].shape                                                    
Out[33]: (3, 33)
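The slicing above can be taken one step further to separate the two integer columns from the float data. This is only a sketch, assuming the same column layout as the sample; the `lines` list is a shortened stand-in for the real file, and the names `ints` and `floats` are illustrative.

```python
import numpy as np

# Shortened sample rows (only two float columns kept for brevity).
lines = [
    "3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01",
    "1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01",
]

arr = np.genfromtxt(lines)        # default float dtype: the two label columns become nan
ints = arr[:, 2:4].astype(int)    # columns 2 and 3 hold whole numbers
floats = arr[:, 4:]               # everything after that is float data

print(ints.shape, floats.shape)
```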

With usecols you could load the label columns separately:

In [35]: np.genfromtxt(txt.splitlines(), dtype=None, usecols=[0,1,2], encoding=None)                                                                   
Out[35]: 
array([('3cp0FH_A.pdb', 'A', 1), ('1xhdFH_A.pdb', 'A', 3),
       ('3c18FH_A.pdb', 'A', 5)],
      dtype=[('f0', '<U12'), ('f1', '<U1'), ('f2', '<i8')])
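Putting the two loads together might look like the sketch below: one `genfromtxt` call for the label columns and one for the numeric columns, so the nan placeholders never appear. The shortened `lines` list and the variable names are assumptions for illustration; for the real file the column indices would run up to 34.

```python
import numpy as np

# Shortened sample rows (only two float columns kept for brevity).
lines = [
    "3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01",
    "1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01",
]

# Label columns: dtype=None lets genfromtxt pick a type per column,
# which yields a structured array with fields f0, f1, f2.
labels = np.genfromtxt(lines, dtype=None, usecols=[0, 1, 2], encoding=None)

# Numeric columns only, loaded as a plain 2-D float array.
data = np.genfromtxt(lines, usecols=[2, 3, 4, 5])

print(labels["f0"])   # file names
print(data.shape)     # one row per file, one column per numeric value
```

Row `i` of `data` then belongs to the file named in `labels["f0"][i]`.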
