How to convert an output file into an array

This might be a trivial question, but I can't seem to find a good solution.

I have the output of a program in the format "output.file". It looks like this:

3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00 1.9154e+02 5.2270e+01 1.7820e+02 -9.6401e+01 -3.8095e+01 1.5210e+02 -5.4532e+01 2.6628e+01 -1.0989e+01 -8.1933e+01 -6.6642e-01 1.8158e+01 2.2515e+01 -5.9261e+00 6.8567e+00 7.2896e+00 1.2575e+01 -1.1400e+01 1.7467e+01 4.1609e+00 -6.0523e+00 -1.8691e+01 3.5305e+01 4.0516e+00 2.9715e+00 1.0701e+01 -1.3146e+01 -1.1101e+00
1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01 -9.3580e+01 1.5947e+02 4.8274e+01 1.3447e+02 -1.5060e+02 -7.6796e+01 1.3185e+02 -5.3275e+01 2.5539e+01 -6.5738e+01 -6.6355e+01 4.8942e+01 -1.3249e+01 6.7675e+01 -1.2348e+01 -4.3005e+01 2.1516e+02 -2.3099e+01 -8.0767e+00 2.2402e+01 -5.9237e+01 4.4889e+00 -1.2909e+02 4.5721e+01 -9.9285e+01 5.9332e+01 -5.7431e+01 -3.6852e+01 -1.7430e+02
3c18FH_A.pdb A 5 285 1.2576e+02 6.3883e+00 1.3145e+01 8.2794e+01 -5.0494e+01 5.9305e+01 1.4713e+01 6.8420e+00 6.6720e+01 5.1087e+00 -1.7846e+01 7.4458e+00 -1.9514e+00 7.8637e+00 -2.9961e+00 -7.0192e+00 9.0216e-02 -7.2202e+00 1.4839e+01 -4.0826e+00 1.3694e+01 -2.8499e+00 4.2015e+00 -6.8598e-01 5.8514e+00 -7.3843e+00 5.2737e-02 -4.9425e-03 2.9360e+00 4.7973e+00 6.2879e+00
.
.
.

The output has over 6000 rows (one row for each pdb file), and I am trying to convert it into an array of shape [6000, 35], so that every row contains the data of one file (in the example above those would be the three files "3cp0FH_A.pdb", "1xhdFH_A.pdb" and "3c18FH_A.pdb") and every column is one data point of that file (except the first 4 columns, which identify it). Taking the example above, the first row of the array would look like this:

[3cp0FH_A.pdb, A, 1, 62, 7.5635e+01, 8.9632e+01, 1.9255e+00, 1.9154e+02, 5.2270e+01, 1.7820e+02, -9.6401e+01, -3.8095e+01, 1.5210e+02, etc.]

I already figured out how to read output.file into a list where every entry is one row of the file. I was even able to separate the values by commas. So if I type:

>>> list[0]

I'd get:

'3cp0FH_A.pdb,A,1,62,7.5635e+01,8.9632e+01,1.9255e+00,1.9154e+02,5.2270e+01,1.7820e+02,-9.6401e+01,-3.8095e+01,1.5210e+02,-5.4532e+01,2.6628e+01,-1.0989e+01,-8.1933e+01,-6.6642e-01,1.8158e+01,2.2515e+01,-5.9261e+00,6.8567e+00,7.2896e+00,1.2575e+01,-1.1400e+01,1.7467e+01,4.1609e+00,-6.0523e+00,-1.8691e+01,3.5305e+01,4.0516e+00,2.9715e+00,1.0701e+01,-1.3146e+01,-1.1101e+00\n'

What I can't figure out is how to convert this list into an array so that each string/value that is separated by a comma is in its own column.

Right now the elements of your list are strings, and what you actually want is for each element to be a list containing all the data points of that row (input_list below stands for your list of lines). To do that, you can do the following:

for i in range(len(input_list)):
    # strip() drops the trailing newline before splitting on commas
    new_row = input_list[i].strip().split(',')
    # Optionally, convert the values from column 4 onward to floats
    new_row[4:] = [float(v) for v in new_row[4:]]
    input_list[i] = new_row

This modifies your list in place, replacing each string with the corresponding list of values. It is a pure Python solution that does not involve NumPy (though it should give you some ideas on how to get to a NumPy solution if desired).
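If a NumPy result is wanted, one possible route is sketched below. It assumes the lines are already comma-separated as in the question, and it uses two shortened sample rows (only the first six columns) for brevity. Because a plain NumPy array holds a single dtype, this sketch keeps the two text labels and the numeric values in separate arrays; the names labels and values are made up for illustration.

```python
import numpy as np

# Two shortened sample lines in the comma-separated form from the question
# (only the first six columns are kept here for brevity).
input_list = [
    "3cp0FH_A.pdb,A,1,62,7.5635e+01,8.9632e+01\n",
    "1xhdFH_A.pdb,A,3,169,1.0565e+02,-9.1260e+01\n",
]

# Split every line into its fields, dropping the trailing newline first.
rows = [line.strip().split(",") for line in input_list]

# Text columns (file name and chain) go into a string array.
labels = np.array([row[:2] for row in rows])

# Everything from column 2 on is numeric and goes into a float array.
values = np.array([[float(v) for v in row[2:]] for row in rows])

print(labels.shape, values.shape)
```

With the real data, values would have one row per pdb file and 33 columns (the two integer columns plus the 31 float columns).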

Copy-n-paste your sample:

In [26]: txt = """3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00 1.9154e+0
 ...
    ...: """

Simplest load:

In [27]: np.genfromtxt(txt.splitlines())                                        
Out[27]: 
array([[        nan,         nan,  1.0000e+00,  6.2000e+01,  7.5635e+01,
         8.9632e+01,  1.9255e+00,  1.9154e+02,  5.2270e+01,  1.7820e+02,
        -9.6401e+01, -3.8095e+01,  1.5210e+02, -5.4532e+01,  2.6628e+01,
        -1.0989e+01, -8.1933e+01, -6.6642e-01,  1.8158e+01,  2.2515e+01,
        -5.9261e+00,  6.8567e+00,  7.2896e+00,  1.2575e+01, -1.1400e+01,
         1.7467e+01,  4.1609e+00, -6.0523e+00, -1.8691e+01,  3.5305e+01,
         4.0516e+00,  2.9715e+00,  1.0701e+01, -1.3146e+01, -1.1101e+00],
...])
In [28]: _.shape                                                                
Out[28]: (3, 35)

The default load dtype is float, so the initial 2 columns, which hold text, are rendered as nan. loadtxt would raise an error for those entries.
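The difference between the two loaders can be checked on a small sample. The following is only a sketch, using two shortened rows (five columns each) instead of the full data:

```python
import numpy as np

# Two shortened sample rows: two text columns followed by three numbers.
lines = [
    "3cp0FH_A.pdb A 1 62 7.5635e+01",
    "1xhdFH_A.pdb A 3 169 1.0565e+02",
]

# genfromtxt falls back to nan wherever a field cannot be parsed as a float.
arr = np.genfromtxt(lines)
print(np.isnan(arr[:, :2]).all())   # the two label columns are all nan
print(arr[:, 2:])                   # the numeric columns load normally

# loadtxt has no such fallback and raises on the text fields.
try:
    np.loadtxt(lines)
    loadtxt_failed = False
except ValueError as exc:
    loadtxt_failed = True
    print("loadtxt raised:", exc)
```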

You could separate out the first integer column with:

In [32]: Out[27][:,2]                                                           
Out[32]: array([1., 3., 5.])

and the numeric columns (everything after the two label columns) with:

In [33]: Out[27][:,2:].shape                                                    
Out[33]: (3, 33)

With usecols you could load the label columns separately:

In [35]: np.genfromtxt(txt.splitlines(), dtype=None, usecols=[0,1,2], encoding=None)                                                                   
Out[35]: 
array([('3cp0FH_A.pdb', 'A', 1), ('1xhdFH_A.pdb', 'A', 3),
       ('3c18FH_A.pdb', 'A', 5)],
      dtype=[('f0', '<U12'), ('f1', '<U1'), ('f2', '<i8')])
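Putting the pieces together, one possible way to end up with both the labels and the float data is to call genfromtxt twice with different usecols. This is a sketch on three shortened rows (six columns each); the variable names and the choice to treat the first four columns as identifiers are assumptions based on the question, not part of the original answer.

```python
import numpy as np

# Three shortened sample rows (the real rows have 35 columns).
lines = [
    "3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01",
    "1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01",
    "3c18FH_A.pdb A 5 285 1.2576e+02 6.3883e+00",
]

# Count the columns from the first row so the sketch adapts to the real file.
n_cols = len(lines[0].split())

# First four columns: file name, chain, and two integers.
# dtype=None lets genfromtxt pick a type per column (structured array).
labels = np.genfromtxt(lines, dtype=None, usecols=[0, 1, 2, 3], encoding=None)

# Remaining columns: plain 2-D float array.
data = np.genfromtxt(lines, usecols=list(range(4, n_cols)))

print(labels["f0"])   # file names
print(data.shape)     # (number of rows, number of float columns)
```

For the real data, passing the file name "output.file" in place of lines should work the same way, since genfromtxt also accepts a path.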
