
awk or sed command to parse data from multiple files and combine them using a specific format

I need to parse the output of a chemistry program run with different parameters and combine the information of interest in a specific format.

Each output file from the program looks like the following table. It gives the population of the protonated and unprotonated species of each residue at a particular pH (here pH=0):

   Residue Number     State  0     State  1     State  2     State  3     State  4
-----------------------------------------------------------------------------------
Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)
Residue: HIP 15   0.900000 (2) 0.080000 (1) 0.020000 (1)
Residue: AS4 18   0.010085 (0) 0.486042 (1) 0.004335 (1) 0.495922 (1) 0.003615 (1)
Residue: GL4 35   0.000000 (0) 0.581343 (1) 0.000360 (1) 0.368002 (1) 0.050295 (1)
Residue: AS4 48   0.022640 (0) 0.520073 (1) 0.018440 (1) 0.425152 (1) 0.013695 (1)
Residue: AS4 52   0.038725 (0) 0.517533 (1) 0.113676 (1) 0.280601 (1) 0.049465 (1)
Residue: AS4 66   1.000000 (0) 0.000000 (1) 0.000000 (1) 0.000000 (1) 0.000000 (1)
Residue: AS4 87   0.004295 (0) 0.439747 (1) 0.010535 (1) 0.524678 (1) 0.020745 (1)
Residue: AS4 101  0.000105 (0) 0.504673 (1) 0.013110 (1) 0.478517 (1) 0.003595 (1)
Residue: AS4 119  0.014240 (0) 0.488767 (1) 0.007100 (1) 0.483272 (1) 0.006620 (1)

I have one file like this for each pH (all files have exactly the same residues and states; only the populations change). Now I would like to extract the deprotonated fraction for every residue. The deprotonated fraction corresponds to the populations that are followed by (0): for example, for GL4 7 at pH=0 it is 0.000410 (state 0), and for AS4 66 it is 1.000000. In fact it is state 0 for all residues EXCEPT HIP 15: there the deprotonated fraction is marked with (1) and corresponds to states 1 and 2. In the example above it is 0.080000 + 0.020000 = 0.1.

I then need to combine this information from the different files into a single file that looks like this:

#     pH     GLU7    HIS15    ASP18    GLU35    ASP48    ASP52    ASP66    ASP87   ASP101   ASP119
   0.000    0.000    0.100    0.010    0.000    0.023    0.039    1.000    0.004    0.000    0.014
   1.000    0.006    0.140    0.098    0.000    0.276    0.312    1.000    0.015    0.002    0.069

Each column corresponds to a residue, and each row to a pH (i.e. the information from a single file; here I show the information from just two files).

I tried to come up with an awk one-liner, but I am a beginner and I am not sure how to proceed. Actually, I don't know whether awk is the best tool for this job. Perhaps sed and grep, or python, would be better. I will need to do this kind of parsing several times with a number of different outputs (they all look the same, although the residues will change), so I would like a way to automate this while keeping some flexibility.

Any suggestions or comments are welcome; I would really appreciate help in sorting out this problem.

Many thanks in advance!

You can concatenate all the files into a single file with a for loop, then use this earlier Stack Overflow solution to transpose the rows into columns:

An efficient way to transpose a file in Bash
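For comparison, the transposition step itself can be sketched in Python with zip. This is only an illustration of turning rows into columns; the function name transpose and the sample rows are made up for the example, and it is not the Bash solution referred to above.

```python
def transpose(rows):
    """Turn a list of rows into a list of columns.

    zip stops at the shortest row, so rows are assumed to have equal length.
    """
    return [list(col) for col in zip(*rows)]


# Hypothetical sample: one row per residue, to be flipped into columns.
rows = [["GL4 7", "0.000410"], ["AS4 66", "1.000000"]]
for column in transpose(rows):
    print(column)
```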

It's not completely clear what you want, but Python's str.split method could be of use to you. If called without any arguments, it splits on whitespace (treating a run of several spaces as a single separator).

So this line for example,

Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)

can be split like this,

a = 'Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)'
l = a.split()
print(l)

['Residue:', 'GL4', '7', '0.000410', '(0)', '0.453512', '(1)', '0.004275', '(1)', '0.535908', '(1)', '0.005895', '(1)']

You can then access the values you want and work on them. Calling float and int on the strings (e.g. float('0.000410')) converts them to numbers. For the '(1)', you can strip the parentheses first: int('(1)'[1:-1])
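Putting those pieces together, a per-line parser might look like the sketch below. The function name deprotonated_fraction and the special mapping are hypothetical, introduced only for illustration; the rule that HIP uses the (1) label while every other residue uses (0) is taken from the question.

```python
def deprotonated_fraction(line, special=None):
    """Return (residue_name, residue_number, fraction) for one 'Residue:' line.

    `special` maps a residue name to the state label that marks its
    deprotonated states; every other residue is assumed to use label 0.
    """
    if special is None:
        special = {"HIP": 1}  # assumption taken from the question
    fields = line.split()
    name, number = fields[1], int(fields[2])
    label = special.get(name, 0)
    total = 0.0
    # Fields after the residue number alternate: population, "(label)", ...
    for pop, tag in zip(fields[3::2], fields[4::2]):
        if int(tag[1:-1]) == label:
            total += float(pop)
    return name, number, total


line = "Residue: HIP 15   0.900000 (2) 0.080000 (1) 0.020000 (1)"
print(deprotonated_fraction(line))
```

Summing every population with the matching label handles both cases: a single (0) entry for most residues, and the two (1) entries for HIP.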

This awk script should get you started. To get the desired output, you will have to replace the filename with the corresponding pH value. I also ignored lines that contain no (0) state, since you did not specify what to do with those; their cells are left empty.

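# Process data lines only; this skips the header, the dashed rule and blank lines.
$1 != "Residue:" { next }

{
    key = $2 " " $3
    # Remember files and residues in the order they are first seen,
    # because "for (x in array)" iterates in an unspecified order.
    if (!(FILENAME in seenfile)) { seenfile[FILENAME] = 1; files[++nfiles] = FILENAME }
    if (!(key in seencol))       { seencol[key] = 1; cols[++ncols] = key }
    # Fields 4, 6, 8, ... are populations; fields 5, 7, 9, ... are their labels.
    for (i = 5; i <= NF; i += 2) {
        if ($i == "(0)") {
            data[key, FILENAME] = $(i-1)
        }
    }
}

END {
    # Header row: one column per residue.
    printf("%10s", "filename")
    for (c = 1; c <= ncols; c++) printf("%10s", cols[c])
    print ""
    # One row per input file.
    for (f = 1; f <= nfiles; f++) {
        printf("%10s", files[f])
        for (c = 1; c <= ncols; c++) printf("%10s", data[cols[c], files[f]])
        print ""
    }
}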
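Since the question also mentions Python, the whole task could be sketched there as well, including the HIP case that the awk script leaves empty. Everything named here is an assumption for illustration: the DISPLAY_NAMES mapping is inferred from the header of the desired output, DEPROT_LABEL encodes the HIP rule from the question, and the pH is supplied by the caller because the sample output does not contain it.

```python
# Hypothetical mapping from the program's residue names to the names
# wanted in the header, inferred from the example table in the question.
DISPLAY_NAMES = {"GL4": "GLU", "AS4": "ASP", "HIP": "HIS"}
# State label that marks deprotonated states; residues not listed use 0.
DEPROT_LABEL = {"HIP": 1}


def parse_output(text):
    """Return a list of (column_name, deprotonated_fraction) in file order."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "Residue:":
            continue  # skip the header, the dashed rule and blank lines
        name, number = fields[1], fields[2]
        label = "(%d)" % DEPROT_LABEL.get(name, 0)
        # Populations and labels alternate after the residue number.
        frac = sum(float(p) for p, tag in zip(fields[3::2], fields[4::2])
                   if tag == label)
        result.append((DISPLAY_NAMES.get(name, name) + number, frac))
    return result


def build_table(outputs):
    """outputs: iterable of (pH, file_text) pairs; returns the table as text."""
    lines = []
    for ph, text in sorted(outputs):
        row = parse_output(text)
        if not lines:  # write the header once, from the first file
            lines.append("#%7s" % "pH" + "".join("%9s" % n for n, _ in row))
        lines.append("%8.3f" % ph + "".join("%9.3f" % f for _, f in row))
    return "\n".join(lines)


sample = """\
   Residue Number     State  0     State  1     State  2
-----------------------------------------------------------
Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)
Residue: HIP 15   0.900000 (2) 0.080000 (1) 0.020000 (1)
"""
print(build_table([(0.0, sample)]))
```

For real runs, each file's contents would be read with open(path).read() and paired with its pH; how the pH is recovered (for instance from the filename) depends on a naming scheme the question does not give.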
