awk or sed command to parse data from multiple files and combine them using a specific format

I need to parse the output of a chemistry program run with different parameters and combine the information of interest in a specific format.

Each output file from the program looks like the following table. It gives the population of protonated and unprotonated species (residues) at a particular pH (here, pH=0):

   Residue Number     State  0     State  1     State  2     State  3     State  4
-----------------------------------------------------------------------------------
Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)
Residue: HIP 15   0.900000 (2) 0.080000 (1) 0.020000 (1)
Residue: AS4 18   0.010085 (0) 0.486042 (1) 0.004335 (1) 0.495922 (1) 0.003615 (1)
Residue: GL4 35   0.000000 (0) 0.581343 (1) 0.000360 (1) 0.368002 (1) 0.050295 (1)
Residue: AS4 48   0.022640 (0) 0.520073 (1) 0.018440 (1) 0.425152 (1) 0.013695 (1)
Residue: AS4 52   0.038725 (0) 0.517533 (1) 0.113676 (1) 0.280601 (1) 0.049465 (1)
Residue: AS4 66   1.000000 (0) 0.000000 (1) 0.000000 (1) 0.000000 (1) 0.000000 (1)
Residue: AS4 87   0.004295 (0) 0.439747 (1) 0.010535 (1) 0.524678 (1) 0.020745 (1)
Residue: AS4 101  0.000105 (0) 0.504673 (1) 0.013110 (1) 0.478517 (1) 0.003595 (1)
Residue: AS4 119  0.014240 (0) 0.488767 (1) 0.007100 (1) 0.483272 (1) 0.006620 (1)

I have one file like this for each pH (all files have exactly the same residues and states; only the populations change). Now I would like to extract the deprotonated fraction for all residues. The deprotonated fraction corresponds to the populations that have a (0) after their number: for example, for GL4 7 at pH=0 it is 0.000410 (which corresponds to state 0), and for AS4 66 it is 1.000000. In fact it is state 0 for all residues EXCEPT HIP 15: in this case the deprotonated fraction is indicated with (1) and corresponds to states 1 and 2. In the example above it is 0.080000 + 0.020000 = 0.1.

I then need to combine this information from the different files into a single file that looks like this:

#     pH     GLU7    HIS15    ASP18    GLU35    ASP48    ASP52    ASP66    ASP87   ASP101   ASP119
   0.000    0.000    0.100    0.010    0.000    0.023    0.039    1.000    0.004    0.000    0.014
   1.000    0.006    0.140    0.098    0.000    0.276    0.312    1.000    0.015    0.002    0.069

Each column corresponds to a residue, and each row to a pH (i.e. the information from a single file; here I show the information from just two files).

I tried to come up with an awk one-liner, but I am a beginner and I am not sure how to proceed. Actually, I don't know whether awk is the best tool for this job; perhaps sed and grep, or Python, would be better. I will need to do this kind of parsing several times with a number of different outputs (which all look the same, although the residues will change), so I would like a way to automate this with some flexibility.

Please do not hesitate to share any suggestions or comments; I would really appreciate your help in sorting out this problem.

Many thanks in advance!

You can cat all the files into a single file using a for loop, and then use this earlier Stack Overflow solution to transpose the rows into columns:

An efficient way to transpose a file in Bash

It's not completely clear what you want, but Python's split function could be of use to you. If called without any arguments, it splits on whitespace (treating a run of several spaces as a single separator).

So this line, for example,

Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)

can be split like this:

a = 'Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)'
l = a.split()
print(l)

['Residue:', 'GL4', '7', '0.000410', '(0)', '0.453512', '(1)', '0.004275', '(1)', '0.535908', '(1)', '0.005895', '(1)']

You can then access the values you want and work on them. Calling float and int on the strings (e.g. float('0.000410')) converts them to numbers. For '(1)', you can do int('(1)'[1:-1]).
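To make the indexing concrete, here is a minimal sketch of how one residue line could be reduced to its deprotonated fraction. It assumes a rule that is not stated in the question but is consistent with both examples there: the deprotonated states are those with the lowest number in parentheses on the line, i.e. (0) for GL4 and AS4, and (1) for HIP. The function name `deprotonated_fraction` is made up for illustration.

```python
def deprotonated_fraction(line):
    """Sum the populations whose parenthesised count is the lowest on the line.

    Assumption: "deprotonated" means the states with the smallest value in
    parentheses, which gives (0) for GL4/AS4 and (1) for HIP.
    """
    fields = line.split()
    # fields[0] is 'Residue:', fields[1] the residue name, fields[2] its number.
    # After that the fields alternate: population, '(count)', population, ...
    populations = [float(x) for x in fields[3::2]]
    counts = [int(x[1:-1]) for x in fields[4::2]]  # '(1)' -> 1
    lowest = min(counts)
    return sum(p for p, n in zip(populations, counts) if n == lowest)


glu = "Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)"
hip = "Residue: HIP 15   0.900000 (2) 0.080000 (1) 0.020000 (1)"
print("%.6f" % deprotonated_fraction(glu))
print("%.6f" % deprotonated_fraction(hip))
```

If your program uses a different convention for some residue type, the rule inside the function is the only part that needs to change.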

This awk script should get you started. To get the desired output, you will have to replace the filename with the corresponding pH value. I omitted lines that contain no zero state, since you did not specify what to do with those.

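# Skip the column header and the dashed separator line.
/^   Residue/ || /^-----/ { next; }

# For every remaining line, remember the file and the residue, and store
# the population that is followed by "(0)".
{
    filenames[FILENAME] = 1;
    columns[$2 " " $3] = 1;
    # Fields 4, 6, 8, ... are populations; fields 5, 7, 9, ... are the "(n)" tags.
    for (i = 5; i <= NF; i = i + 2) {
        if ($i == "(0)") {
            data[$2 " " $3, FILENAME] = $(i-1);
        }
    }
}

# Print one column per residue and one row per input file.
# Note: awk does not guarantee the iteration order of "for (x in array)".
END {
    printf("%10s", "filename");
    for (col in columns) {
        printf("%10s", col);
    }
    print "";
    for (filename in filenames) {
        printf("%10s", filename);
        for (col in columns) {
            printf("%10s", data[col, filename]);
        }
        print "";
    }
}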
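Since the question also mentions Python, here is a rough sketch of the whole job in that language, including the HIP case and the requested column layout. Several things in it are assumptions rather than facts from the question: the rule that the deprotonated states are those with the lowest value in parentheses on each line, the name mapping in `RESIDUE_NAMES` (inferred from the desired header), and the function names `parse_output` and `build_table`. The pH of each file has to be supplied by the caller, because the question does not say where it is recorded.

```python
# Assumed mapping, inferred from the header of the desired output.
RESIDUE_NAMES = {"GL4": "GLU", "AS4": "ASP", "HIP": "HIS"}


def parse_output(text):
    """Return [(label, deprotonated fraction), ...] for one output file."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "Residue:":
            continue  # skips the column header, the dashed line, blank lines
        name, number = fields[1], fields[2]
        populations = [float(x) for x in fields[3::2]]
        counts = [int(x[1:-1]) for x in fields[4::2]]  # '(1)' -> 1
        lowest = min(counts)  # assumed to mark the deprotonated states
        fraction = sum(p for p, n in zip(populations, counts) if n == lowest)
        result.append((RESIDUE_NAMES.get(name, name) + number, fraction))
    return result


def build_table(outputs):
    """outputs: non-empty list of (pH, file text) pairs. Returns the table."""
    parsed = [(ph, parse_output(text)) for ph, text in sorted(outputs)]
    labels = [label for label, _ in parsed[0][1]]
    lines = ["#%7s" % "pH" + "".join("%9s" % label for label in labels)]
    for ph, residues in parsed:
        lines.append("%8.3f" % ph + "".join("%9.3f" % f for _, f in residues))
    return "\n".join(lines)


# Two residue lines copied from the pH=0 example in the question.
SAMPLE = """\
   Residue Number     State  0     State  1     State  2     State  3     State  4
-----------------------------------------------------------------------------------
Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)
Residue: HIP 15   0.900000 (2) 0.080000 (1) 0.020000 (1)
"""

print(build_table([(0.0, SAMPLE)]))
```

In real use you would read each file with `open(path).read()` and pair it with its pH before calling `build_table`. Unlike the awk version, the columns keep the order in which the residues appear in the first file, and the rows are sorted by pH.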
