
awk/grep certain parts of a specific column

I have a problem that I am at a loss to solve. I have 3-column, tab-separated data, such as:

abs nmod+n+n-commitment-n   349.200023
abs nmod+n+n-a-commitment-n 333.306429
abs into+ns-j+vn-pass-rb-divide-v   295.57316
abs nmod+n+ns-commitment-n  182.085018
abs nmod+n+n-pledge-n   149.927391
abs nmod+n+ns-reagent-n 142.347358

I need to isolate the last two "elements" of the second column (the pieces separated by "-"). My desired result is a 4-column output that contains only the rows whose second column ends with "-n", such as:

abs nmod+n+n   commitment-n   349.200023
abs nmod+n+n-a   commitment-n 333.306429
abs nmod+n+ns   commitment-n  182.085018
abs nmod+n+n   pledge-n   149.927391
abs nmod+n+ns   reagent-n 142.347358

Is there an awk, grep, or similar approach that can help here? The files are approx. 500 MB, so they are not huge, but not small either. Thanks for any insight.

With this you can check whether the 2nd column ends with -n and print the matching lines:

$ awk '$2~/-n$/' file
abs nmod+n+n-commitment-n   349.200023
abs nmod+n+n-a-commitment-n 333.306429
abs nmod+n+ns-commitment-n  182.085018
abs nmod+n+n-pledge-n   149.927391
abs nmod+n+ns-reagent-n 142.347358

To split the second field so that the last two elements are isolated, you can use:

awk 'BEGIN{OFS=FS="\t"}
     $2~/-n$/ {
               size=split($2,a,"-");
               # join all pieces except the last two, without a leading "-"
               for (i=1; i<=size-2; i++) first=(i==1 ? a[i] : first"-"a[i]);
               second=a[size-1]"-"a[size];
               print $1,first,second,$3;
               first=second=""
              }' file

which returns

$ awk 'BEGIN{OFS=FS="\t"} $2~/-n$/ {size=split($2,a,"-"); for (i=1; i<=size-2; i++) first=(i==1 ? a[i] : first"-"a[i]); second=a[size-1]"-"a[size]; print $1,first,second,$3; first=second=""}' file
abs     nmod+n+n        commitment-n    349.200023
abs     nmod+n+n-a      commitment-n    333.306429
abs     nmod+n+ns       commitment-n    182.085018
abs     nmod+n+n        pledge-n        149.927391
abs     nmod+n+ns       reagent-n       142.347358

Explanation

  • BEGIN{OFS=FS="\t"} sets tab as both the input and the output field separator.
  • $2~/-n$/ {} matches lines in which the 2nd field ends with "-n" and runs the block within {}.
  • size=split($2,a,"-") cuts the 2nd field into pieces on the - delimiter, saves them in the a[] array, and stores the number of pieces in size.
  • The for loop over i=1..size-2 joins every piece before the last two back together with "-" and stores the result in first; second=a[size-1]"-"a[size] then holds the last two pieces.
  • print $1,first,second,$3 prints the four output columns.
  • first=second="" resets the variables before the next line is processed.
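The same split can probably also be done without a loop, by letting match() find the hyphen that precedes the last two pieces. What follows is only a sketch: the function name split_last_two is made up for illustration, it uses only POSIX awk features (match, RSTART, substr), and unlike the loop version it skips rows whose second field contains fewer than two hyphens.

```shell
# Hypothetical helper (name is an assumption, not from the original answer).
# Reads tab-separated rows from the given files, or from stdin if none are given.
split_last_two() {
  awk 'BEGIN { FS = OFS = "\t" }
       # match() sets RSTART to the position of the hyphen in front of the
       # last two pieces; it fails unless the field ends with "-n"
       match($2, /-[^-]*-n$/) {
           first  = substr($2, 1, RSTART - 1)   # everything before that hyphen
           second = substr($2, RSTART + 1)      # the last two pieces
           print $1, first, second, $3
       }' "$@"
}

# Small demo on two rows taken from the question
printf 'abs\tnmod+n+n-a-commitment-n\t333.306429\nabs\tinto+ns-j+vn-pass-rb-divide-v\t295.57316\n' |
  split_last_two
```

On a real file it would be called as split_last_two file.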

Give this one-liner a try (it needs gawk, because gensub() is a gawk extension):

awk -F'\t' -v OFS='\t' '$2~/-n$/{$2=gensub(/-([^-]*-n$)/,"\t\\1","g",$2);print}' file

Output with your file (saved as f):

kent$  awk -F'\t' -v OFS='\t' '$2~/-n$/{$2=gensub(/-([^-]*-n$)/,"\t\\1","g",$2);print}' f
abs     nmod+n+n        commitment-n    349.200023
abs     nmod+n+n-a      commitment-n    333.306429
abs     nmod+n+ns       commitment-n    182.085018
abs     nmod+n+n        pledge-n        149.927391
abs     nmod+n+ns       reagent-n       142.347358
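Since the goal is a strict 4-column result and the real files are around 500 MB, it may be worth checking the output mechanically rather than by eye. The sketch below counts rows that do not have exactly four tab-separated fields; the function name count_bad_rows is an assumption made for this example.

```shell
# Hypothetical helper: prints the number of rows whose tab-separated
# field count differs from 4 (prints 0 when every row is well-formed).
count_bad_rows() {
  awk -F'\t' 'NF != 4 { bad++ } END { print bad + 0 }' "$@"
}

# Demo: the second row has only three fields
printf 'abs\tnmod+n+n\tcommitment-n\t349.200023\nabs\tnmod+n+n-pledge-n\t149.927391\n' |
  count_bad_rows
```

It can be appended to any of the commands here, for example as ... | count_bad_rows.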

Using sed (GNU sed, since -r, \t and \S are GNU extensions):

sed -r -n '/-n\t[0-9.]*$/{s/(\S+)\t(.*)-([^-]+-\S+)\t(.*)/\1\t\2\t\3\t\4/p}' filename

For your input, it'd produce:

abs nmod+n+n    commitment-n    349.200023
abs nmod+n+n-a  commitment-n    333.306429
abs nmod+n+ns   commitment-n    182.085018
abs nmod+n+n    pledge-n    149.927391
abs nmod+n+ns   reagent-n   142.347358
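Where GNU sed is not available, a variant along these lines might work. It is a sketch only: it relies on sed -E for extended regular expressions, passes a literal tab through a shell variable instead of \t, and replaces \S with bracket expressions. The function name split_last_two_sed is invented for this example.

```shell
# Hypothetical helper (name is an assumption). Reads rows from stdin.
split_last_two_sed() {
  tab=$(printf '\t')   # literal tab, since \t in a sed script is not portable
  # Groups: \1 first column, \2 prefix of the second column, \3 its last two
  # pieces (must end in -n), \4 third column. Only rows that match are printed.
  sed -E -n "s/^([^${tab}]*)${tab}(.*)-([^-${tab}]*-n)${tab}([^${tab}]*)\$/\1${tab}\2${tab}\3${tab}\4/p"
}

# Demo on two rows taken from the question
printf 'abs\tnmod+n+n-a-commitment-n\t333.306429\nabs\tinto+ns-j+vn-pass-rb-divide-v\t295.57316\n' |
  split_last_two_sed
```

On a real file it would be called as split_last_two_sed < filename.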
