
awk/grep certain parts of a specific column

I have a problem that I am at a loss to solve. I have three-column, tab-separated data, such as:

abs nmod+n+n-commitment-n   349.200023
abs nmod+n+n-a-commitment-n 333.306429
abs into+ns-j+vn-pass-rb-divide-v   295.57316
abs nmod+n+ns-commitment-n  182.085018
abs nmod+n+n-pledge-n   149.927391
abs nmod+n+ns-reagent-n 142.347358

I need to isolate the last two "elements" of the second column (the pieces separated by "-"). My desired result is a 4-column output that contains only the rows whose second column ends with "-n".

For example:

abs nmod+n+n   commitment-n   349.200023
abs nmod+n+n-a   commitment-n 333.306429
abs nmod+n+ns   commitment-n  182.085018
abs nmod+n+n   pledge-n   149.927391
abs nmod+n+ns   reagent-n 142.347358

In this case, is there an awk, grep, or other approach that can help? The files are approx. 500 MB, so they are not huge, but not small either. Thanks for any insight.

With this you can check if the 2nd column ends with -n and then print the lines:

$ awk '$2~/-n$/' file
abs nmod+n+n-commitment-n   349.200023
abs nmod+n+n-a-commitment-n 333.306429
abs nmod+n+ns-commitment-n  182.085018
abs nmod+n+n-pledge-n   149.927391
abs nmod+n+ns-reagent-n 142.347358

To split the second field so that the last two elements are isolated, you can use:

awk 'BEGIN{OFS=FS="\t"}
     $2~/-n$/ {
               size=split($2,a,"-");
               # join all pieces but the last two, without a leading "-"
               for (i=1; i<=size-2; i++) first=(i==1 ? a[i] : first"-"a[i]);
               second=a[size-1]"-"a[size];
               print $1,first,second,$3;
               first=second=""
              }' file

which returns

$ awk 'BEGIN{OFS=FS="\t"} $2~/-n$/ {size=split($2,a,"-"); for (i=1; i<=size-2; i++) first=(i==1 ? a[i] : first"-"a[i]); second=a[size-1]"-"a[size]; print $1,first,second,$3; first=second=""}' file
abs     nmod+n+n        commitment-n    349.200023
abs     nmod+n+n-a      commitment-n    333.306429
abs     nmod+n+ns       commitment-n    182.085018
abs     nmod+n+n        pledge-n        149.927391
abs     nmod+n+ns       reagent-n       142.347358

Explanation

  • BEGIN{OFS=FS="\t"} sets tab as the input and output field separator.
  • $2~/-n$/ {} matches lines in which the 2nd field ends with "-n" and runs the actions within {} .
  • size=split($2,a,"-") cuts the 2nd field into pieces at each - delimiter and saves them in the a[] array. The number of pieces is stored in the size variable.
  • The for loop and second=a[size-1]"-"a[size] save the data in two blocks: first, every piece before the last two, re-joined with - ; then, the last two pieces.
  • print $1,first,second,$3 prints the four output columns.
  • first=second="" resets the variables for the next line.
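
To make the split() step concrete, here is a small sketch that runs it on a single field value taken from the question. The function name show_pieces is made up for this illustration and is not part of the answer above.

```shell
# Hypothetical helper: show how split() breaks one field on "-".
show_pieces() {
    printf '%s\n' "$1" | awk '{
        size = split($0, a, "-")          # number of pieces
        printf "size=%d\n", size
        for (i = 1; i <= size; i++)       # print each piece with its index
            printf "a[%d]=%s\n", i, a[i]
    }'
}

show_pieces 'nmod+n+n-a-commitment-n'
```

For this value there are four pieces, so the first block is a[1] and a[2] re-joined with "-", and the second block is a[3]"-"a[4].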

Give this one-liner a try (it needs gawk, because gensub is a gawk extension):

awk -F'\t' -v OFS='\t' '$2~/-n$/{$2=gensub(/-([^-]*-n$)/,"\t\\1","g",$2);print}' file

Output with your file (saved as f ):

kent$  awk -F'\t' -v OFS='\t' '$2~/-n$/{$2=gensub(/-([^-]*-n$)/,"\t\\1","g",$2);print}' f
abs     nmod+n+n        commitment-n    349.200023
abs     nmod+n+n-a      commitment-n    333.306429
abs     nmod+n+ns       commitment-n    182.085018
abs     nmod+n+n        pledge-n        149.927391
abs     nmod+n+ns       reagent-n       142.347358
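
If gawk is not available, the same idea can probably be expressed with match() and substr(), which are in POSIX awk. This is only a sketch and not the answerer's code. The function name split_last_two is invented here, and unlike the gensub version it skips a field that contains a single "-" (such as "x-n") instead of printing it unchanged.

```shell
# Hypothetical POSIX awk variant of the gensub one-liner.
split_last_two() {
    awk -F'\t' -v OFS='\t' '
        # match() sets RSTART to the "-" that precedes the last two pieces
        match($2, /-[^-]*-n$/) {
            $2 = substr($2, 1, RSTART - 1) OFS substr($2, RSTART + 1)
            print
        }' "$@"
}

printf 'abs\tnmod+n+n-a-commitment-n\t333.306429\n' | split_last_two
```

Called with a file name ( split_last_two file ) it reads that file; with no arguments it reads standard input, as in the example line.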

Using GNU sed :

sed -r -n '/-n\t[0-9.]*$/{s/(\S+)\t(.*)-([^-]+-\S+)\t(.*)/\1\t\2\t\3\t\4/p}' filename

For your input, it'd produce:

abs     nmod+n+n        commitment-n    349.200023
abs     nmod+n+n-a      commitment-n    333.306429
abs     nmod+n+ns       commitment-n    182.085018
abs     nmod+n+n        pledge-n        149.927391
abs     nmod+n+ns       reagent-n       142.347358
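
With files of about 500 MB it is hard to inspect the result by eye, so a quick sanity check may be worthwhile. The sketch below counts output lines that do not have exactly four tab-separated fields. The name count_bad_rows is hypothetical, and the check assumes the output really is tab-separated.

```shell
# Hypothetical check: how many lines lack exactly 4 tab-separated fields?
count_bad_rows() {
    awk -F'\t' 'NF != 4 { bad++ } END { print bad + 0 }' "$@"
}

# One good 4-column line and one 3-column line, so one bad row is counted.
printf 'abs\tnmod+n+n\tpledge-n\t149.927391\nabs\tx-n\t1.0\n' | count_bad_rows
```

A result of 0 on the real output means every line was split into four columns.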

