awk/grep certain parts of a specific column

Question

I have a problem that I am at a loss to solve. I have 3-column, tab-separated data such as:

abs nmod+n+n-commitment-n   349.200023
abs nmod+n+n-a-commitment-n 333.306429
abs into+ns-j+vn-pass-rb-divide-v   295.57316
abs nmod+n+ns-commitment-n  182.085018
abs nmod+n+n-pledge-n   149.927391
abs nmod+n+ns-reagent-n 142.347358

I need to isolate the last two "elements" of the second column. My desired result is a 4-column output that contains only the rows whose second column ends with "-n", such as:

abs nmod+n+n   commitment-n   349.200023
abs nmod+n+n-a   commitment-n 333.306429
abs nmod+n+ns   commitment-n  182.085018
abs nmod+n+n   pledge-n   149.927391
abs nmod+n+ns   reagent-n 142.347358

In this case, is there an awk or grep command, or anything else, that can help? The files are approx. 500 MB, so they are not huge, but not small either. Thanks for any insight.

Answer 1

With this you can check if the 2nd column ends with -n and then print the lines:

$ awk '$2~/-n$/' file
abs nmod+n+n-commitment-n   349.200023
abs nmod+n+n-a-commitment-n 333.306429
abs nmod+n+ns-commitment-n  182.085018
abs nmod+n+n-pledge-n   149.927391
abs nmod+n+ns-reagent-n 142.347358

To split the second field so that the last two elements are isolated, you can use:

awk 'BEGIN{OFS=FS="\t"}
     $2~/-n$/ {
               size=split($2,a,"-");
               for (i=1; i<=size-2; i++) first=(i>1 ? first"-" : "") a[i];
               second=a[size-1]"-"a[size];
               print $1,first,second,$3;
               first=second=""
              }' file

which returns

$ awk 'BEGIN{OFS=FS="\t"} $2~/-n$/ {size=split($2,a,"-"); for (i=1; i<=size-2; i++) first=(i>1 ? first"-" : "") a[i]; second=a[size-1]"-"a[size]; print $1,first,second,$3; first=second=""}' file
abs     nmod+n+n        commitment-n    349.200023
abs     nmod+n+n-a      commitment-n    333.306429
abs     nmod+n+ns       commitment-n    182.085018
abs     nmod+n+n        pledge-n        149.927391
abs     nmod+n+ns       reagent-n       142.347358

Explanation

BEGIN{OFS=FS="\t"} sets tab as the input and output field separator.
$2~/-n$/ {} matches lines in which the 2nd field ends with "-n" and runs the commands within {}.
size=split($2,a,"-") cuts the 2nd field into pieces on the - delimiter and saves them in the a[] array. The number of pieces is stored in the size variable.
The for loop and second=a[size-1]"-"a[size] save the data in two blocks: first gets everything before the last two pieces, and second gets the last two pieces.
print $1,first,second,$3 prints everything.
first=second="" resets the variables for the next line.
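
To see what split() actually leaves in a[], here is a small sketch that runs it on one sample value from the question and prints each piece with its index. The shell variable name pieces is only for illustration.

```shell
# Split one second-column value on "-" and list the resulting pieces.
pieces=$(printf 'nmod+n+n-a-commitment-n\n' |
  awk '{
    size = split($0, a, "-")          # a[1]..a[size] hold the pieces
    for (i = 1; i <= size; i++)
      print i, a[i]
  }')
printf '%s\n' "$pieces"
```

Here size is 4, so the loop over a[1]..a[size-2] collects nmod+n+n and a, while a[3] and a[4] are joined into commitment-n.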

Answer 2

Give this one-liner a try (it needs gawk, because gensub() is a gawk extension):

awk -F'\t' -v OFS='\t' '$2~/-n$/{$2=gensub(/-([^-]*-n$)/,"\t\\1","g",$2);print}' file

Output with your file (saved as f):

kent$  awk -F'\t' -v OFS='\t' '$2~/-n$/{$2=gensub(/-([^-]*-n$)/,"\t\\1","g",$2);print}' f
abs     nmod+n+n        commitment-n    349.200023
abs     nmod+n+n-a      commitment-n    333.306429
abs     nmod+n+ns       commitment-n    182.085018
abs     nmod+n+n        pledge-n        149.927391
abs     nmod+n+ns       reagent-n       142.347358
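
If gawk is not available, the same regex idea can probably be expressed with match() and substr(), which are part of POSIX awk. This is only a sketch under that assumption, not a drop-in replacement for the one-liner above; the names sample, result, head and tail are made up for the example, and the temporary file just holds three rows from the question.

```shell
# Sample rows from the question, written tab-separated to a temporary file.
sample=$(mktemp)
printf 'abs\tnmod+n+n-commitment-n\t349.200023\n' > "$sample"
printf 'abs\tnmod+n+n-a-commitment-n\t333.306429\n' >> "$sample"
printf 'abs\tinto+ns-j+vn-pass-rb-divide-v\t295.57316\n' >> "$sample"

result=$(awk 'BEGIN { FS = OFS = "\t" }
  # match() sets RSTART to where "-<word>-n" begins at the end of field 2;
  # rows without such a suffix return 0 and are skipped.
  match($2, /-[^-]*-n$/) {
    head = substr($2, 1, RSTART - 1)   # everything before that dash
    tail = substr($2, RSTART + 1)      # "<word>-n" without the leading dash
    print $1, head, tail, $3
  }' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Because match() doubles as the filter, no separate $2~/-n$/ test is needed, and no loop runs per line.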

Answer 3

Using sed:

sed -r -n '/-n\t[0-9.]*$/{s/(\S+)\t(.*)-([^-]+-\S+)\t(.*)/\1\t\2\t\3\t\4/p}' filename

For your input, it'd produce:

abs nmod+n+n    commitment-n    349.200023
abs nmod+n+n-a  commitment-n    333.306429
abs nmod+n+ns   commitment-n    182.085018
abs nmod+n+n    pledge-n    149.927391
abs nmod+n+ns   reagent-n   142.347358
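
The one-liner above leans on GNU sed conveniences (-r, \S, \t), which other sed implementations may not accept. A more portable sketch, assuming a sed that supports -E, passes a literal tab in through a shell variable instead. The names tab and converted are made up for the example, and the two input rows are taken from the question.

```shell
tab=$(printf '\t')   # literal tab, so the script does not rely on \t
converted=$(printf 'abs\tnmod+n+ns-reagent-n\t142.347358\nabs\tinto+ns-j+vn-pass-rb-divide-v\t295.57316\n' |
  sed -E -n "s/^([^${tab}]*)${tab}(.*)-([^-${tab}]+-n)${tab}([^${tab}]*)\$/\1${tab}\2${tab}\3${tab}\4/p")
printf '%s\n' "$converted"
```

Only the row whose second column ends with -n is printed, with that column split before its last two pieces.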

Answer 1: 3, ACCEPTED, 2013-12-06 11:40:46
Answer 2: 3, 2013-12-06 11:49:01
Answer 3: 1, 2013-12-06 12:02:56
