Separating a number within a column in awk or grep

I am writing a script that processes files that are alike in every column except the first one, $1. Which lines I redirect to another file depends on the value of the third column, and I know how to filter on that. What I want to redirect, though, is only the "id" number inside the first column, not the whole first column:

Different input file formats:

1452123_s_at        0.45609 1.55e-04    7.85    -2.89   2.657
145243_s_at         0.35709 1.46e-04    7.7     -2.9    2.713

Xl.15267.1.A1_at    0.45609 1.79e-04    7.66    -2.9    2.21
Xl.14257.1.A1_at    0.76509 1.67e-04    7.85    -2.87   2.23

160919_r_at         0.45609 1.83e-04    -7.63   -2.9    -2.888
145916_r_at         0.41869 3.82e-04    -7.56   -2.8    -2.798

162334_r_at         0.51869 2.49e-04    -7.24   -2.93   -2.095
15356_r_at          0.68229 1.79e-04    -7.45   -2.88   -2.5

160365_at           0.68223 3.82e-04    -6.72   -2.98   -1.795
16345_at            0.45623 2.94e-04    -5.99   -2.45   -1.568

26768               0.51869 1.83e-04    7.66    -2.9     2.21
30075               0.67749 1.46e-04    7.45    -2.89    2.34

Desired output:

1452123     1.55e-04    
145243      1.46e-04    
15267       1.79e-04    
14257       1.67e-04    
160919      1.83e-04    
145916      3.82e-04    
162334      2.49e-04    
15356       1.79e-04    
160365      3.82e-04    
16345       2.94e-04    
26768       1.83e-04    
30075       1.46e-04    

This number can be pretty much anything between 1 and 10,000,000, and the structure of the first column can vary a bit more than in this example, but it will always contain this number somewhere. Is there a way to write something general enough to recognize and print only this number? Maybe by using split or an if somehow?

It doesn't matter which program is used (awk, grep or sed); I'm just looking for the most efficient way of doing it. I'm also pretty new to the command line, so please explain the different commands plainly! Thanks

Just use gsub() to remove the non-numeric characters and then print:

awk 'NF{gsub(/[^0-9]/,"",$1); print $1, $3}' file

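It returns the following. Note that every digit in the first field is kept, so Xl.15267.1.A1_at becomes 1526711 rather than 15267: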

1452123 1.55e-04
145243 1.46e-04
1526711 1.79e-04
1425711 1.67e-04
160919 1.83e-04
145916 3.82e-04
162334 2.49e-04
15356 1.79e-04
160365 3.82e-04
16345 2.94e-04
26768 1.83e-04
30075 1.46e-04

Explanation

  • NF: run the commands in {} only if NF (the number of fields) is non-zero, that is, if the line is not empty.
  • gsub(/[^0-9]/,"",$1): in the 1st field, remove every character that is not in the range 0-9, that is, every non-digit.
  • print $1, $3: print the 1st and 3rd fields.
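If only the first run of digits should be kept (as in the desired output of the question), one possible variant uses match() with the standard RSTART and RLENGTH variables instead of gsub(). This is a minimal sketch, not part of the answer above: the function name first_digits is made up for illustration, and it assumes the id is always the first group of digits in column 1, which holds for the sample data but would fail for a prefix that itself contains a digit.

```shell
# Hypothetical helper: print the first run of digits in column 1, then column 3.
# Reads the files given as arguments, or standard input if there are none.
first_digits() {
    awk 'NF && match($1, /[0-9]+/) {
        # match() sets RSTART/RLENGTH to the start and length of the
        # leftmost-longest run of digits in the first field
        print substr($1, RSTART, RLENGTH), $3
    }' "$@"
}

# Usage with three of the sample lines from the question
printf '%s\n' \
    '1452123_s_at 0.45609 1.55e-04 7.85 -2.89 2.657' \
    'Xl.15267.1.A1_at 0.45609 1.79e-04 7.66 -2.9 2.21' \
    '26768 0.51869 1.83e-04 7.66 -2.9 2.21' | first_digits
```

Because match() returns 0 when there are no digits at all, such lines are skipped rather than printed with an empty id.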

I would split on the non-digit characters and pick the largest number that is left. Below is my implementation, with thanks to @fedorqui for the NF trick. The +0 forces a numeric comparison even when a piece is empty:

awk 'NF{n=split($1,a,/[^0-9]+/); v=a[1]; for(i=2; i<=n; i++) { if (v+0 < a[i]+0) v=a[i] } print v, $3}' file
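To see what the split() call hands to the loop, the pieces can be printed one by one. This is only an illustrative sketch; the function name show_pieces is hypothetical and not part of the answer.

```shell
# Hypothetical helper: show every piece split() produces for column 1
# when the separator is "one or more non-digits".
show_pieces() {
    awk '{
        n = split($1, a, /[^0-9]+/)
        for (i = 1; i <= n; i++)
            printf "a[%d]=\"%s\"\n", i, a[i]
    }' "$@"
}

printf '%s\n' 'Xl.15267.1.A1_at' | show_pieces
```

For Xl.15267.1.A1_at the pieces are an empty string, 15267, 1, 1 and another empty string, because the field both starts and ends with non-digits. Taking the largest piece works for the sample data, but it is a heuristic: it would pick the wrong number if the id were smaller than some other number in the field.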

With GNU awk for gensub():

$ awk 'NF{ print gensub(/([^._]*[._])?([[:digit:]]+).*/,"\\2","",$1), $3 }' file
1452123 1.55e-04
145243 1.46e-04
15267 1.79e-04
14257 1.67e-04
160919 1.83e-04
145916 3.82e-04
162334 2.49e-04
15356 1.79e-04
160365 3.82e-04
16345 2.94e-04
26768 1.83e-04
30075 1.46e-04

I assume the patterns belong to a finite set, so they can be listed explicitly. To keep it simple I wrote this version (it needs GNU awk, because it uses the three-argument form of match()):

awk '
NF && ( match($1,/^([0-9]+)((_[rs])?_at)?$/,a) ||
match($1,/^Xl\.([0-9]+)\.1\.A1_at$/,a) ) {
    printf("%-12s%-s\n", substr($1, a[1,"start"], a[1,"length"]), $3)
}
' inputfile

The first match() covers four patterns: <NUM>_s_at, <NUM>_r_at, <NUM>_at and <NUM>. The second covers Xl.<NUM>.1.A1_at. The script then cuts out the matched number and formats the output.
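The same whitelist idea can be sketched in portable awk, using RSTART and RLENGTH instead of the array argument to match(). The sketch below also reports ids that fit none of the listed patterns instead of dropping them silently. The function name known_ids and the warning text are assumptions made for illustration; writing to "/dev/stderr" is handled by GNU awk and otherwise relies on the system providing that device file.

```shell
# Hypothetical helper: accept only the listed id layouts, warn about the rest.
known_ids() {
    awk '
    NF {
        id = $1
        if (id ~ /^[0-9]+((_[rs])?_at)?$/ || id ~ /^Xl\.[0-9]+\.1\.A1_at$/) {
            # In both accepted layouts the id is the first run of digits
            match(id, /[0-9]+/)
            printf "%-12s%s\n", substr(id, RSTART, RLENGTH), $3
        } else {
            # Unknown layout: report it on standard error, print nothing
            print "unrecognized id: " id > "/dev/stderr"
        }
    }' "$@"
}

printf '%s\n' \
    '160919_r_at 0.45609 1.83e-04 -7.63 -2.9 -2.888' \
    'Xl.14257.1.A1_at 0.76509 1.67e-04 7.85 -2.87 2.23' \
    'foo_12 0.1 9.99e-01 0 0 0' | known_ids
```

The third input line is invented to trigger the warning. A new layout is supported by adding one more pattern to the if condition.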

Output:

1452123     1.55e-04
145243      1.46e-04
15267       1.79e-04
14257       1.67e-04
160919      1.83e-04
145916      3.82e-04
162334      2.49e-04
15356       1.79e-04
160365      3.82e-04
16345       2.94e-04
26768       1.83e-04
30075       1.46e-04
