
How to number the lines of a file based on the string in the first column in Unix?

I'm trying to assign an ID number to each line depending on the string in the first column, so that identical strings get the same ID number, as follows:

Input file

rs665   XP_011539469.1
rs665   XP_016856394.1
rs980   NP_001284363.1
rs980   XP_016856698.1
rs1115  NP_001191785.1
rs1250  NP_067652.1

Desired output file

1    rs665   XP_011539469.1
1    rs665   XP_016856394.1
2    rs980   NP_001284363.1
2    rs980   XP_016856698.1
3    rs1115  NP_001191785.1
4    rs1250  NP_067652.1

And so on...

I solved it by creating a tab-separated file with the unique strings from the first column and their corresponding NR, then loading that file into an awk array and joining it against the original file to get the numbering I want. However, I would like to do this in one step on the same file. Is that possible in a UNIX environment? Thanks in advance.

The following awk may help you with this:

awk '!a[$1]++{count++}  {print count,$0}'   Input_file

Output will be as follows:

1 rs665   XP_011539469.1
1 rs665   XP_016856394.1
2 rs980   NP_001284363.1
2 rs980   XP_016856698.1
3 rs1115  NP_001191785.1
4 rs1250  NP_067652.1
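For anyone unfamiliar with the `!a[$1]++` idiom, here is the same one-liner spread out with comments. It is only a sketch: the wrapper name `number_lines`, the array name `seen`, and the tab in `OFS` are my own choices for illustration, not part of the answer above.

```shell
# Hypothetical wrapper; reads the files given as arguments, or stdin.
number_lines() {
  awk '
    BEGIN { OFS = "\t" }       # assumption: a tab between the ID and the original line
    !seen[$1]++ { count++ }    # first sighting of this key: move to the next ID
    { print count, $0 }        # prefix every line with the current ID
  ' "$@"
}

# First three rows of the sample input, assumed to be tab-separated.
printf 'rs665\tXP_011539469.1\nrs665\tXP_016856394.1\nrs980\tNP_001284363.1\n' | number_lines
```

Because every line is printed with the current value of `count`, the numbering is only right when equal keys sit on adjacent lines, as they do in the sample input.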

Solution 2: Adding one more solution here. If your Input_file is sorted on the first column, we do not need to create an array as in the solution above:

awk 'prev!=$1 || !prev{count++}  {print count,$0;prev=$1}'   Input_file

If you have no assurance that the symbols are grouped together in consecutive runs, you had better use something like:

awk 'function intern(sym) { if (sym in table)
                              return table[sym]
                            return table[sym] = ++counter }
     { print intern($1), $1, $2 }'

This will work even if the input happens to be:

rs665   XP_011539469.1
rs980   NP_001284363.1
rs665   XP_016856394.1
rs980   XP_016856698.1
rs1115  NP_001191785.1
rs1250  NP_067652.1

Both occurrences of rs665 map to 1 and both occurrences of rs980 map to 2.

This requires memory to hold the table of known symbols.
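To make the difference concrete, here is a small sketch that runs both the running-counter one-liner and the table lookup on interleaved input. The shell function names `number_by_run` and `number_by_table` are made up for this illustration, and the counter version prints `$1, $2` instead of `$0` only so the two outputs line up.

```shell
# Hypothetical wrapper names; both read the files given as arguments, or stdin.
number_by_run() {
  awk '!a[$1]++ { count++ } { print count, $1, $2 }' "$@"
}

number_by_table() {
  awk 'function intern(sym) {
         if (sym in table)
           return table[sym]
         return table[sym] = ++counter
       }
       { print intern($1), $1, $2 }' "$@"
}

interleaved='rs665 XP_011539469.1
rs980 NP_001284363.1
rs665 XP_016856394.1'

printf '%s\n' "$interleaved" | number_by_run     # third line is numbered 2
printf '%s\n' "$interleaved" | number_by_table   # third line is numbered 1
```

The counter version hands out the current counter value, so the second rs665 inherits the ID that was just given to rs980. The table version looks the ID up by key, which is what costs the extra memory.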

awk 'function intern(sym) { if (sym in table && $3 ~/x/)
                              return table[sym]
                            return table[sym] = ++counter}
 { print intern($2"\t"$3"\t"$4"\t"$5"\t"$6), $0 }
        function intern2(sym) { if (sym in table && $3 ~/y/)
                                  return table[sym]
                            return table[sym] = ++counter}
     { print intern2($3"\t"$4), $0 }' "input.tab" > "output.tab";

Based on this answer, I'm trying to do something similar. In this case I would like to number each row in the first column depending on the string in one particular column. For example, if the third column is == "x", number the row using one set of columns as the key, and if it is == "y", number it using a different set of columns. Would it be possible to implement this by adapting the previous script? I have tried adding conditions as above; it runs, but the numbering is not correct. Thanks in advance, and thanks for the previous answer, @Kaz.
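One possible way to approach this, offered as a sketch rather than a tested answer to the exact data: pick the key first, depending on the third column, and then pass that key to a single `intern` function so that all IDs come from one counter. The assumptions here are mine: the file is tab-separated, "x" rows are keyed on columns 2 to 6, "y" rows on columns 3 and 4 (the column sets used in the attempt above), rows with any other value in column 3 are keyed on the whole line, and the wrapper name `number_by_conditional_key` is hypothetical.

```shell
# Hypothetical wrapper; reads the files given as arguments, or stdin.
number_by_conditional_key() {
  awk '
    BEGIN { FS = OFS = "\t" }          # assumption: tab-separated input and output

    # Same lookup table as before: return the known ID or hand out the next one.
    function intern(sym) {
      if (!(sym in table))
        table[sym] = ++counter
      return table[sym]
    }

    {
      # Build the key from a different column set depending on column 3.
      # The leading tag keeps keys from the different branches apart.
      if ($3 == "x")
        key = "x" SUBSEP $2 SUBSEP $3 SUBSEP $4 SUBSEP $5 SUBSEP $6
      else if ($3 == "y")
        key = "y" SUBSEP $3 SUBSEP $4
      else
        key = "other" SUBSEP $0        # assumption: key such rows on the whole line
      print intern(key), $0
    }
  ' "$@"
}

# Made-up rows for illustration only.
printf 'a\tp\tx\t1\t2\t3\nb\tp\tx\t1\t2\t3\nc\tq\ty\t7\t0\t0\n' | number_by_conditional_key
```

Keeping a single function avoids the situation in the attempt above, where two rules each print every line and the column-3 test sits inside the lookup instead of deciding which key to build. If the "x" and "y" groups should be numbered independently, two separate tables and counters would be needed instead.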
