I have problems trying to extract the data after the colons from multiple lines using a while loop and awk.
This is my data structure:
Identifiers:BioSample:SAMD00019077
Identifiers:BioSample:SAMD00019076
Identifiers:BioSample:SAMD00019075
Identifiers:BioSample:SAMD00019074
Identifiers:BioSample:SAMD00019073
Identifiers:BioSample:SAMD00019072
Identifiers:BioSample:SAMD00019071;SRA:DRS051563
Identifiers:BioSample:SAMD00019070;SRA:DRS051562
Identifiers:BioSample:SAMD00019069;SRA:DRS051561
...
Identifiers:BioSample:SAMD00019005;SRA:DRS051497
Identifiers:BioSample:SAMD00015713;SRA:DRS012785
What I want to get is the BioSample ID, which looks like SAMD00019077.
Scripts I tried:
while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
for line in `cat 1.tmp` ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done
for line in `cat 1.tmp` ; do echo $line | awk -F: '{print $3 > "1.tmp2"}' ; done
They only gave the BioSample ID of the last line:
$ while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
$ head 1.tmp2
SAMD00015713;SRA
I read the posts here, and it looks like my problem has something to do with stdin, stdout and stderr.
bash read loop only reading first line of input variable
bash while loop read only one line
The solution I tried still gave only one line of output:
$ exec 3<&1
$ exec 1<&2
$ while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
$ head 1.tmp2
SAMD00015713;SRA
$ exec 1<&3 3<&-
I also tried exec < 1.tmp to redirect the file to stdin, but it led to an error.
I found that the following commands work very well for me, but I really want to know why the scripts I tried above fail:
cat 1.tmp | awk -F: '{print $3}' | head
awk -F: '{print $3}' 1.tmp | head
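Since you are iterating over every line in 1.tmp, redirect the output in append mode with >> 1.tmp2 rather than with > 1.tmp2, which keeps replacing the previous entry.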
First of all, awk can loop through the lines itself, and the field separator can be a regex.
So your script can be reduced to this optimized form:
awk -F'[;:]' '{print $3}' 1.tmp > 1.tmp2
That said, you might want to know what was wrong with your script.
while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
                                                         ^ here
The > marked above is the redirection operator. It writes the stdout of the command (awk in this case) to the specified file. It does not append; it overwrites. So on every iteration of the loop the file is truncated and the output of the command is written to it, which leaves only the last entry.
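The overwrite can be reproduced on a small scale. The following is only a sketch: the scratch directory, the file names in.txt and out.txt, and the three input lines are made up for illustration and are not from the question.

```shell
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)
printf '%s\n' 'a:b:one' 'a:b:two' 'a:b:three' > "$tmpdir/in.txt"

# '>' inside the loop truncates out.txt on every iteration.
while IFS= read -r line; do
    printf '%s\n' "$line" | awk -F: '{print $3}' > "$tmpdir/out.txt"
done < "$tmpdir/in.txt"

# Only the third field of the last input line is left in the file.
overwritten=$(cat "$tmpdir/out.txt")
echo "$overwritten"
rm -rf "$tmpdir"
```

The loop itself visits all three lines; it is the repeated truncation of out.txt that discards the earlier results.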
To fix that, you can use the append redirection, >>:
while read line ; do echo $line | awk -F':' '{print $3}' >> 1.tmp2 ; done < 1.tmp
Now, there is a caveat. What if the file is not originally empty? This loop will append to the file, without clearing the file first. To fix that, you can first clear the file with:
>1.tmp2; while read line ; do echo $line | awk -F':' '{print $3}' >> 1.tmp2 ; done < 1.tmp
However, if we are sure that all the stdout produced by the loop needs to go into the file, you can simply move the redirection out of the loop. That way, the shell opens the file once instead of opening and closing it on every iteration.
while read line ; do echo $line | awk -F':' '{print $3}'; done < 1.tmp > 1.tmp2
Note that these options are unoptimized, but would still work. The optimized option is to let awk itself do the line-by-line processing, as in the first snippet in this answer.
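As a rough check that the variants above are interchangeable, the sketch below runs all three against the same made-up two-line input and compares the resulting files. The scratch directory and file names are placeholders, and the '[;:]' separator is used throughout so that the outputs are directly comparable.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'a:b:one' 'a:b:two;x:y' > "$tmpdir/in.txt"

# 1) Truncate once, then append inside the loop.
: > "$tmpdir/append.txt"
while IFS= read -r line; do
    printf '%s\n' "$line" | awk -F'[;:]' '{print $3}' >> "$tmpdir/append.txt"
done < "$tmpdir/in.txt"

# 2) Redirect the whole loop, so the output file is opened only once.
while IFS= read -r line; do
    printf '%s\n' "$line" | awk -F'[;:]' '{print $3}'
done < "$tmpdir/in.txt" > "$tmpdir/outer.txt"

# 3) No shell loop at all: awk reads the file line by line itself.
awk -F'[;:]' '{print $3}' "$tmpdir/in.txt" > "$tmpdir/plain.txt"

# Compare the three output files byte for byte.
if cmp -s "$tmpdir/append.txt" "$tmpdir/outer.txt" &&
   cmp -s "$tmpdir/outer.txt" "$tmpdir/plain.txt"; then
    same=yes
else
    same=no
fi
echo "identical output: $same"
result=$(cat "$tmpdir/plain.txt")
rm -rf "$tmpdir"
```

The loop variants start one awk process per input line, which is the main reason the single awk call is preferred on larger files.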
I took your lines and put them in a file called "tmp".
Here is the command :
awk -F"[:;]" '{print $3}' tmp
The result is :
SAMD00019077
SAMD00019076
SAMD00019075
SAMD00019074
SAMD00019073
SAMD00019072
SAMD00019071
SAMD00019070
SAMD00019069
SAMD00019005
The "[:;]" part is a regex that defines two delimiters, : or ;.
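To see how the bracket expression splits a line, here is a small sketch that feeds two lines from the question's data to awk on stdin and also prints NF, the number of fields awk found. The variable name ids is just a placeholder.

```shell
# Two sample lines taken from the question's data, passed to awk on stdin.
ids=$(printf '%s\n' \
    'Identifiers:BioSample:SAMD00019077' \
    'Identifiers:BioSample:SAMD00019071;SRA:DRS051563' |
    awk -F'[:;]' '{print $3 " (fields: " NF ")"}')
echo "$ids"
# SAMD00019077 (fields: 3)
# SAMD00019071 (fields: 5)
```

Because ; is now a separator as well, the trailing ;SRA:DRS051563 becomes fields 4 and 5 and no longer contaminates field 3.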
EDIT : if you wanna do it in a while loop, here is the trick :
while read line; do echo $line | awk -F"[:;]" '{print $3}';done < <(cat tmp)
It seems the loop is working fine, but only the last element ends up in the file. > redirects output to a file, and every time it is used it empties the file first, wiping out the previous data. >> appends the data to the end of the file instead.
If you are using awk within a loop -- you are most likely using it wrong. awk reads each line and acts on it by applying the rules you specify. Calling it in a loop is almost never required. Your awk statement:
awk -F: '{print $3}' 1.tmp
-F: specifies that the internal awk variable FS (field separator) is set to the ':' character, so your fields are whatever is separated by ':'. '{print $3}' is an awk rule (the part within {...}). You can have as many rules as you like. Here print $3 simply prints the 3rd field. 1.tmp is obviously your input file (you can specify as many input files as needed). You then pipe to head, which displays the first 10 lines (the default).
The only thing that is not clear from your question is whether you want to capture the 3rd field in a separate file (you include 1.tmp2 in some of the things you tried). If you do, you can redirect to the file within the awk rule itself, e.g.
awk -F: '{print $3 > "1.tmp2"}' 1.tmp
Now you have the 3rd field captured in 1.tmp2, and if you want to check, you can use head 1.tmp2.
However, your 3rd field contains the BioSample ID plus additional characters on some lines, e.g. ;SRA. If those additional characters are unwanted, you will need to remove them, leaving only the BioSample ID. awk has a good number of String Functions, of which sub can make replacements in fields (or variables) based on a regular expression you provide.
In your case, using your sample input, e.g.
$ cat 1.tmp
Identifiers:BioSample:SAMD00019077
Identifiers:BioSample:SAMD00019076
Identifiers:BioSample:SAMD00019075
Identifiers:BioSample:SAMD00019074
Identifiers:BioSample:SAMD00019073
Identifiers:BioSample:SAMD00019072
Identifiers:BioSample:SAMD00019071;SRA:DRS051563
Identifiers:BioSample:SAMD00019070;SRA:DRS051562
Identifiers:BioSample:SAMD00019069;SRA:DRS051561
...
Identifiers:BioSample:SAMD00019005;SRA:DRS051497
Identifiers:BioSample:SAMD00015713;SRA:DRS012785
You could use the following (with a check on the number of fields to skip the "..." line) to isolate the BioSample ID without the ';' and what follows it, writing the result to 1.tmp2:
$ awk -F: 'NF >= 3 {sub(/;.*/,"",$3); print $3 > "1.tmp2"}' 1.tmp
(note: the addition of NF >= 3 before your rule ensures that only lines where NF (the number of fields) is greater than or equal to 3 are processed by the rule)
Example Output File
$ cat 1.tmp2
SAMD00019077
SAMD00019076
SAMD00019075
SAMD00019074
SAMD00019073
SAMD00019072
SAMD00019071
SAMD00019070
SAMD00019069
SAMD00019005
SAMD00015713
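The effect of the NF >= 3 guard can be seen by counting output lines with and without it. This is only a sketch on three lines from the sample input (including the "..." placeholder line); the variable names are made up, and the output goes to stdout rather than to 1.tmp2 to keep it self-contained.

```shell
# Three lines from the sample input, including the "..." placeholder line.
sample=$(printf '%s\n' \
    'Identifiers:BioSample:SAMD00019069;SRA:DRS051561' \
    '...' \
    'Identifiers:BioSample:SAMD00019005;SRA:DRS051497')

# Without the guard, the "..." line has an empty $3 and yields a blank line.
without_check=$(printf '%s\n' "$sample" |
    awk -F: '{sub(/;.*/,"",$3); print $3}' | awk 'END {print NR}')

# With the guard, lines that have fewer than 3 fields are skipped.
with_check=$(printf '%s\n' "$sample" |
    awk -F: 'NF >= 3 {sub(/;.*/,"",$3); print $3}' | awk 'END {print NR}')

echo "lines without NF check: $without_check"
echo "lines with NF check:    $with_check"
```

With real data that has no placeholder lines the guard changes nothing, but it is cheap protection against blank or malformed records.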
As others have mentioned, using awk 'script' > 1.tmp2 within the loop causes the output of awk for the current line to overwrite the contents of 1.tmp2 on every iteration of the loop. You can solve that by using >> 1.tmp2 inside the loop or by moving the > 1.tmp2 outside of the loop (see below), but the right way to do what you want is not to use a loop at all and simply do:
awk -F'[:;]' '{print $3}' 1.tmp > 1.tmp2
Just FYI though if you WERE going to use a loop (don't!) then either of these would produce the output you expect:
while IFS= read -r line; do
echo "$line" | awk -F'[:;]' '{print $3}'
done < 1.tmp > 1.tmp2
while IFS= read -r line; do
echo "$line" | awk -F'[:;]' '{print $3}' >> 1.tmp2
done < 1.tmp
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for details on writing read loops in shell.