
Bash loop only read the last line

I am having trouble extracting the data after the colons from multiple lines using a while loop and awk.

This is my data structure:

Identifiers:BioSample:SAMD00019077
Identifiers:BioSample:SAMD00019076
Identifiers:BioSample:SAMD00019075
Identifiers:BioSample:SAMD00019074
Identifiers:BioSample:SAMD00019073
Identifiers:BioSample:SAMD00019072
Identifiers:BioSample:SAMD00019071;SRA:DRS051563
Identifiers:BioSample:SAMD00019070;SRA:DRS051562
Identifiers:BioSample:SAMD00019069;SRA:DRS051561
...
Identifiers:BioSample:SAMD00019005;SRA:DRS051497
Identifiers:BioSample:SAMD00015713;SRA:DRS012785

What I want to get is the BioSample ID, e.g. SAMD00019077.

Scripts I tried:

  1. while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
  2. for line in `cat 1.tmp` ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done
  3. for line in `cat 1.tmp` ; do echo $line | awk -F: '{print $3 > "1.tmp2"}' ; done

They only gave the BioSample ID of the last line:

$ while read line ; do echo $line | 
  awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
$ head 1.tmp2
SAMD00015713;SRA

I read the posts below, and it looks like my problem has something to do with stdin, stdout and stderr.

bash read loop only reading first line of input variable

bash while loop read only one line

The solution I tried also produced only one line of output:

$ exec 3<&1
$ exec 1<&2
$ while read line ; do echo $line |  
  awk -F':' '{print $3}' > 1.tmp2 ; done< 1.tmp
$ head 1.tmp2
SAMD00015713;SRA
$ exec 1<&3 3<&-

I also tried exec < 1.tmp to redirect the file to stdin, but it led to an error.

I found these scripts worked very well for me. But I really want to know why the scripts I tried above fail.

cat 1.tmp | awk -F: '{print $3}' | head

awk -F: '{print $3}' 1.tmp | head

Since you are iterating over every line of 1.tmp, redirect the output in append mode with >> 1.tmp2 instead of > 1.tmp2, which keeps replacing the previous entry.

First of all, awk has the ability to loop through lines and the field separator can be a regex.

So, your script can be reduced to this optimized format:

awk -F'[;:]' '{print $3}' 1.tmp > 1.tmp2

Because the separator matches either ; or :, the third field is the bare BioSample ID, without the trailing ;SRA.

Having said that, you might want to know what was wrong in your script.

while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
                                                         ^ here

The > marked above is the redirection operator. It writes the stdout of the command (awk in this case) to the specified file. It does not append; it truncates the file and overwrites it. So, in every iteration of the loop, the file is cleared and the output of that iteration's awk is written to it. Hence only the last entry is left.
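The truncation is easy to reproduce in isolation. Below is a minimal sketch, assuming a three-line input; the file names demo.in and demo.out are hypothetical and used only for this illustration.

```shell
# demo.in and demo.out are hypothetical names, not files from the question.
printf '%s\n' 'Identifiers:BioSample:SAMD00019077' \
              'Identifiers:BioSample:SAMD00019076' \
              'Identifiers:BioSample:SAMD00019075' > demo.in

# '>' inside the loop: demo.out is reopened and truncated on every iteration.
while IFS= read -r line; do
    echo "$line" | awk -F: '{print $3}' > demo.out
done < demo.in

# Only the ID from the last input line is left in the file.
cat demo.out    # SAMD00019075
```

The loop itself visits all three lines; it is only the output file that forgets the earlier ones.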

To fix that, you can use the append redirection: >> .

while read line ; do echo $line | awk -F':' '{print $3}' >> 1.tmp2 ; done < 1.tmp

Now, there is a caveat. What if the file is not originally empty? This loop will append to the file, without clearing the file first. To fix that, you can first clear the file with:

>1.tmp2; while read line ; do echo $line | awk -F':' '{print $3}' >> 1.tmp2 ; done < 1.tmp

However, if we are sure that all the stdout produced by the loop needs to go into the file, we can simply move the redirection out of the loop. That way, the shell opens the output file once for the whole loop instead of opening and closing it on every iteration.

while read line ; do echo $line | awk -F':' '{print $3}'; done < 1.tmp > 1.tmp2

Note that these options are unoptimized, but would still work. The optimized option would be to let awk itself do the line-by-line processing as mentioned in the first snippet in the answer.
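To check that the loop with the redirection moved outside and the single awk call really produce the same result, a sketch like the following can be used. The file names sample.in, loop.out and awk.out are assumptions made for this example.

```shell
# sample.in, loop.out and awk.out are hypothetical names for illustration.
printf '%s\n' 'Identifiers:BioSample:SAMD00019072' \
              'Identifiers:BioSample:SAMD00019071;SRA:DRS051563' > sample.in

# Variant 1: shell loop, output file opened once for the whole loop.
while IFS= read -r line; do
    echo "$line" | awk -F'[;:]' '{print $3}'
done < sample.in > loop.out

# Variant 2: awk reads the file itself, no shell loop.
awk -F'[;:]' '{print $3}' sample.in > awk.out

# cmp -s is silent and exits 0 only when both files are byte-for-byte identical.
cmp -s loop.out awk.out && echo "identical"
```

The loop starts one awk process per input line, while the second variant starts a single process for the whole file, which is where the difference in cost comes from.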

I took your lines and put them in a file called "tmp".

Here is the command :

awk -F"[:;]" '{print $3}' tmp

The result is :

SAMD00019077
SAMD00019076
SAMD00019075
SAMD00019074
SAMD00019073
SAMD00019072
SAMD00019071
SAMD00019070
SAMD00019069
SAMD00019005

The "[:;]" part is a regex (a bracket expression) that makes either : or ; act as a field delimiter.
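To see how a line is split with this separator, each field can be printed next to its index. This is only an illustrative sketch; the variable name fields is an assumption.

```shell
# Split one sample line on ':' or ';' and list every field with its number.
fields=$(echo 'Identifiers:BioSample:SAMD00019071;SRA:DRS051563' |
    awk -F'[:;]' '{ for (i = 1; i <= NF; i++) print i ": " $i }')
echo "$fields"
# 1: Identifiers
# 2: BioSample
# 3: SAMD00019071
# 4: SRA
# 5: DRS051563
```

Field 3 is the BioSample ID on its own, which is why printing $3 is enough.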

EDIT: if you want to do it in a while loop, here is the trick:

while read line; do echo $line | awk -F"[:;]" '{print $3}';done < <(cat tmp)

It seems the loop itself is working fine, but only the last element ends up in the file. > redirects output to a file, and every time it is used it empties the file and wipes out the previous data. >> appends the data to the end of the file instead.

If you are using awk within a loop -- you are most likely using it wrong. awk reads each line and acts on it by applying the rules you specify. Calling it in a loop is almost never required. Your awk statement:

awk -F: '{print $3}' 1.tmp
  • uses -F: to specify that the internal awk variable FS (field separator) is set to the ':' character, so your fields will be what is separated by ':' .
  • '{print $3}' is an awk rule. (what is within {...} ) You can have as many rules as you like. Here print $3 simply prints the 3rd field.
  • 1.tmp is obviously your input file (you can specify as many input files as needed).

You then pipe to head which displays the first 10 lines (default).

The only thing that is unclear from your question is whether you want to capture the 3rd field in a separate file (you include 1.tmp2 in some of the things you tried). If you do, you can redirect to the file within the awk rule itself, e.g.

awk -F: '{print $3 > "1.tmp2"}' 1.tmp

Now you have the 3rd field captured in 1.tmp2 and if you want to check, you can use head 1.tmp2 .

However, on some lines your 3rd field contains additional characters after the BioSample ID, e.g. ;SRA. If those are unwanted, you will need to remove them, leaving only the BioSample ID. awk has a good number of string functions, of which sub can make replacements in fields (or variables) based on a regular expression you provide.
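As a small illustration of sub on its own, here is a sketch applied to a single value rather than the full file. The variable name result is an assumption.

```shell
# sub(regex, replacement) with no third argument edits $0 in place
# and returns the number of substitutions made (0 or 1).
result=$(echo 'SAMD00019071;SRA' | awk '{ n = sub(/;.*/, ""); print n, $0 }')
echo "$result"    # 1 SAMD00019071
```

Passing a third argument such as $3 applies the replacement to that field instead of the whole record.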

In your case using your sample input, eg

$ cat 1.tmp
Identifiers:BioSample:SAMD00019077
Identifiers:BioSample:SAMD00019076
Identifiers:BioSample:SAMD00019075
Identifiers:BioSample:SAMD00019074
Identifiers:BioSample:SAMD00019073
Identifiers:BioSample:SAMD00019072
Identifiers:BioSample:SAMD00019071;SRA:DRS051563
Identifiers:BioSample:SAMD00019070;SRA:DRS051562
Identifiers:BioSample:SAMD00019069;SRA:DRS051561
...
Identifiers:BioSample:SAMD00019005;SRA:DRS051497
Identifiers:BioSample:SAMD00015713;SRA:DRS012785

You could use the following to isolate the BioSample ID without the ';' and what follows it, writing the result to 1.tmp2. The check on the number of fields skips the "..." line:

$ awk -F: 'NF >= 3 {sub(/;.*/,"",$3); print $3 > "1.tmp2"}' 1.tmp

(note: adding NF >= 3 before your rule ensures that only lines where NF (the number of fields) is greater than or equal to 3 are processed by the rule)
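To see what the NF check filters on, the field count of each kind of line can be printed. A minimal sketch using three lines from the sample input; the variable name counts is an assumption.

```shell
# Print the number of ':'-separated fields for three representative lines.
counts=$(printf '%s\n' \
    'Identifiers:BioSample:SAMD00019072' \
    '...' \
    'Identifiers:BioSample:SAMD00019071;SRA:DRS051563' |
    awk -F: '{ print NF }')
echo "$counts"
# 3
# 1
# 4
```

The "..." line has a single field, so NF >= 3 skips it, while both real record shapes pass.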

Example Output File

$ cat 1.tmp2
SAMD00019077
SAMD00019076
SAMD00019075
SAMD00019074
SAMD00019073
SAMD00019072
SAMD00019071
SAMD00019070
SAMD00019069
SAMD00019005
SAMD00015713

As others have mentioned using awk 'script' > 1.tmp2 within the loop is causing the output of awk for the current line to overwrite the contents of 1.tmp2 on every iteration of the loop. You can solve that by using >> 1.tmp2 inside the loop or moving the > 1.tmp2 outside of the loop (see below) but the right way to do what you want is just not to use a loop at all and simply do:

awk -F'[:;]' '{print $3}' 1.tmp > 1.tmp2

Just FYI though if you WERE going to use a loop (don't!) then either of these would produce the output you expect:

while IFS= read -r line; do
    echo "$line" | awk -F'[:;]' '{print $3}'
done < 1.tmp > 1.tmp2

while IFS= read -r line; do
    echo "$line" | awk -F'[:;]' '{print $3}' >> 1.tmp2
done < 1.tmp

See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for details on writing read loops in shell.
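For what it is worth, the IFS= and -r in the loops above are there for a reason: a plain read strips leading and trailing whitespace and treats backslashes as escape characters. A small sketch of the difference, assuming a hypothetical one-line file raw.in:

```shell
# raw.in is a hypothetical file holding: two spaces, a, backslash, t, b, two spaces.
printf '  a\\tb  \n' > raw.in

# Plain read: surrounding whitespace is trimmed and the backslash is consumed.
read plain < raw.in
printf '[%s]\n' "$plain"    # [atb]

# IFS= keeps the surrounding whitespace, -r keeps the backslash.
IFS= read -r exact < raw.in
printf '[%s]\n' "$exact"    # [  a\tb  ]
```

For the sample data in the question both forms read the same text, but the IFS= read -r form is the one that returns every line unmodified.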

