I have over 14000 fasta files, and I want to keep only the ones containing 5 sequences. I know I can use the following bash command to obtain the number of sequences in a single fasta file:
grep -c "^>" filename.fasta
So my approach was to write the filename and the sequence count of each file to a text file, which I could then use to isolate only the files I want. To run the grep command on so many files, I am using subprocess.call:
import subprocess
import os
with open("five_seqs.txt", "w") as f:
    for file in os.listdir("/Users/vivaksoni1/Downloads/DA_CDS/fasta_files"):
        f.write(file),
        subprocess.call(["grep", "-c", "^>", file], stdout = f)
Part of my problem is that the grep pattern is "^>", but subprocess requires each argument to have its own quotation marks. How can I pass "^>" when I would essentially be entering ""^>"" as an argument?
Also, do I have to add f.write("\n") after f.write(file)? Currently my output is just a text file with every entry run together on one line, and the subprocess call just prints each file name to the terminal and reports that the file was not found, like this:
grep: MZ23900789.fasta: No such file or directory
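Try the following code; it should work for your example. It writes the filename, a tab separator, and the number of sequences (i.e. lines starting with >). grep needs the directory prepended to each name, because os.listdir returns bare file names; that is why you got "No such file or directory". The "^>" string needs no extra quoting, because subprocess passes each list element to grep directly without going through a shell. Using Popen and communicate gives more flexibility in handling the output. Tested on Ubuntu.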
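import subprocess
import os

fasta_dir = "/Users/vivaksoni1/Downloads/DA_CDS/fasta_files/"
with open("five_seqs.txt", "w") as f:
    for file in os.listdir(fasta_dir):
        # grep needs the full path, because os.listdir returns bare file names
        grep = subprocess.Popen(["grep", "-c", "^>", os.path.join(fasta_dir, file)],
                                stdout=subprocess.PIPE, universal_newlines=True)
        out, err = grep.communicate()
        # grep -c output already ends with a newline, so strip it before writing
        f.write(file + '\t' + out.strip() + '\n')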
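If you would rather not start one grep process per file, the same count can be done in plain Python. The following is only a sketch, not a tested replacement for the code above: the function names count_sequences and files_with_n_sequences are made up for illustration, and the filter on the ".fasta" extension is an assumption about how your files are named.

```python
import os

def count_sequences(path):
    """Count FASTA records by counting the lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

def files_with_n_sequences(fasta_dir, n=5):
    """Return the sorted names of files in fasta_dir holding exactly n records."""
    return sorted(
        name for name in os.listdir(fasta_dir)
        # Assumption: every file of interest ends in ".fasta"
        if name.endswith(".fasta")
        and count_sequences(os.path.join(fasta_dir, name)) == n
    )
```

Calling files_with_n_sequences(fasta_dir) with the directory from the question should give the list of files containing exactly 5 sequences directly, so the intermediate five_seqs.txt file is not needed.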