I have a huge text file from which I want to extract specific columns. I can do this in Python, but as the file has ~1.2 billion lines, it is way too slow. The file looks like this (one line shown):
chr1 9734 10486 ID=SRX502813;Name=Input%20control%20(@%20IMR-90);Title=GSM1358818:%20HIRA%20OIS%20Control%20input%20DNA%3B%20Homo%20sapiens%3B%20ChIP-Seq;Cell%20group=Lung;source_name=Fibroblasts;cell%20line=IMR90;chip%20antibody=none; 1000 . 9734 10486 255,0,0
Now, I would like to extract the first three columns and the ID, which is part of the fourth column:
chr1 9734 10486 SRX502813
I can extract the first three columns with the following code, but I can't get the splitting of the substring in the fourth column to work:
#!/usr/bin/bash
# Read the input file line by line and keep the first three
# tab-separated fields of every line mentioning SRX.
while IFS= read -r line; do
    printf '%s\n' "$line" | grep 'SRX' | cut -f 1-3 >> out_file.txt
done < "$1"
Could someone provide a hint on how to solve this problem? Thanks a lot!
Assuming the 4th column always starts with ID=, followed by the ID, followed by ;, and assuming the first 3 columns don't contain ID=, you can use sed like this:
sed 's/ID=\([^;]*\);.*/\1/' inputfile
This captures a sequence of characters other than ; after ID=. In case the ID is not always terminated with a ;, you can use an alternative pattern that matches a sequence of alphanumeric characters instead:
sed 's/ID=\([[:alnum:]]*\).*/\1/' inputfile
When I create a file inputfile that contains exactly the line from the question, I get this result:
chr1 9734 10486 SRX502813
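That check can be reproduced from the shell. The sample line below is the one from the question with the long attribute list abbreviated for readability; the field separator is assumed to be a tab:

```shell
# Rebuild the sample line with tab-separated fields (attributes abbreviated)
line=$(printf 'chr1\t9734\t10486\tID=SRX502813;Name=Input\t1000\t.\t9734\t10486\t255,0,0')

# Variant 1: capture everything up to the first ';' after 'ID='
printf '%s\n' "$line" | sed 's/ID=\([^;]*\);.*/\1/'

# Variant 2: capture only alphanumeric characters after 'ID='
printf '%s\n' "$line" | sed 's/ID=\([[:alnum:]]*\).*/\1/'

# Both commands print the first three columns followed by SRX502813
```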
In case you want to extract only the lines that contain ID=SRX, you can combine this with fgrep:
fgrep 'ID=SRX' inputfile | sed 's/ID=\([^;]*\);.*/\1/'
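To see the filtering step in isolation, one can feed in two made-up tab-separated lines, only the first of which carries an ID=SRX attribute:

```shell
# Two test lines; only the first contains ID=SRX
printf 'chr1\t9734\t10486\tID=SRX502813;Name=a\t1000\nchr2\t100\t200\tName=b;foo=bar\t500\n' |
fgrep 'ID=SRX' | sed 's/ID=\([^;]*\);.*/\1/'
# Only the chr1 line passes the filter and has its ID extracted
```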
Use awk:
awk -F';' '{sub(/ID=/,"");print $1}' inputfile
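A quick check of the awk one-liner on an abbreviated, tab-separated version of the sample line:

```shell
# With ';' as the field separator, $1 is everything up to the first ';';
# sub() strips the literal 'ID=' prefix before the field is printed
printf 'chr1\t9734\t10486\tID=SRX502813;Name=Input\t1000\n' |
awk -F';' '{sub(/ID=/,""); print $1}'
# Prints the first three columns followed by SRX502813
```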