
Extract specific columns from file with shell script

I have a huge text file from which I want to extract specific columns. I can do this in Python, but as the file has ~1.2 billion lines, this is way too slow. The file looks like this (one line shown):

chr1    9734    10486   ID=SRX502813;Name=Input%20control%20(@%20IMR-90);Title=GSM1358818:%20HIRA%20OIS%20Control%20input%20DNA%3B%20Homo%20sapiens%3B%20ChIP-Seq;Cell%20group=Lung;<br>source_name=Fibroblasts;cell%20line=IMR90;chip%20antibody=none; 1000    .   9734    10486   255,0,0

Now, I would like to extract the first three columns and the ID, which is part of the fourth column:

chr1    9734    10486   SRX502813

I can extract the first three columns with the following code, but I can't get the splitting of the substring in the 4th column to work:

#!/usr/bin/bash
# -*- coding: None -*-
end_of_file=0
while [[ $end_of_file == 0 ]]; do
  read -r line
  end_of_file=$?
  grep SRX* | cut -f 1-3 >> out_file.txt
done < "$1"

Possibly someone can provide a hint on how to solve this problem? Thanks a lot!

Assuming the 4th column always starts with ID= followed by the ID and then a ; and assuming the first 3 columns don't contain ID= , you can use sed like this:

sed 's/ID=\([^;]*\);.*/\1/' inputfile

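This captures the sequence of characters other than ; that follows ID= and replaces everything from ID= to the end of the line with the captured ID, so the trailing columns are dropped. In case the ID is not always terminated with a ; you can use an alternative pattern that matches a sequence of alphanumeric characters instead: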

sed 's/ID=\([[:alnum:]]*\).*/\1/' inputfile

When I create a file inputfile that contains exactly the line from the question, I get the result

chr1    9734    10486   SRX502813

In case you want to extract only the lines that contain ID=SRX, you can combine this with fgrep (equivalent to grep -F):

fgrep 'ID=SRX' inputfile | sed 's/ID=\([^;]*\);.*/\1/'
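
As a variation on the pipeline above, the filtering and the substitution can probably be done in a single sed pass, which avoids one extra process on a file this large. This is only a sketch: the function name extract_srx_ids is made up for illustration, and it assumes the ID always starts with SRX and is terminated by a ; as in the sample line.

```shell
# Hypothetical helper (name is illustrative, not from the original answer).
# Prints only lines containing ID=SRX..., with everything from "ID=" to the
# end of the line replaced by the bare ID.
extract_srx_ids() {
  # -n suppresses sed's default output; the trailing "p" flag prints a line
  # only when the substitution actually matched on it.
  sed -n 's/ID=\(SRX[^;]*\);.*/\1/p' "$@"
}

# Demo on two made-up, tab-separated lines; the second has no ID=SRX tag
# and is therefore dropped.
printf 'chr1\t9734\t10486\tID=SRX502813;Name=x;\t1000\t.\t9734\t10486\t255,0,0\nchr1\t1\t2\tID=ERX1;Name=y;\t1000\n' \
  | extract_srx_ids
```

With a real file, the call would be extract_srx_ids inputfile > out_file.txt. Note that this matches the leftmost ID=SRX on each line, which is the same as the fgrep version as long as ID= occurs only once per line.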

Using awk:

awk -F';' '{sub(/ID=/,"");print $1}' inputfile
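
The one-liner above relies on the ID being the first thing before the first ; on the line. A slightly more explicit sketch, assuming the columns are tab-separated (which the use of cut -f in the question suggests) and the 4th column starts with ID=, splits on tabs and handles the 4th column on its own. The function name extract_cols and the array name attrs are made up for illustration.

```shell
# Hypothetical helper (name is illustrative): print columns 1-3 plus the ID
# taken from the first ";"-separated attribute of column 4.
extract_cols() {
  awk -F'\t' -v OFS='\t' '{
    split($4, attrs, ";")        # attrs[1] is assumed to be "ID=..."
    sub(/^ID=/, "", attrs[1])    # strip the "ID=" prefix
    print $1, $2, $3, attrs[1]
  }' "$@"
}

# Demo on one made-up, tab-separated line.
printf 'chr1\t9734\t10486\tID=SRX502813;Name=x;\t1000\t.\t9734\t10486\t255,0,0\n' \
  | extract_cols
```

With a real file, the call would be extract_cols inputfile > out_file.txt. Splitting on tabs keeps the other columns addressable by number, so it is easy to print additional fields later if needed.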
