linux: extract pattern from file
I have a big tab-delimited .txt file with 4 columns:
col1 col2 col3 col4
name1 1 2 ens|name1,ccds|name2,ref|name3,ref|name4
name2 3 10 ref|name5,ref|name6
... ... ... ...
Now I want to extract from this file everything that starts with 'ref|'. This pattern is only present in col4.
So for this example I would like to have as output:
ref|name3
ref|name4
ref|name5
ref|name6
I thought of using 'sed' for this, but I don't know where to start.
I think awk is better suited for this task:
$ awk '{for (i=1;i<=NF;i++){if ($i ~ /^ref\|/){print $i}}}' FS='[ \t,]' infile
ref|name3
ref|name4
ref|name5
ref|name6
FS='[ \t,]' sets a field separator that matches a comma, a tab or a blank space, so awk iterates over every name as its own field and prints the field when it starts with ref|.
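Since the question states that the pattern only occurs in col4, the loop over all fields can be narrowed to that column. The following is a minimal sketch, assuming the file really is tab-delimited; the function name extract_refs and the array name parts are made up for illustration:

```shell
# Hypothetical helper: print every comma-separated entry of column 4
# that starts with "ref|". Assumes a tab-delimited input file.
extract_refs() {
  awk -F'\t' '{
    n = split($4, parts, ",")      # break column 4 on commas
    for (i = 1; i <= n; i++)
      if (parts[i] ~ /^ref\|/)     # keep entries that begin with ref|
        print parts[i]
  }' "$1"
}
```

Calling extract_refs infile should then print one ref| entry per line, without ever looking at col1 to col3.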
Now I want to extract from this file everything that starts with 'ref|'. This pattern is only present in col4
If you are sure that the pattern is only present in col4, you could use grep:
grep -o 'ref|[^,]*' file
output:
ref|name3
ref|name4
ref|name5
ref|name6
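If you are not sure that 'ref|' never shows up in the other columns, one option is to cut out col4 first and run the same grep on that. This is only a sketch: it assumes tab-delimited input and a grep that supports -o (GNU and BSD grep do, but it is not required by POSIX), and refs_from_col4 is a made-up name:

```shell
# Hypothetical helper: run the grep on column 4 only.
# Assumes tab-delimited input (the default delimiter of cut)
# and a grep implementation that supports -o.
refs_from_col4() {
  cut -f4 "$1" | grep -o 'ref|[^,]*'
}
```

Usage would be refs_from_col4 file.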
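One solution of mine is to first use awk to get the 4th column, then use sed to convert the commas to newlines, and then use grep (or awk again) to get the entries that start with ref: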
awk '{print $4}' < data.txt | sed -e 's/,/\n/g' | grep "^ref"
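Note that \n in the replacement part of s/,/\n/g is understood by GNU sed but not by every sed (BSD sed, for instance, inserts a literal n). A possibly more portable sketch of the same three steps uses cut and tr instead; it assumes tab-delimited input, and refs_portable is a made-up name:

```shell
# Hypothetical helper: same idea as the awk | sed | grep pipeline,
# but without relying on \n in a sed replacement.
refs_portable() {
  cut -f4 "$1" |      # step 1: keep only column 4 (tab-delimited)
    tr ',' '\n' |     # step 2: one entry per line
    grep '^ref|'      # step 3: keep entries starting with ref|
}
```

Usage would be refs_portable data.txt.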
This might work for you (GNU sed):
sed 's/\(ref|[^,]*\),/\n\1\n/;/^ref/P;D' file
Surround the required strings with newlines and print only those lines that begin with the required string.