
linux: extract pattern from file

I have a large tab-delimited .txt file with 4 columns:

col1    col2    col3    col4
name1   1       2       ens|name1,ccds|name2,ref|name3,ref|name4
name2   3       10      ref|name5,ref|name6
...     ...     ...     ...

Now I want to extract from this file everything that starts with 'ref|'. This pattern is only present in col4.

So for this example I would like to have as output:

ref|name3
ref|name4
ref|name5
ref|name6

I thought of using 'sed' for this, but I don't know where to start.

I think awk is better suited for this task:

$ awk '{for (i=1;i<=NF;i++){if ($i ~ /^ref\|/){print $i}}}' FS='[ \t,]+' infile
ref|name3
ref|name4
ref|name5
ref|name6

FS='[ \t,]+' sets a regular-expression field separator, so awk splits each line on commas, tabs, and spaces. The loop then iterates over the resulting fields and prints each one that starts with the ref| pattern.
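If the match should be limited to the fourth column rather than applied to every field, one possible variation is to split on tabs only and then break col4 apart with split(). This is a minimal sketch, assuming the file really is tab-delimited and has a header row; the sample.tsv file name and the extract_refs function name are made up for illustration.

```shell
#!/bin/sh
# Build a small tab-delimited sample (hypothetical file name).
printf 'col1\tcol2\tcol3\tcol4\n' > sample.tsv
printf 'name1\t1\t2\tens|name1,ccds|name2,ref|name3,ref|name4\n' >> sample.tsv
printf 'name2\t3\t10\tref|name5,ref|name6\n' >> sample.tsv

# Print every comma-separated entry of column 4 that starts with "ref|".
# NR > 1 skips the header row (an assumption about the input).
extract_refs() {
    awk -F'\t' 'NR > 1 {
        n = split($4, parts, ",")
        for (i = 1; i <= n; i++)
            if (parts[i] ~ /^ref\|/)
                print parts[i]
    }' "$1"
}

extract_refs sample.tsv
```

Because only $4 is examined, a stray ref| in another column would not be reported.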

Now I want to extract from this file everything that starts with 'ref|'. This pattern is only present in col4.

If you are sure that the pattern is only present in col4, you could use grep:

grep -o 'ref|[^,]*' file

Output:

ref|name3
ref|name4
ref|name5
ref|name6
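If 'ref|' might also show up in another column, a hedged variation is to isolate column 4 with cut before running grep. The sketch below assumes tab-delimited input and a grep that supports -o (GNU and BSD grep do; it is not required by POSIX). The sample2.tsv file and the refs_in_col4 function are hypothetical names used only for this illustration.

```shell
#!/bin/sh
# Sample line whose first column happens to contain "ref|" (made-up data).
printf 'ref|x\t1\t2\tens|name1,ref|name3\n' > sample2.tsv

# Keep only column 4 (cut splits on tabs by default), then let grep -o
# print each match of "ref|" followed by any run of non-comma characters.
# In a basic regular expression the "|" is a literal character.
refs_in_col4() {
    cut -f4 "$1" | grep -o 'ref|[^,]*'
}

refs_in_col4 sample2.tsv
```

Without the cut step, the first match on this line would begin at the 'ref|x' in column 1 and run up to the first comma, because [^,]* also matches tabs.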

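One solution is to first use awk to get the 4th column, then use sed to convert the commas to newlines, and then use grep (or awk again) to keep the entries that start with ref. The \n in the sed replacement is understood by GNU sed; other sed implementations may not treat it as a newline.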

awk '{print $4}' < data.txt | sed -e 's/,/\n/g' | grep "^ref"
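For systems without GNU sed, the comma-to-newline step can be done with tr instead. This is a sketch under the assumption that the input is tab-delimited; the data_sample.txt file name and the refs_portable function name are invented for the example.

```shell
#!/bin/sh
# Small tab-delimited sample (hypothetical file name).
printf 'col1\tcol2\tcol3\tcol4\n' > data_sample.txt
printf 'name1\t1\t2\tens|name1,ccds|name2,ref|name3,ref|name4\n' >> data_sample.txt
printf 'name2\t3\t10\tref|name5,ref|name6\n' >> data_sample.txt

# 1. awk prints column 4, splitting on tabs only.
# 2. tr turns every comma into a newline, one entry per line.
# 3. grep keeps the lines that begin with "ref|" (the header "col4" is dropped).
refs_portable() {
    awk -F'\t' '{print $4}' "$1" | tr ',' '\n' | grep '^ref|'
}

refs_portable data_sample.txt
```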

This might work for you (GNU sed):

sed 's/\(ref|[^,]*\),/\n\1\n/;/^ref/P;D' file

The substitution surrounds the required string with newlines. P then prints the first line of the pattern space only when it begins with ref, and D deletes up to the first newline and restarts the cycle on whatever remains.
