Bash sed with greedy regex
I have a GTF file (a type of TSV) with the following structure:
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene| 13511132.24 244.489 2.7098
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA| 68 26.127 0 0
ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA| 712 493.243 0 0
I would like to remove all the names in the first column except the first one; the names are separated by "|". For example, the first line should be:
ENST00000488147.1 13511132.24 244.489 2.7098
My idea is to replace everything from the first "|" to the first "\t" with "\t", but sed is failing me. This command makes no changes:
sed 's/|*\t/\t/' test.tsv
What am I doing wrong, and is there a better way to do this altogether?
Consider:
sed -re $'s@[|][^\t]*\t@\t@g'
$'...' is a ksh/bash syntax extension that makes the shell expand $'\t' to a literal tab, instead of assuming that you have a sed that (without reference to the standard) treats \t sequences as if they were tabs.
sed -r puts sed in POSIX ERE mode, rather than BRE mode.
[|] matches only the literal | character, regardless of which regex syntax variant is in use.
[^\t]* matches zero or more characters that are not tabs, whereas .* would also match tabs, which wouldn't produce the desired output.
In context, as testable code:
# Print each argument followed by a tab, then end the line.
write_line() {
  printf '%s\t' "$@" && printf '\n';
}
# Recreate the three sample lines from the question.
generate_input() {
  write_line 'ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|' 13511132.24 244.489 2.7098
  write_line 'ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|' 68 26.127 0 0
  write_line 'ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA|' 712 493.243 0 0
}
generate_input | sed -re $'s@[|][^\t]*\t@\t@g'
...produces as output:
ENST00000488147.1 13511132.24 244.489 2.7098
ENST00000619216.1 68 26.127 0 0
ENST00000473358.1 712 493.243 0 0
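To make the difference between the three patterns concrete, here is a minimal sketch that runs each of them against one made-up line. The sample data and the function names original, greedy and bounded are illustrative and not taken from the question. The tab comes from printf rather than $'...', on the assumption that this keeps the sketch usable in a plain POSIX sh and independent of how a given sed treats \t.

```shell
# Hypothetical sample: three pipe-separated names, then three tab-separated numbers.
tab=$(printf '\t')                      # a literal tab character
sample="a|b|c|${tab}1${tab}2${tab}3"

# The question's pattern: "zero or more | characters, then a tab".
# The * applies only to the |, so at most the pipes directly before the tab go.
original() { printf '%s\n' "$sample" | sed "s/|*${tab}/${tab}/"; }

# Greedy .* runs to the LAST tab on the line, swallowing the middle columns.
greedy() { printf '%s\n' "$sample" | sed "s/|.*${tab}/${tab}/"; }

# [^<tab>]* cannot cross a tab, so the match ends at the FIRST tab.
bounded() { printf '%s\n' "$sample" | sed "s/|[^${tab}]*${tab}/${tab}/"; }

original   # a|b|c<TAB>1<TAB>2<TAB>3
greedy     # a<TAB>3
bounded    # a<TAB>1<TAB>2<TAB>3
```

With a real tab in the pattern, the question's expression removes only the single | that sits right before the tab. Seeing no change at all would be consistent with a sed that reads \t as a plain t, although the question does not say which sed was used, so that remains an assumption.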