Bash sed with greedy regex

I have a GTF file (a type of TSV) with the following structure:

ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|    13511132.24 244.489 2.7098
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|   68  26.127  0   0
ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA|   712 493.243 0   0

I would like to remove all the names from the first column except the first one, where the names are separated by "|". For example, the first line should be:

ENST00000488147.1    13511132.24 244.489 2.7098

My idea is to replace everything from the first "|" to the first "\t" with "\t", but sed is failing me. This command makes no changes:

sed 's/|*\t/\t/' test.tsv 

What am I doing wrong, and is there a better way to do this altogether?

Consider:

sed -re $'s@[|][^\t]*\t@\t@g'
  • $'...' is a ksh/bash syntax extension: the shell itself expands \t inside it to a literal tab, so the command does not rely on having a sed that (going beyond the standard) treats \t sequences as tabs.
  • sed -r puts sed in POSIX ERE mode rather than BRE mode. Many sed implementations also accept -E for the same purpose.
  • [|] matches only the literal | character, regardless of which regex syntax variant is in use.
  • [^\t]* matches zero or more characters that are not tabs, whereas .* would also match tabs and so run on to the last tab on the line, which would not give the desired output.
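
To make the points above concrete, here is a small sketch comparing three patterns on a made-up line. The function names and the sample line are illustrative assumptions, not part of the original question. It uses $'...' so that sed always sees literal tabs, and plain BRE mode, where | is an ordinary character.

```bash
#!/usr/bin/env bash
# Hypothetical sample line: three "|"-terminated names, then tab-separated numbers.
line=$'A|B|C|\t1\t2\t3'

# The question's pattern: "|*" means "zero or more | characters",
# so only the run of pipes directly before the first tab is removed.
pipes_before_tab() { sed $'s/|*\t/\t/'; }

# Greedy ".*" also matches tabs, so the match extends to the LAST tab.
strip_greedy() { sed $'s/|.*\t/\t/'; }

# The negated class cannot cross a tab, so the match stops at the FIRST tab.
strip_to_first_tab() { sed $'s/|[^\t]*\t/\t/'; }

printf '%s\n' "$line" | pipes_before_tab    # A|B|C<TAB>1<TAB>2<TAB>3
printf '%s\n' "$line" | strip_greedy        # A<TAB>3
printf '%s\n' "$line" | strip_to_first_tab  # A<TAB>1<TAB>2<TAB>3
```

Only the last variant keeps every numeric column while dropping the extra names.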

In context, as testable code:

write_line() {
  printf '%s\t' "$@" && printf '\n';
}
generate_input() {
  write_line 'ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|' 13511132.24 244.489 2.7098
  write_line 'ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|'    68  26.127  0   0
  write_line 'ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA|'    712 493.243 0   0
}
generate_input | sed -re $'s@[|][^\t]*\t@\t@g'

...produces as output:

ENST00000488147.1   13511132.24 244.489 2.7098  
ENST00000619216.1   68  26.127  0   0   
ENST00000473358.1   712 493.243 0   0   
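
As for whether there is another way to do this: one possible alternative, offered here as a sketch rather than as part of the answer above, is to let awk split the line on tabs and edit only the first field. The function name keep_first_id is a hypothetical helper, and the approach assumes the file really is tab-separated.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep only the first "|"-separated name in column 1.
keep_first_id() {
  awk 'BEGIN { FS = OFS = "\t" }      # read and write tab-separated fields
       { sub(/[|].*/, "", $1) }       # in field 1 only, drop everything from the first "|"
       1'                             # print every line
}

printf 'A|B|C|\t1\t2\n' | keep_first_id   # A<TAB>1<TAB>2
```

Because the substitution is confined to the first field, greediness is not a concern here: .* cannot reach past the first tab.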
