Bash sed and greedy regular expressions

Question

I have a GTF file (a type of TSV) with the following structure:

ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|    13511132.24 244.489 2.7098
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|   68  26.127  0   0
ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA|   712 493.243 0   0

I would like to remove every name in the first column except the first one; the names are separated by "|". For example, the first line should be:

ENST00000488147.1    13511132.24 244.489 2.7098

My idea is to replace everything from the first "|" to the first "\t" with "\t", but sed is failing me. This command makes no changes:

sed 's/|*\t/\t/' test.tsv

What am I doing wrong, and is there a better way to do this altogether?

Answer 1

Consider:

sed -re $'s@[|][^\t]*\t@\t@g'

$'...' is a ksh/bash syntax extension: the shell expands $'\t' to a literal tab, so the command does not depend on having a sed that (going beyond the standard) treats \t sequences as tabs.
sed -r puts sed in POSIX ERE mode, as opposed to the default BRE mode.
[|] matches only the literal | character, regardless of which regex syntax variant is in use.
[^\t]* matches zero or more characters that are not tabs, whereas .* would also match tabs and, being greedy, would run through to the last tab on the line, which would not produce the desired output.
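
The points above can be checked with a small experiment. This is only a sketch: it assumes bash and a sed that accepts -E for ERE mode (GNU and BSD sed both do; -r is the older GNU spelling), and the shortened sample line and the variable names are made up for illustration. It compares the question's pattern, given a real tab via $'...', with the pattern from this answer.

```shell
#!/usr/bin/env bash
# Hypothetical, shortened sample line: names joined by "|", then tab-separated numbers.
line=$'ENST1|ENSG1|GENE1|\t68\t26.127'

# $'\t' is expanded by the shell to one literal tab character.
tab=$'\t'
echo "${#tab}"

# The question's pattern, with a literal tab: "|*" means zero or more "|"
# characters, so only the single pipe directly before the tab is consumed.
asker=$(printf '%s\n' "$line" | sed $'s/|*\t/\t/')

# This answer's pattern: one "|", then any run of non-tab characters, then the tab.
fixed=$(printf '%s\n' "$line" | sed -E $'s@[|][^\t]*\t@\t@')

printf '%s\n' "$asker" "$fixed"
```

With a real tab in the pattern, the question's version strips only the last pipe and leaves the other names in place, while the second version leaves just the first name. On a sed that does not interpret \t itself, the single-quoted original would not be looking for a tab at all, which is one plausible reason it appeared to change nothing.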

In context, as testable code:在上下文中，作为可测试代码：

write_line() {
  printf '%s\t' "$@" && printf '\n';
}
generate_input() {
  write_line 'ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|' 13511132.24 244.489 2.7098
  write_line 'ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|'    68  26.127  0   0
  write_line 'ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA|'    712 493.243 0   0
}
generate_input | sed -re $'s@[|][^\t]*\t@\t@g'

...produces as output:

ENST00000488147.1   13511132.24 244.489 2.7098  
ENST00000619216.1   68  26.127  0   0   
ENST00000473358.1   712 493.243 0   0
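
As for whether there is another way to do this: one possible alternative, sketched here under the assumption that a POSIX awk is available, is to treat the line as tab-separated fields and trim only the first field. The sample data and the variable names are hypothetical and merely mimic the shape of the question's file.

```shell
#!/usr/bin/env bash
# Hypothetical two-line sample in the same shape as the question's data.
sample=$'ENST1|ENSG1|GENE1|\t68\t26.127\nENST2|-|-|GENE2|\t712\t493.243'

# Split on tabs, delete everything from the first "|" onward in field 1 only,
# then print the record re-joined with tabs (the trailing "1" is awk's print-all idiom).
trimmed=$(printf '%s\n' "$sample" |
  awk 'BEGIN { FS = OFS = "\t" } { sub(/[|].*/, "", $1) } 1')

printf '%s\n' "$trimmed"
```

Because the substitution is confined to the first field, a greedy .* is harmless here: it cannot run past the first tab.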

Answer 1: score 2, accepted, 2019-09-20 19:46:26
