
Rename specific lines in a text file with sed

I have a file that looks like this:

>alks|keep1|aoiuor|lskdjf
ldkfj
alksj
asdflkj
>jhoj_kl|keep2|kjghoij|adfjl
aldskj
alskj
alsdkj

I would like to edit just the lines starting with >, ideally in place, to get this file:

>keep1
ldkfj
alksj
asdflkj
>keep2
aldskj
alskj
alsdkj

I know that in principle this is achievable with various combinations of sed/awk/cut, but I haven't been able to figure out the right combination. Ideally it should be fast: the file has many millions of lines, and many of the lines are also very long.

Key things about the lines I want to edit:

  • They always start with >
  • The bit I want to keep is always between the first and second pipe symbol | (hence my thinking that cut is going to help)
  • The bit I want to keep has alphanumeric characters and sometimes underscores. The rest of the string on the same line can contain any symbols

What I've tried that seems helpful

(Most of my sed attempts are pure garbage)

cut -d '|' -f 2  test.txt

This gets me the bit of the string that I want, and it keeps the other lines too. So it's close, but (of course) it doesn't preserve the initial > on the lines where cut applies, so it's missing a crucial part of the solution.
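To make the shortfall concrete, here is a minimal sketch that runs the same cut command on a shortened copy of the sample. The scratch file and the variable names are illustrative only and not part of the original question.

```shell
# Shortened sample from the question, written to a scratch file (name is arbitrary).
sample=$(mktemp)
printf '%s\n' '>alks|keep1|aoiuor|lskdjf' 'ldkfj' '>jhoj_kl|keep2|kjghoij|adfjl' 'aldskj' > "$sample"

# cut prints field 2 where a | exists and the whole line where it does not,
# so the header lines lose their leading ">".
cut_out=$(cut -d '|' -f 2 "$sample")
printf '%s\n' "$cut_out"

rm -f "$sample"
```

The header lines come back as keep1 and keep2 without the leading >, which is the part that still has to be restored.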

With sed:

sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
  • /^>/ selects lines starting with >. This is not strictly necessary for the given sample, but it sometimes gives a faster result than using s alone
  • ^[^|]+\| matches the non-| characters from the start of the line, up to and including the first |
  • ([^|]+) captures the second field
  • .* matches the rest of the line
  • >\1 is the replacement string, where \1 holds the contents of ([^|]+)
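The pieces listed above can be checked on a few lines of the sample. This is a minimal sketch; the variable names input and result are illustrative only.

```shell
# Two header lines and one data line, modelled on the sample in the question.
input='>alks|keep1|aoiuor|lskdjf
ldkfj
>jhoj_kl|keep2|kjghoij|adfjl'

# Only lines matching /^>/ are rewritten; all other lines pass through unchanged.
result=$(printf '%s\n' "$input" | sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/')
printf '%s\n' "$result"
# prints:
# >keep1
# ldkfj
# >keep2
```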

If your input contains only ASCII characters, this would give you much faster results:

LC_ALL=C sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
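Since the question asks for an in-place edit, the same command could be combined with the -i option. The sketch below assumes a sed that accepts -i with an attached backup suffix, as GNU sed and BSD sed do (-i is not part of POSIX). The scratch directory and the .bak suffix are arbitrary choices for illustration.

```shell
# Work on a scratch copy so the example does not touch real data.
workdir=$(mktemp -d)
printf '%s\n' '>alks|keep1|aoiuor|lskdjf' 'ldkfj' > "$workdir/test.txt"

# -i.bak edits test.txt in place and keeps the original as test.txt.bak.
LC_ALL=C sed -i.bak -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/' "$workdir/test.txt"

edited=$(cat "$workdir/test.txt")
backup=$(cat "$workdir/test.txt.bak")
printf '%s\n' "$edited"

rm -rf "$workdir"
```

Keeping a backup is a cautious default for a file with millions of lines; the .bak copy can be deleted once the result has been checked.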

Timing

  • When timing the commands on a huge file created by repeating the given input sample, awk is much faster than sed, and mawk is faster still
  • However, OP reports that the sed solution is faster for the actual data
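A comparison of this kind could be set up roughly as follows. This is only a sketch: the file names and the repeat count are arbitrary, the awk one-liner is one of those suggested in this thread, and time is assumed to be the shell keyword found in bash. No particular timing result is implied, and as noted above the outcome can differ on real data.

```shell
# Build a larger test file by repeating a small sample (file names are arbitrary).
workdir=$(mktemp -d)
printf '%s\n' '>alks|keep1|aoiuor|lskdjf' 'ldkfj' 'alksj' 'asdflkj' > "$workdir/sample.txt"
i=0
while [ "$i" -lt 1000 ]; do   # raise the count for a more realistic file size
  cat "$workdir/sample.txt"
  i=$((i + 1))
done > "$workdir/big.txt"

# Time each candidate; output goes to a file so the results can be compared.
time sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/' "$workdir/big.txt" > "$workdir/sed.out"
time awk -F'|' '/^>/{print ">"$2; next} 1' "$workdir/big.txt" > "$workdir/awk.out"

# Both commands should produce identical output.
if cmp -s "$workdir/sed.out" "$workdir/awk.out"; then same=yes; else same=no; fi
lines=$(($(wc -l < "$workdir/big.txt")))
echo "lines: $lines, identical output: $same"

rm -rf "$workdir"
```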

With your shown samples, you could try the following. This code sets the field separator to | for all lines of Input_file. The main program then checks whether a line starts with >; if it does, it prints > followed by the 2nd field, otherwise it prints the complete line.

awk -F'|' '/^>/{print ">"$2;next} 1' Input_file

Explanation: a detailed explanation of the above.

awk -F'|' '     ## Start the awk program and set the field separator to |.
/^>/{           ## If the line starts with >, do the following.
  print ">"$2   ## Print > followed by the 2nd field of the current line.
  next          ## Skip the remaining statements for this line.
}
1               ## Print the current line unchanged.
' Input_file    ## Name of the input file.

You can also use the following awk command:

awk  -F\| '/^>/{print ">"$2} !/^>/{print}' file
# Inplace replacement with gawk (GNU awk)
gawk -i inplace  -F\| '/^>/{print ">"$2} !/^>/{print}' file
# "Inline-like" replacement with any awk
awk -F\| '/^>/{print ">"$2} !/^>/{print}' file > tmp && mv tmp file

Here,

  • -F\| sets the field separator to a | char
  • /^>/ is the condition: the line starts with > (and !/^>/ means the opposite)
  • {print ">"$2} prints the Field 2 value with a > char prepended to it
  • {print} simply prints the full line.

Note that !/^>/{print} can be reduced to !/^>/, since print is the default action.
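As a quick check of that reduction, the sketch below runs both forms on a few sample lines and compares the results. The variable names are illustrative only.

```shell
input='>alks|keep1|aoiuor|lskdjf
ldkfj
>jhoj_kl|keep2|kjghoij|adfjl'

# Explicit {print} for the non-header lines.
long_form=$(printf '%s\n' "$input" | awk -F'|' '/^>/{print ">"$2} !/^>/{print}')

# Same program with the default action (print) left implicit.
short_form=$(printf '%s\n' "$input" | awk -F'|' '/^>/{print ">"$2} !/^>/')

# Both forms give the same output.
[ "$long_form" = "$short_form" ] && printf '%s\n' "$short_form"
```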

See an online demo:

s='>alks|keep1|aoiuor|lskdjf
ldkfj
alksj
asdflkj
>jhoj_kl|keep2|kjghoij|adfjl
aldskj
alskj
alsdkj'
awk  -F\| '/^>/{print ">"$2} !/^>/{print}' <<< "$s"

Output:

>keep1
ldkfj
alksj
asdflkj
>keep2
aldskj
alskj
alsdkj

