
Delete exact string from a large file?

I have data in the following form in a file:

    <http://purl.uniprot.org/here>   <http://purl.uniprot.org/here/unipot/purl>
    <http://purl.uniprot.org/uniprot/Q196Y7>        <http://purl.uniprot.org/core/annotation>

I want to remove every "http://purl.uniprot.org" that appears within the angle brackets, so that the output I get is:

    <here>   <here/unipot/purl>
    <uniprot/Q196Y7>        <core/annotation>

I tried to do this using vi's replace command, but it turned out to be quite slow because my file is 1TB. Is there a more efficient way to do the same thing using Linux tools or Python?

I know I can use sed, but sed finds patterns and deletes them, whereas I want to delete the exact contents.

This should work from the command line:

    sed -i 's|http://purl\.uniprot\.org/||g' /path/to/filename

You can first try it without the -i argument to see the output in your console instead of modifying the file.
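Since the question also asks about Python, here is a minimal sketch of the same idea as a streaming filter. It treats the search text as a literal byte string rather than a pattern, which is what the question asks for. The function name `strip_literal` and the constant `TARGET` are made up for this illustration, and the trailing slash is included in `TARGET` on the assumption that the desired output shown in the question is what you want.

```python
# Literal byte string to delete (assumed from the sample data in the question).
TARGET = b"http://purl.uniprot.org/"


def strip_literal(src, dst, target=TARGET):
    """Copy src to dst line by line, deleting every literal occurrence of target.

    src and dst are binary file objects. Only one line is held in memory
    at a time, so the total file size does not matter.
    """
    for line in src:
        # bytes.replace matches the exact bytes; no regular expressions involved.
        dst.write(line.replace(target, b""))
```

To use it as a filter, open the input and a separate output file in binary mode (`"rb"` and `"wb"`) and pass both to `strip_literal`, or pass `sys.stdin.buffer` and `sys.stdout.buffer`. Writing to a new file needs enough free disk space for a second copy, but leaves the original intact if the run is interrupted.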

As Radu Rădeanu said, sed is a good tool for replacing strings in files, since it works on streams instead of trying to load the whole file into memory.

But sed uses regular expressions, and in your case (1TB of input data) this might be too slow. Unix tools can usually handle files of arbitrary size and are surprisingly efficient, but an extreme case like this one might be too much for them.

If you need to optimize the process, here are a few pointers:

  1. Split the huge file into smaller ones. For example, if this is a log file, create a single file per day instead of concatenating everything into one huge file. That way, you only need to strip the string once from each daily file.

  2. Write a small C program that searches for the exact string (instead of using a regexp). You can then use optimizations like Boyer-Moore to get a huge performance boost. You should also consider using memory-mapped I/O.
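As a rough illustration of the exact-string approach in the second pointer, here is a sketch in Python rather than C. It reads fixed-size chunks with plain `read` calls (not memory-mapped I/O) and relies on the built-in `bytes.find` for the literal search instead of a hand-written Boyer-Moore. The name `remove_literal` and the default chunk size are assumptions made for this example. The one subtle part is that a match can straddle two chunks, so the last `len(target) - 1` unmatched bytes are carried over to the next read.

```python
def remove_literal(src, dst, target, chunk_size=1 << 20):
    """Copy src to dst in chunks, deleting every literal occurrence of target.

    src and dst are binary file objects; target is a non-empty bytes object.
    """
    if not target:
        raise ValueError("target must not be empty")
    keep = len(target) - 1  # longest tail that could be the start of a match
    tail = b""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        pieces = []
        pos = 0
        while True:
            hit = buf.find(target, pos)  # exact substring search, no regex
            if hit == -1:
                break
            pieces.append(buf[pos:hit])
            pos = hit + len(target)
        # Everything before safe_end can no longer be part of a match.
        safe_end = max(pos, len(buf) - keep)
        pieces.append(buf[pos:safe_end])
        tail = buf[safe_end:]  # may begin a match that finishes in the next chunk
        dst.write(b"".join(pieces))
    dst.write(tail)  # leftover bytes at end of input cannot contain a match
```

Because it works on raw chunks instead of lines, this version does not depend on the file having reasonably short lines. Whether it beats sed on a given machine is something to measure on a sample of the data rather than assume.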

What do you mean by "But it turned out to be quite"? Quite what? In my experience, vi is a perfectly good tool for this. Run this command:

    :%s/http:\/\/purl\.uniprot\.org\///g
