Extract specific text lines?

I have a large text file, several hundred thousand lines long. I have to extract 30,000 specific lines that are scattered at random spots throughout the file. This is the program I use to extract one line at a time:

big_file = open('C:\\gbigfile.txt', 'r')
small_file3 = open('C:\\small_file3.txt', 'w')
for line in big_file:
   if 'S0414' in line:
      small_file3.write(line)
big_file.close()
small_file3.close()

How can I speed this up for the 30,000 lines that I need to look up?

Aha! So your real problem is how to test many conditions per line and, if any one of them is satisfied, output that line. The easiest way, I think, is to use a regular expression:

import re

keywords = ['S0414', 'GT213', 'AT3423', 'PR342'] # etc - you probably get those from some source
# use re.escape() on each keyword if they may contain regex metacharacters
pattern = re.compile('|'.join(keywords))

# file names taken from the question
with open('C:\\gbigfile.txt') as inf, open('C:\\small_file3.txt', 'w') as outf:
    for line in inf:
        if pattern.search(line):
            outf.write(line)

Testing many conditions per line is generally slow when using a naive algorithm. There are various superior algorithms (e.g. using tries) which can do much better. I suggest you give the Aho-Corasick string matching algorithm a shot. See here for a Python implementation. It should be considerably faster than the naive approach of using a nested loop and testing every string individually.
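
For illustration, here is a minimal sketch of the Aho-Corasick approach, assuming the third-party pyahocorasick package is installed (my assumption, not part of the original answer) and reusing the file names from the question:

import ahocorasick  # third-party "pyahocorasick" package - an assumption, not what the link above refers to

keywords = ['S0414', 'GT213', 'AT3423', 'PR342']

automaton = ahocorasick.Automaton()
for kw in keywords:
    automaton.add_word(kw, kw)      # store each keyword as its own payload
automaton.make_automaton()          # build the failure links once, up front

with open('C:\\gbigfile.txt') as inf, open('C:\\small_file3.txt', 'w') as outf:
    for line in inf:
        # iter() yields (end_index, payload) for every keyword hit in the line
        if next(automaton.iter(line), None) is not None:
            outf.write(line)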

According to Python's documentation of file objects, the iteration you're doing should not be especially slow, and searching for substrings should also be fine speed-wise.

I don't see any reason why your code should be slow, so if you need it to go faster you might have to rewrite it in C and use mmap() for fast access to the source file.
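
If rewriting in C feels like overkill, Python's built-in mmap module is a middle ground; here is a rough sketch (my addition, reusing the file names from the question and searching for a single key):

import mmap

big = open('C:\\gbigfile.txt', 'rb')
out = open('C:\\small_file3.txt', 'wb')
mm = mmap.mmap(big.fileno(), 0, access=mmap.ACCESS_READ)
pos = mm.find(b'S0414')
while pos != -1:
    start = mm.rfind(b'\n', 0, pos) + 1   # start of the containing line (-1 + 1 == 0 for the first line)
    end = mm.find(b'\n', pos)
    if end == -1:                         # last line may lack a trailing newline
        end = mm.size()
    out.write(mm[start:end] + b'\n')
    pos = mm.find(b'S0414', end)
mm.close()
big.close()
out.close()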

1. Try to read the whole file

One speed-up you can do is to read the whole file into memory if that is possible, else read it in chunks. You said 'several hundred thousand lines'; let's say 1 million lines with each line 100 chars, i.e. around 100 MB. If you have that much free memory (I assume you do), just do this:

big_file = open('C:\\gbigfile.txt', 'r')
big_file_lines = big_file.readlines()
big_file.close()
small_file3 = open('C:\\small_file3.txt', 'w')
for line in big_file_lines:
   if 'S0414' in line:
      small_file3.write(line)
small_file3.close()

Time this against the original version and see if it makes a difference; I think it will.

But if your file is really big, in the GBs, then you can read it in chunks, e.g. 100 MB chunks, split each chunk into lines and search, but don't forget to join the partial lines at each 100 MB boundary (a rough sketch follows; I can elaborate more if this is the case).
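
For example, a rough sketch of that chunked approach (my addition, keeping the file names from the question); the incomplete last line of each chunk is carried over to the next one so that no line is split across a boundary:

CHUNK = 100 * 1024 * 1024   # 100 MB per read
big_file = open('C:\\gbigfile.txt', 'r')
small_file3 = open('C:\\small_file3.txt', 'w')
leftover = ''
while True:
    chunk = big_file.read(CHUNK)
    if not chunk:
        break
    chunk = leftover + chunk
    lines = chunk.split('\n')
    leftover = lines.pop()            # possibly incomplete last line, kept for the next chunk
    for line in lines:
        if 'S0414' in line:
            small_file3.write(line + '\n')
if leftover and 'S0414' in leftover:  # the file may not end with a newline
    small_file3.write(leftover + '\n')
big_file.close()
small_file3.close()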

file.readlines returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.

Also see the following link for the speed difference between line-by-line and whole-file reading: http://handyfloss.wordpress.com/2008/02/15/python-speed-vs-memory-tradeoff-reading-files/

2. Try to write the whole file

You can also store the matching lines and write them all out at once at the end, though I am not sure if it will help much:

big_file = open('C:\\gbigfile.txt', 'r')
big_file_lines = big_file.readlines()
big_file.close()
small_file_lines = []
for line in big_file_lines:
   if 'S0414' in line:
      small_file_lines.append(line)
small_file3 = open('C:\\small_file3.txt', 'w')
small_file3.write("".join(small_file_lines))
small_file3.close()

3. Try filter

You can also try to use filter instead of a loop and see if it makes a difference:

small_file_lines = filter(lambda line: line.find('S0414') >= 0, big_file_lines)

You could try reading in big blocks, and avoiding the overhead of line-splitting except for the specific lines of interest. E.g., assuming none of your lines is longer than a megabyte:

BLOCKSIZE = 1024 * 1024

def byblock_fulllines(f):
    tail = ''
    while True:
        block = f.read(BLOCKSIZE)
        if not block: break
        linend = block.rindex('\n')
        newtail = block[linend + 1:]
        block = tail + block[:linend + 1]
        tail = newtail
        yield block
    if tail: yield tail + '\n'

This takes an open file argument and yields blocks of about 1MB guaranteed to end with a newline. To identify (iterator-wise) all occurrences of a needle string within a haystack string:

def haystack_in_needle(haystack, needle):
    start = 0
    while True:
        where = haystack.find(needle, start)
        if where == -1: return
        yield where
        start = where + 1

To identify all relevant lines from within such a block:

def wantlines_inblock(s, block):
    last_yielded = None
    for where in haystack_in_needle(block, s):
        prevend = block.rfind('\n', 0, where)  # newline before the hit; could be -1, that's OK
        if prevend == last_yielded: continue  # no double-yields
        linend = block.find('\n', where)
        if linend == -1: linend = len(block)
        yield block[prevend + 1: linend]
        last_yielded = prevend

How this all fits together:

def main():
    with open('bigfile.txt') as f:
        with open('smallfile.txt', 'w') as g:
            for block in byblock_fulllines(f):
                for line in wantlines_inblock('S0414', block):
                    g.write(line)

In 2.7 you could fold both with statements into one, just to reduce nesting a bit.

Note: this code is untested, so there might be (hopefully small ;-) errors such as off-by-ones. Performance needs tuning of the block size and must be calibrated by measurement on your specific machine and data. Your mileage may vary. Void where prohibited by law.

If the line begins with S0414, then you could use the .startswith method:

if line.startswith('S0414'): small_file3.write(line)

You could also strip left whitespace, if there is any:

line.lstrip().startswith('S0414')

If 'S0414' always appears within a known region of the line (for example, it always starts at least 10 characters in and never falls within the last 5 characters), you could do:

'S0414' in line[10:-5]

Otherwise, you will have to search through each line, like you are.

What are the criteria that define the 30,000 lines you want to extract? The more information you give, the more likely you are to get a useful answer.

If you want all the lines containing a certain string, or more generally containing any of a given set of strings, or an occurrence of a regular expression, use grep. It's likely to be significantly faster for large data sets.

This reminds me of a problem described by Tim Bray, who attempted to extract data from web server log files using multi-core machines. The results are described in The Wide Finder Project and Wide Finder 2. So, if serial optimizations don't go fast enough for you, this may be a place to start. There are examples of this sort of problem contributed in many languages, including python. Key quote from that last link:

Summary

In this article, we took a relatively fast Python implementation and optimized it, using a number of tricks:

  • Pre-compiled RE patterns
  • Fast filtering of candidate lines
  • Chunked reading
  • Multiple processes
  • Memory mapping, combined with support for RE operations on mapped buffers

This reduced the time needed to parse 200 megabytes of log data from 6.7 seconds to 0.8 seconds on the test machine. Or in other words, the final version is over 8 times faster than the original Python version, and (potentially) 600 times faster than Tim's original Erlang version.

Having said this, 30,000 lines isn't that many, so you may want to at least start by investigating your disk read/write performance. Does it help if you write the output to a different disk than the one you are reading the input from, or if you read the whole file in one go before processing?

Your best bet for speeding this up would be if the specific string S0414 always appears at the same character position: then, instead of having to make several failed comparisons per line (you said they start with different names), it could do just one and be done.

E.g. if your file has lines like

GLY S0414 GCT
ASP S0435 AGG
LEU S0432 CCT

do an if line[4:9] == 'S0414': small.write(line).

This method assumes the special values appear at the same position on the line in gbigfile:

def mydict(iterable):
    d = {}
    for k, v in iterable:
        if k in d:
            d[k].append(v)
        else:
            d[k] = [v]
    return d

with open("C:\\to_find.txt", "r") as t:
    tofind = mydict([(x[0], x) for x in t.readlines()])

with open("C:\\gbigfile.txt", "r") as bigfile:
    with open("C:\\outfile.txt", "w") as outfile:
        for line in bigfile:
            seq = line[4:9]
            # .get() avoids a KeyError when no target starts with this letter
            if seq in tofind.get(seq[0], []):
                outfile.write(line)

Depending on the distribution of the starting letters in those targets, you can cut your comparisons down by a significant amount. If you don't know where the values will appear, you're looking at a LONG operation, because you'll have to compare each of hundreds of thousands of lines - let's say 300,000 - against 30,000 targets. That's 9 billion comparisons, which is going to take a long time.
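
If the targets really are all 5 characters long at the same fixed offset, a plain set is a simpler alternative (a sketch of my own, using the same file names as above): membership tests are O(1), so the per-line cost no longer grows with the number of targets.

with open("C:\\to_find.txt", "r") as t:
    # one target per line in to_find.txt, newlines stripped
    targets = set(line.strip() for line in t if line.strip())

with open("C:\\gbigfile.txt", "r") as bigfile:
    with open("C:\\outfile.txt", "w") as outfile:
        for line in bigfile:
            if line[4:9] in targets:
                outfile.write(line)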
