
Python regex parse stream

Is there any way to use regex match on a stream in Python? Like this:

reg = re.compile(r'\w+')
reg.match(StringIO.StringIO('aa aaa aa'))

And I don't want to do this by getting the value of the whole string. I want to know if there's any way to match a regex on a stream (on the fly).

I had the same problem. The first thought was to implement a LazyString class, which acts like a string but reads only as much data from the stream as is currently needed (I did this by reimplementing __getitem__ and __iter__ to fetch and buffer characters up to the highest position accessed...).
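Roughly, the idea was something like this sketch (a reconstruction of that approach, not the original code; the chunk size is arbitrary):

class LazyString:
    # Act like a string, but read from the underlying stream only up to
    # the highest position accessed so far.
    def __init__(self, stream, chunksize=4096):
        self._stream = stream
        self._chunksize = chunksize
        self._buf = ''
        self._eof = False

    def _fetch(self, upto):
        # Buffer characters until index `upto` is covered, or until EOF.
        # upto=None means "read everything".
        while not self._eof and (upto is None or len(self._buf) <= upto):
            chunk = self._stream.read(self._chunksize)
            if not chunk:
                self._eof = True
            self._buf += chunk

    def __getitem__(self, index):
        upto = index.stop if isinstance(index, slice) else index
        if upto is None or upto < 0:
            upto = None  # open-ended or negative index: need the whole stream
        self._fetch(upto)
        return self._buf[index]

    def __iter__(self):
        pos = 0
        while True:
            self._fetch(pos)
            if pos >= len(self._buf):
                return
            yield self._buf[pos]
            pos += 1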

This didn't work out (I got a "TypeError: expected string or buffer" from re.match), so I looked a bit into the implementation of the re module in the standard library.

Unfortunately, using regexes on a stream seems not to be possible. The core of the module is implemented in C, and this implementation expects the whole input to be in memory at once (I guess mainly for performance reasons). There seems to be no easy way to fix this.

I also had a look at PLY (Python Lex-Yacc), but its lexer uses re internally, so this wouldn't solve the issue.

A possibility could be to use ANTLR, which supports a Python backend. It constructs the lexer using pure Python code and seems to be able to operate on input streams. Since for me the problem is not that important (I do not expect my input to be extensively large...), I will probably not investigate it further, but it might be worth a look.

In the specific case of a file, if you can memory-map the file with mmap and if you're working with bytestrings instead of Unicode, you can feed a memory-mapped file to re as if it were a bytestring and it'll just work. This is limited by your address space, not your RAM, so a 64-bit machine with 8 GB of RAM can memory-map a 32 GB file just fine.
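A minimal sketch of this (the filename and pattern here are just placeholders):

import mmap
import re

with open('huge.log', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The mapping behaves like one big bytestring; the OS pages data
        # in lazily as the regex engine walks through it.
        for m in re.finditer(rb'\w+', mm):
            print(m.group())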

If you can do this, it's a really nice option. If you can't, you have to turn to messier options.


The 3rd-party regex module (not re) offers partial match support, which can be used to build streaming support... but it's messy and has plenty of caveats. Things like lookbehinds and ^ won't work, zero-width matches would be tricky to get right, and I don't know if it'd interact correctly with other advanced features regex offers and re doesn't. Still, it seems to be the closest thing to a complete solution available.

If you pass partial=True to regex.match, regex.fullmatch, regex.search, or regex.finditer, then in addition to reporting complete matches, regex will also report things that could be a match if the data were extended:

In [10]: regex.search(r'1234', '12', partial=True)
Out[10]: <regex.Match object; span=(0, 2), match='12', partial=True>

It'll report a partial match instead of a complete match if more data could change the match result, so for example, regex.search(r'[\s\S]*', anything, partial=True) will always be a partial match.
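To make that concrete, continuing in the same style as above: a match that stops short of the end of the data comes back complete, while one that reaches the end comes back partial, since more data could extend it:

In [11]: regex.search(r'\d+', '12 drowning', partial=True)
Out[11]: <regex.Match object; span=(0, 2), match='12'>

In [12]: regex.search(r'\d+', '12', partial=True)
Out[12]: <regex.Match object; span=(0, 2), match='12', partial=True>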

With this, you can keep a sliding window of data to match against, extending it when you hit the end of the window and discarding consumed data from the beginning. Unfortunately, anything that would get confused by data disappearing from the start of the string won't work, so lookbehinds, ^, \b, and \B are out. Zero-width matches would also need careful handling. Here's a proof of concept that uses a sliding window over a file or file-like object:

import regex

def findall_over_file_with_caveats(pattern, file):
    # Caveats:
    # - doesn't support ^ or backreferences, and might not play well with
    #   advanced features I'm not aware of that regex provides and re doesn't.
    # - Doesn't do the careful handling that zero-width matches would need,
    #   so consider behavior undefined in case of zero-width matches.
    # - I have not bothered to implement findall's behavior of returning groups
    #   when the pattern has groups.
    # Unlike findall, produces an iterator instead of a list.

    # bytes window for bytes pattern, unicode window for unicode pattern
    # We assume the file provides data of the same type.
    window = pattern[:0]
    chunksize = 8192
    sentinel = object()

    last_chunk = False

    while not last_chunk:
        chunk = file.read(chunksize)
        if not chunk:
            last_chunk = True
        window += chunk

        match = sentinel
        for match in regex.finditer(pattern, window, partial=not last_chunk):
            if not match.partial:
                yield match.group()

        if match is sentinel or not match.partial:
            # No partial match at the end (maybe even no matches at all).
            # Discard the window. We don't need that data.
            # The only cases I can find where we do this are if the pattern
            # uses unsupported features or if we're on the last chunk, but
            # there might be some important case I haven't thought of.
            window = window[:0]
        else:
            # Partial match at the end.
            # Discard all data not involved in the match.
            window = window[match.start():]
            if match.start() == 0:
                # Our chunks are too small. Make them bigger.
                chunksize *= 2
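
For example, on the question's input, with io.StringIO standing in for a real file:

import io

f = io.StringIO('aa aaa aa')
print(list(findall_over_file_with_caveats(r'\w+', f)))
# -> ['aa', 'aaa', 'aa']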

This seems to be an old problem. As I have posted to a similar question, you may want to subclass the Matcher class of my solution streamsearch-py and perform regex matching in the buffer. Check out kmp_example.py for a template. If it turns out classic Knuth-Morris-Pratt matching is all you need, then your problem would be solved right now with this little open-source library :-)
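
If a fixed literal is really all you need, the core idea is small enough to sketch directly. This is just textbook KMP over chunks, not the streamsearch-py API; the function name and chunk size are made up:

import io

def kmp_find_in_stream(needle, stream, chunksize=4096):
    # Yield the absolute start offset of every (possibly overlapping)
    # occurrence of `needle` in the stream, reading chunk by chunk.
    # Works for str needles/streams; bytes also work, since iterating
    # a bytes chunk and indexing a bytes needle both yield ints.
    if not needle:
        raise ValueError('needle must be non-empty')

    # Failure table: fail[i] is the length of the longest proper prefix
    # of needle[:i+1] that is also a suffix of it.
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k

    state = 0   # number of needle characters currently matched
    offset = 0  # absolute position of the current character
    while True:
        chunk = stream.read(chunksize)
        if not chunk:
            break
        for ch in chunk:
            while state and ch != needle[state]:
                state = fail[state - 1]
            if ch == needle[state]:
                state += 1
            if state == len(needle):
                yield offset - len(needle) + 1
                state = fail[state - 1]  # allow overlapping matches
            offset += 1

print(list(kmp_find_in_stream('aa', io.StringIO('aa aaa aa'))))
# -> [0, 3, 4, 7]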

The answers here are now outdated. The modern Python re package now supports bytes-like objects, which have an API you can implement yourself to get streaming behaviour.
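For instance, re will match directly against a memoryview or a bytearray, since both are bytes-like:

import re

data = bytearray(b'aa aaa aa')
# Any object supporting the buffer protocol works with a bytes pattern:
print(re.match(rb'\w+', memoryview(data)).group())
# -> b'aa'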

Yes - using the getvalue method:

import cStringIO
import re

data = cStringIO.StringIO("some text")
regex = re.compile(r"\w+")
regex.match(data.getvalue())
