How do I build a tokenizing, regex-based iterator in Python?

I'm basing this question on an answer I gave to another SO question, which was my specific attempt at a tokenizing, regex-based iterator using the pairwise iterator recipe from more_itertools.

Following is my code taken from that answer:

from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer(r"^|[ ]+|$", string)):
    print(string[prev.end(): curr.start()])  # originally I yield here

I then noticed that if the string starts or ends with delimiters (i.e. string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d ") the tokenizer prints empty strings at the beginning and end of its list of token outputs (these are actually extra matches for string start and string end). To remedy this I tried the following (quite ugly) alternative regexes:

  1. "(?:^|[ ]|$)+" - This seems quite simple and like it should work, but it doesn't (and it also seems to behave wildly differently on other regex engines). For some reason it won't build a single match from the string's start and the delimiters following it; the string start somehow also consumes the character following it! (This is also where I see divergence from other engines. Is this a BUG? Or does it have something to do with special non-corporeal characters and the or (|) operator in Python that I'm not aware of?) This solution also did nothing for the double match at the string's end: it matched the delimiters once and then gave another match for the string end ($) character itself.

  2. "(?:[ ]|$|^)+" - Putting the delimiters first actually solves one of the problems: the split at the beginning doesn't contain string start (but I don't care too much about that anyway, since I'm interested in the tokens themselves). It also matches string start when there are no delimiters at the beginning of the string, but the string ending is still a problem.

  3. "(^[ ]*)|([ ]*$)|([ ]+)" - This final attempt got the string start to be part of the first match (which wasn't really much of a problem in the first place), but try as I might I couldn't get rid of the delimiter + end match followed by a separate end match (which yields an additional empty string). Still, I'm showing you this example (with grouping) because it shows that the ending special character $ is matched twice, once with the preceding delimiters and once by itself (two matches of group 2).

My questions are:

  1. Why do I get such strange behavior in attempt #1?
  2. How do I solve the end-of-string issue?
  3. Am I being a tank, i.e. is there a simple way to solve this that I'm blindly missing?
  4. Remember that the solution can't change the string and must produce an iterable generator which iterates on the spaces between the tokens and not on the tokens themselves. (This last part might seem to complicate the answer unnecessarily, since otherwise I have a simple answer, but if you must know (and if you don't, read no further): it's part of a bigger framework I'm building, where this yielding method is inherited by a pipeline which then constructs yielded sentences out of it in various patterns, which are used to extract fields from semi-structured, classifier-driven messages.)

The problems you're having are due to the trickiness and undocumented edge cases of zero-width matches. You can resolve them by using negative lookarounds to explicitly tell Python not to produce a match for ^ or $ if the string has delimiters at the start or end:

delimiter_re = r'[\n\- ]'     # newline, hyphen, or space
search_regex = r'''^(?!{0})   # string start with no delimiter
                   |          # or
                   {0}+       # sequence of delimiters (at least one)
                   |          # or
                   (?<!{0})$  # string end with no delimiter
                '''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)

Note that for an empty string this produces a single match, rather than zero matches or separate beginning and ending matches.
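As a rough check of the pattern above, here is a minimal sketch that feeds it to the pairwise loop from the question. The pairwise helper is a local stand-in for more_itertools.pairwise, and tokens_between_matches is a hypothetical name introduced only for this illustration:

```python
import re
from itertools import tee

delimiter_re = r'[\n\- ]'     # newline, hyphen, or space
search_regex = r'''^(?!{0})   # string start with no delimiter
                   |          # or
                   {0}+       # sequence of delimiters (at least one)
                   |          # or
                   (?<!{0})$  # string end with no delimiter
                '''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)


def pairwise(iterable):
    """Local stand-in for more_itertools.pairwise: (s0, s1), (s1, s2), ..."""
    first, second = tee(iterable)
    next(second, None)
    return zip(first, second)


def tokens_between_matches(string):
    """Hypothetical helper: yield the text between consecutive matches."""
    for prev, curr in pairwise(search_pattern.finditer(string)):
        yield string[prev.end():curr.start()]


print(list(tokens_between_matches(" a b ")))  # ['a', 'b']
print(list(tokens_between_matches("a b")))    # ['a', 'b']
print(list(tokens_between_matches("")))       # []
```

With delimiters at both ends the boundary alternatives never fire, and without them ^ and $ each contribute one zero-width match, so no empty tokens appear in either case.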

It may be simpler to iterate over non-delimiter sequences and use the resulting matches to locate the string components you want:

token = re.compile(r'[^\n\- ]+')
previous_end = 0
for match in token.finditer(string):
    do_something_with(string[previous_end:match.start()])
    previous_end = match.end()
do_something_with(string[previous_end:])
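Since the question asks for a generator, the loop above could be wrapped roughly as follows. This is only a sketch: iter_gaps is a hypothetical name, and it yields the text between tokens (including the possibly empty leading and trailing gaps) without modifying the string:

```python
import re

token = re.compile(r'[^\n\- ]+')  # one or more non-delimiter characters


def iter_gaps(string):
    """Hypothetical helper: lazily yield the text between consecutive tokens.

    The first and last items are the gaps before the first token and after
    the last one; they are empty strings when the input has no padding.
    """
    previous_end = 0
    for match in token.finditer(string):
        yield string[previous_end:match.start()]
        previous_end = match.end()
    yield string[previous_end:]


print(list(iter_gaps(" a--b ")))  # [' ', '--', ' ']
print(list(iter_gaps("a b")))     # ['', ' ', '']
```

If positions are more useful than text to the surrounding pipeline, the same loop could yield (previous_end, match.start()) pairs instead.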

The extra matches you were getting at the end of the string appear because, after matching the sequence of delimiters at the end, the regex engine looks for matches at the end again and finds a zero-width match for $.
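A small way to see this double match is to print the match spans for a string with trailing delimiters. The pattern and sample string here are simplified stand-ins chosen for illustration, and the spans shown assume Python 3.7 or later:

```python
import re

string = "ab  "  # two trailing spaces
spans = [m.span() for m in re.finditer(r"[ ]+|$", string)]

# The first span is the run of trailing spaces; the second is the
# zero-width match for $ at the very end of the string.
print(spans)  # [(2, 4), (4, 4)]
```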

The behavior you were getting at the beginning of the string for the ^|... pattern is trickier: the regex engine sees a zero-width match for ^ at the start of the string and emits it, without trying the other | alternatives. After the zero-width match, the engine needs to avoid producing that match again to avoid an infinite loop; this particular engine appears to do that by skipping a character, but the details are undocumented and the source is hard to navigate. (Here's part of the source, if you want to read it.) Note that the re documentation for Python 3.7 and later states that non-empty matches can start just after a previous empty match, so the skipped character is specific to older versions.

The behavior you were getting at the start of the string for the (?:^|...)+ pattern is even trickier. Executing this straightforwardly, the engine would look for a match for (?:^|...) at the start of the string, find ^, then look for another match, find ^ again, then look for another match, ad infinitum. There's some undocumented handling that stops it from going on forever, and this handling appears to produce a zero-width match, but I don't know what that handling is.

It sounds like you're just trying to return a list of all the "words" separated by any number of delimiting characters. You could instead use a regex group and a negated character class ([^...]) to achieve this:

import re

# match any number of consecutive non-delim chars
string = "  dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d  "
delimiters = r'\n\- '  # raw string, so the regex engine sees the \n and \- escapes
regex = r'([^{0}]+)'.format(delimiters)
for match in re.finditer(regex, string):
    print(match.group(0))

output:

dasdha
hasud
hasuid
hsuia
dhsuai
dhasiu
dhaui
d
