
How do I build a tokenizing regex-based iterator in Python?

I'm basing this question on an answer I gave to this other SO question, which was my specific attempt at a tokenizing regex-based iterator using more_itertools's pairwise iterator recipe.

Following is my code taken from that answer:

from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer(r"^|[ ]+|$", string)):
    print(string[prev.end(): curr.start()])  # originally I yield here

I then noticed that if the string starts or ends with delimiters (i.e. string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "), the tokenizer prints empty strings at the beginning and end of its token output (these are actually extra matches against the string start and string end). To remedy this I tried the following (quite ugly) attempts at other regexes:
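To make the problem concrete, here is a stand-alone reproduction of what I'm seeing (pairing consecutive matches with zip instead of more_itertools.pairwise, so it runs with only the standard library):

```python
import re

# Same pattern as above, on a string padded with delimiters.
string = " dasdha hasud "          # note the leading/trailing spaces
matches = list(re.finditer(r"^|[ ]+|$", string))
tokens = [string[prev.end():curr.start()]
          for prev, curr in zip(matches, matches[1:])]
print(tokens)  # first and last "tokens" come out as empty strings
```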

  1. "(?:^|[ ]|$)+" - this seems quite simple and like it should work, but it doesn't (and it also seems to behave wildly differently on other regex engines). For some reason it wouldn't build a single match from the string's start and the delimiters following it; the string start somehow also consumes the character following it! (This is also where I see divergence from other engines. Is this a BUG, or does it have something to do with how zero-width assertions interact with the alternation (|) operator in Python that I'm not aware of?) This attempt also did nothing for the double match containing the string's end: it matched the delimiters once, and then gave another match for the string-end ($) position by itself.

  2. "(?:[ ]|$|^)+" - putting the delimiters first actually solves one of the problems: the split at the beginning no longer contains the string start (not that I care much, since I'm interested in the tokens themselves), and it still matches the string start when there are no delimiters at the beginning of the string. The string ending, however, is still a problem.

  3. "(^[ ]*)|([ ]*$)|([ ]+)" - this final attempt got the string start to be part of the first match (which wasn't really much of a problem in the first place), but try as I might I couldn't get rid of the "delimiters + end" match followed by a separate "end" match (which yields an additional empty string). Still, I'm showing this example (with grouping) since it demonstrates that the end-of-string anchor $ is matched twice: once together with the preceding delimiters and once by itself (two group-2 matches).

My questions are:

  1. Why do I get such strange behavior in attempt #1?
  2. How do I solve the end-of-string issue?
  3. Am I being a tank, i.e. is there a simple way to solve this that I'm blindly missing?
  4. Remember that the solution can't change the string and must produce a generator which iterates over the spaces between the tokens, not the tokens themselves. (This last part might seem to complicate the answer unnecessarily; if you must know, it's part of a bigger framework I'm building, where this yielding method is inherited by a pipeline that constructs sentences out of the yields in various patterns, which are then used to extract fields from semi-structured, classifier-driven messages.)

The problems you're having are due to the trickiness and undocumented edge cases of zero-width matches. You can resolve them by using negative lookarounds to explicitly tell Python not to produce a match for ^ or $ if the string has delimiters at the start or end:

import re

delimiter_re = r'[\n\- ]'     # newline, hyphen, or space
search_regex = r'''^(?!{0})   # string start with no delimiter
                   |          # or
                   {0}+       # sequence of delimiters (at least one)
                   |          # or
                   (?<!{0})$  # string end with no delimiter
                '''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)

Note that this will produce one match in an empty string, not zero, and not separate beginning and ending matches.
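For example, a quick check (pairing consecutive matches with zip so it runs without more_itertools) shows the pattern produces exactly the delimiter runs, with no empty tokens whether or not the string is padded:

```python
import re

delimiter_re = r'[\n\- ]'     # newline, hyphen, or space
search_pattern = re.compile(r'''^(?!{0})   # string start with no delimiter
                                |{0}+      # sequence of delimiters
                                |(?<!{0})$ # string end with no delimiter
                             '''.format(delimiter_re), re.VERBOSE)

def tokens(s):
    # Each token lies between two consecutive matches of the pattern.
    matches = list(search_pattern.finditer(s))
    return [s[a.end():b.start()] for a, b in zip(matches, matches[1:])]

print(tokens(" spam- eggs "))   # no empty tokens despite the padding
print(tokens("spam eggs"))      # zero-width ^ and $ matches fill in here
```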

It may be simpler to iterate over non-delimiter sequences and use the resulting matches to locate the string components you want:

token = re.compile(r'[^\n\- ]+')
previous_end = 0
for match in token.finditer(string):
    do_something_with(string[previous_end:match.start()])
    previous_end = match.end()
do_something_with(string[previous_end:])
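Since requirement 4 asks for a generator over the gaps, the loop above can be wrapped directly. This is a sketch (the gap ordering is assumed from the loop above); note that it yields empty strings at the edges when the string starts or ends with a token:

```python
import re

def iter_gaps(s, token_re=r'[^\n\- ]+'):
    """Yield the delimiter runs before, between, and after the tokens."""
    previous_end = 0
    for match in re.finditer(token_re, s):
        yield s[previous_end:match.start()]  # gap preceding this token
        previous_end = match.end()
    yield s[previous_end:]                   # trailing gap (may be empty)

print(list(iter_gaps(" a-b ")))  # [' ', '-', ' ']
```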

The extra matches you were getting at the end of the string were because after matching the sequence of delimiters at the end, the regex engine looks for matches at the end again, and finds a zero-width match for $.
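You can see both end matches directly with the question's third pattern (spans shown are what CPython 3.7+ produces):

```python
import re

# The trailing delimiter run and the bare $ each produce a match at
# the end of "a ": one for " $" and a second zero-width one for "$".
spans = [m.span() for m in re.finditer(r"(^[ ]*)|([ ]*$)|([ ]+)", "a ")]
print(spans)  # the final (2, 2) entry is the extra zero-width $ match
```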

The behavior you were getting at the beginning of the string for the ^|... pattern is trickier: the regex engine sees a zero-width match for ^ at the start of the string and emits it, without trying the other | alternatives. After the zero-width match, the engine needs to avoid producing that match again to avoid an infinite loop; this particular engine appears to do that by skipping a character, but the details are undocumented and the source is hard to navigate. (Here's part of the source, if you want to read it.)

The behavior you were getting at the start of the string for the (?:^|...)+ pattern is even trickier. Executing this straightforwardly, the engine would look for a match for (?:^|...) at the start of the string, find ^ , then look for another match, find ^ again, then look for another match ad infinitum. There's some undocumented handling that stops it from going on forever, and this handling appears to produce a zero-width match, but I don't know what that handling is.

It sounds like you're just trying to return a list of all the "words" separated by any number of delimiting chars. You could instead match the tokens directly with a negated character class ([^...]):

import re

# match any number of consecutive non-delimiter chars
string = "  dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d  "
delimiters = r'\n\- '   # raw string: '\-' is an invalid escape otherwise
regex = r'[^{0}]+'.format(delimiters)
for match in re.finditer(regex, string):
    print(match.group(0))

output:

dasdha
hasud
hasuid
hsuia
dhsuai
dhasiu
dhaui
d
