
Python, splitting strings on middle characters with overlapping matches using regex

In Python, I am using regular expressions to retrieve the strings from a dictionary that follow a specific pattern, such as a run of repeated characters, then a specific character, then another repeated part (e.g. ^(\w{0,2})o(\w{0,2})$ ).

This works as expected, but now I'd like to split each string into two substrings (one of which may be empty), using the central character as the delimiter. The issue stems from the possibility of multiple overlapping matches inside a string (e.g. I'd want to use the previous regex to split the string room in two different ways, (r, om) and (ro, m)).

Neither re.search().groups() nor re.findall() solved this issue, and the docs for the re module seem to indicate that these methods do not return overlapping matches.

Here is a snippet showing the undesired behaviour:

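import re
dictionary = ('room', 'door', 'window', 'desk', 'for')
# Raw string so that \w reaches the regex engine unchanged
regex = re.compile(r'^(\w{0,2})o(\w{0,2})$')
halves = []
for word in dictionary:
    # findall returns at most one (left, right) tuple per word here
    matches = regex.findall(word)
    if matches:
        halves.append(matches)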
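To make the problem concrete, here is a minimal check of what the anchored pattern from the snippet returns for the single word room. The variable name room_matches is only illustrative and is not part of the original code.

```python
import re

# Same anchored pattern as in the snippet above
regex = re.compile(r'^(\w{0,2})o(\w{0,2})$')

# The first group is greedy, so only one of the two possible splits is reported
room_matches = regex.findall('room')
print(room_matches)  # [('ro', 'm')]
```

The split ('r', 'om') is also consistent with the pattern, but it is never returned, because the match spans the whole word and there is nothing left for findall to scan.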

I am posting this as an answer mainly so that the question is not left unanswered in case someone stumbles on it in the future. Since I've managed to reach the desired behaviour, albeit probably not in a very pythonic way, it might be useful as a starting point for someone else. Suggestions on how to improve this answer (i.e. making it more "pythonic" or simply more efficient) would be very welcome.

The only way I found to get all the possible splits of words whose length lies in a certain range and which contain a given character within a certain range of positions, using the characters in the "legal" positions as delimiters, involves multiple regexes, with both the re module and the newer regex module. This snippet builds an appropriate regex at runtime from the length range of the word, the character to look for and the range of possible positions of that character.

import re

dictionary = ('room', 'roam', 'flow', 'door', 'window',
              'desk', 'for', 'fo', 'foo', 'of', 'sorrow')
char = 'o'
word_len = (3, 6)   # allowed word lengths (min, max)
char_pos = (2, 3)   # allowed 1-based positions of char (first, last)
# Lookaheads: the word length is in range and char sits at a legal position
regex_str = (r'(?=^\w{' + str(word_len[0]) + ',' + str(word_len[1]) + r'}$)(?=\w{'
             + str(char_pos[0]-1) + ',' + str(char_pos[1]-1) + '}' + char + ')')
halves = []
for word in dictionary:
    matches = re.match(regex_str, word)
    if matches:
        matched_halves = []
        for pos in range(char_pos[0]-1, char_pos[1]):
            # Split only on a char preceded by exactly pos characters
            split_regex_str = r'(?<=^\w{' + str(pos) + '})' + char
            split_word = re.split(split_regex_str, word)
            if len(split_word) == 2:
                matched_halves.append(split_word)
        halves.append(matched_halves)

The output is:

[[['r', 'om'], ['ro', 'm']], [['r', 'am']], [['fl', 'w']], [['d', 'or'], ['do', 'r']], [['f', 'r']], [['f', 'o'], ['fo', '']], [['s', 'rrow']]]

At this point I might start considering using a regex only to find the words to be split, and then doing the splitting in a 'dumb' way, simply checking whether the characters at the positions in the range are equal to char. Anyhow, any remarks are greatly appreciated.
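For what it's worth, the 'dumb' approach could look roughly like the sketch below, which uses plain slicing instead of regexes. The helper name split_on_char is hypothetical, and unlike the regex version it does not check that the word consists only of \w characters; that is an assumption about the input.

```python
def split_on_char(word, char, word_len, char_pos):
    """Return every [left, right] split of word on char at an allowed position.

    word_len is (min, max) length; char_pos is (first, last), 1-based inclusive.
    Hypothetical helper, not part of the original answer.
    """
    min_len, max_len = word_len
    first, last = char_pos
    if not min_len <= len(word) <= max_len:
        return []
    # Check each legal index directly and slice around it
    return [[word[:i], word[i + 1:]]
            for i in range(first - 1, min(last, len(word)))
            if word[i] == char]


dictionary = ('room', 'roam', 'flow', 'door', 'window',
              'desk', 'for', 'fo', 'foo', 'of', 'sorrow')
# Keep only the words that produced at least one split
halves = [splits
          for splits in (split_on_char(w, 'o', (3, 6), (2, 3)) for w in dictionary)
          if splits]
print(halves)
```

For this dictionary the sketch should produce the same nested list as the regex-based snippet, without compiling a new pattern for every position.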

EDIT: Fixed.

Does a simple while loop work?

What you want is re.search, called in a loop that shifts the start position by 1 each time: https://docs.python.org/2/library/re.html

>>> import re
>>> dictionary = ('room', 'door', 'window', 'desk', 'for')
>>> regex = re.compile(r'(\w{0,2})o(\w{0,2})')
>>> halves = []
>>> for word in dictionary:
...     start = 0
...     while start < len(word):
...         match = regex.search(word, start)
...         if match:
...             # restart the search one character after this match began
...             start = match.start() + 1
...             halves.append([match.group(1), match.group(2)])
...         else:
...             # no matches left
...             break
...
>>> print(halves)
[['ro', 'm'], ['o', 'm'], ['', 'm'], ['do', 'r'], ['o', 'r'], ['', 'r'], ['nd', 'w'], ['d', 'w'], ['', 'w'], ['f', 'r'], ['', 'r']]
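Note that, because this pattern is unanchored, entries such as ['o', 'm'] do not cover the whole word. If the whole-word splits from the question are what is wanted, one possible alternative (a sketch, not part of the original answer) is to iterate over every occurrence of the delimiter with re.finditer and validate the two sides separately. The function name overlapping_splits and its parameters are hypothetical, and re.fullmatch requires Python 3.4 or later.

```python
import re


def overlapping_splits(word, char='o', max_side=2):
    """Return all (left, right) splits of word on char where each side
    is at most max_side word characters long. Hypothetical helper."""
    side = re.compile(r'\w{0,%d}' % max_side)
    splits = []
    # Every occurrence of char is a candidate delimiter
    for m in re.finditer(re.escape(char), word):
        left, right = word[:m.start()], word[m.end():]
        # Both halves must satisfy the side pattern in full
        if side.fullmatch(left) and side.fullmatch(right):
            splits.append((left, right))
    return splits


print(overlapping_splits('room'))    # [('r', 'om'), ('ro', 'm')]
print(overlapping_splits('window'))  # []
```

Each candidate delimiter is tested on its own, so overlapping splits of the same word do not interfere with one another.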
