Check if a position in string is within a pair of certain characters

In Python, what is the most efficient way to determine whether a position in a string lies within a pair of certain character sequences?

       0--------------16-------------------37---------48--------57
       |               |                    |          |        |
cost=r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."

In the string cost, I want to find out whether a given position is enclosed by a pair of $ characters or lies within \( and \).

For the string cost, a function is_maths(cost, x) would return True for x in [37, 38, 39, 48] and False everywhere else.

The motivation is to find the positions that are valid LaTeX maths; any alternative efficient approach in Python is also welcome.

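You'll need to parse the string up to the requested position and, if that position is inside a pair of valid LaTeX environment delimiters, up to the closing delimiter, before you can answer with True or False. That's because you have to process each relevant metacharacter (backslashes, dollars and parentheses) to determine its effect.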

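As I understand it, LaTeX's $...$ and \(...\) environment delimiters cannot be nested, so you don't have to worry about nested statements here; you only need to find the nearest complete $...$ or \(...\) pair.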

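You can't just match the literal $, \( or \) characters, however, because each of them can be preceded by an arbitrary number of \ backslashes. Instead, tokenize the input string on backslashes, dollars and parentheses, then iterate over the tokens in order and track what was matched last to determine their effect (escaping the next character, or opening and closing a maths environment).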
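To illustrate why a literal match is not enough, here is a minimal sketch that decides whether the character at a given index is escaped by counting the backslashes directly in front of it. The helper name is_escaped is hypothetical and is not part of the parser itself; an odd count means the character is escaped, while an even count means the backslashes escape each other.

```python
def is_escaped(s, index):
    """Return True if s[index] is preceded by an odd number of backslashes.

    Hypothetical helper, for illustration only.
    """
    count = 0
    i = index - 1
    # walk backwards over the run of backslashes ending just before index
    while i >= 0 and s[i] == '\\':
        count += 1
        i -= 1
    return count % 2 == 1


print(is_escaped(r'costs \$1', 7))    # True: \$ is a literal dollar
print(is_escaped(r'costs \\$x$', 8))  # False: \\ is a literal backslash
```

Calling such a helper for every delimiter would rescan the same backslashes repeatedly, which is one reason to track the escape state in a single left-to-right pass instead.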

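There is no need to continue parsing once you are past the requested position and outside of a maths environment; at that point you already have your answer and can return False early.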

Here is my implementation of such a parser:

import re

_maths_pairs = {
    # keys are opening characters, values matching closing characters
    # each is a tuple of char (string), escaped (boolean)
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
_tokens = re.compile(r'[\\$()]')

def _tokenize(s):
    """Generator that produces token, pos, prev_pos tuples for s

    * token is a single character: a backslash, dollar or parenthesis
    * pos is the index into s for that token
    * prev_pos is the position of the preceding token, or -1 if there
      was no preceding token

    """
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match[0], match.start()
        yield token, pos, prev_pos
        prev_pos = pos

def is_maths(s, pos):
    """Determines if pos in s is within a LaTeX maths environment"""
    expected_closer = None  # (char, escaped) if within $...$ or \(...\)
    opener_pos = None  # position of last opener character
    escaped = False  # True if the most recent token was an escaping backslash

    for token, token_pos, prev_pos in _tokenize(s):
        if expected_closer is None and token_pos > pos:
            # we are past the desired position, it'll never be within a
            # maths environment.
            return False

        # if there was more text between the current token and the last
        # backslash, then that backslash applied to something else.
        if escaped and token_pos > prev_pos + 1:
            escaped = False

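        if token == '\\':
            # toggle the escaped flag; doubled escapes negate
            escaped = not escaped
            continue

        if (token, escaped) == expected_closer:
            # an escaped closer such as \) starts at its backslash
            closer_start = token_pos - 1 if escaped else token_pos
            if opener_pos < pos < closer_start:
                # position is after the opener, before the closer
                # so within a maths environment.
                return True
            expected_closer = None
        elif expected_closer is None and (token, escaped) in _maths_pairs:
            expected_closer = _maths_pairs[(token, escaped)]
            opener_pos = token_pos

        # a dollar or parenthesis consumes any pending escape, so that
        # input such as \$$x$ still opens an environment at the second $
        escaped = False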

    return False

Demo:

>>> cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
>>> is_maths(cost, 0)  # should be False
False
>>> is_maths(cost, 16)  # should be False, preceding $ is escaped
False
>>> is_maths(cost, 37)  # should be True, within $...$
True
>>> is_maths(cost, 48)  # should be True, within \(...\)
True
>>> is_maths(cost, 57)  # should be False, within unescaped (...)
False

And some additional tests to show that escapes are handled correctly:

>>> is_maths(r'Doubled escapes negate: \\$x^2$', 27)  # should be true
True
>>> is_maths(r'Doubled escapes negate: \\(x\\)', 27)  # no longer escaped, so false
False
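If many positions in the same string need checking, one possible alternative is to collect all maths spans in a single pass and then look positions up by binary search. The following is only a sketch under that assumption: the names maths_spans, in_spans, _PAIRS and _DELIM are hypothetical, and the tokenizing is done with one regular expression that consumes a backslash together with the character it escapes.

```python
import re
from bisect import bisect_right

# opener -> closer, each a (char, escaped) tuple
_PAIRS = {
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
# an escaped character (backslash plus any one character) or a bare dollar
_DELIM = re.compile(r'\\(.)|\$', re.DOTALL)


def maths_spans(s):
    """Return sorted (start, end) pairs; s[start:end] is the maths content."""
    spans = []
    closer = None   # expected (char, escaped) closer while inside maths
    start = None    # index of the first character after the opener
    for m in _DELIM.finditer(s):
        escaped = m.group(1) is not None
        token = (m.group(1) if escaped else '$', escaped)
        if closer is None:
            if token in _PAIRS:
                closer = _PAIRS[token]
                start = m.end()
        elif token == closer:
            spans.append((start, m.start()))
            closer = None
    return spans


def in_spans(spans, pos):
    """Return True if pos lies inside one of the sorted, disjoint spans."""
    # index of the last span whose start is <= pos, or -1 if there is none
    i = bisect_right(spans, (pos, float('inf'))) - 1
    return i >= 0 and spans[i][0] <= pos < spans[i][1]


cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
spans = maths_spans(cost)
print(spans)  # [(37, 40), (48, 49)]
print([x for x in range(len(cost)) if in_spans(spans, x)])  # [37, 38, 39, 48]
```

Building the spans costs one pass over the string; each lookup afterwards is logarithmic in the number of spans. Like the parser above, this sketch silently skips delimiters that do not fit the current state.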

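My implementation deliberately ignores malformed LaTeX: unescaped $ characters within \(...\) and escaped \( or \) characters within $...$ are ignored, as are further \( openers inside a \(...\) sequence and \) closers without a matching \( opener before them. This ensures that the function keeps working even when given input that LaTeX itself would not render. The parser could, however, be altered to raise an exception or return False in those cases. To do that, add a global set created from _maths_pairs.keys() | _maths_pairs.values() and test (token, escaped) against that set whenever expected_closer is not None and (token, escaped) != expected_closer holds (to detect nested environment delimiters), and test for token == ')' and escaped and expected_closer is None to detect a \) closer without an opener.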
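A stricter variant along those lines could look like the following sketch. It assumes that malformed input should raise ValueError, and for brevity it tokenizes with a single regular expression rather than a token generator; the names is_maths_strict, _delimiters and _token are hypothetical.

```python
import re

_maths_pairs = {
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
# every (char, escaped) combination that opens or closes an environment
_delimiters = _maths_pairs.keys() | _maths_pairs.values()
# an escaped character (backslash plus any one character) or a bare dollar
_token = re.compile(r'\\(.)|\$', re.DOTALL)


def is_maths_strict(s, pos):
    """Like is_maths, but raise ValueError on misplaced delimiters."""
    expected_closer = None
    content_start = None
    result = False
    for m in _token.finditer(s):
        escaped = m.group(1) is not None
        token = (m.group(1) if escaped else '$', escaped)
        if token not in _delimiters:
            continue  # e.g. \$ or \\, not a maths delimiter
        if expected_closer is None:
            if token not in _maths_pairs:
                raise ValueError(f'closer without opener at index {m.start()}')
            expected_closer = _maths_pairs[token]
            content_start = m.end()
        elif token == expected_closer:
            if content_start <= pos < m.start():
                result = True
            expected_closer = None
        else:
            # a delimiter that is not the expected closer, inside maths
            raise ValueError(f'unexpected delimiter at index {m.start()}')
    return result


cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
print(is_maths_strict(cost, 37))  # True
print(is_maths_strict(cost, 16))  # False
```

Unlike the lenient parser, this sketch keeps scanning to the end of the string so that malformed delimiters after the requested position are reported as well. An environment that is never closed is still tolerated here.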

