Check if a position in string is within a pair of certain characters

In Python, what would be the most efficient way to figure out if a position in a string is within a pair of certain character sequences?

       0--------------16-------------------37---------48--------57
       |               |                    |          |        |
cost=r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."

In the string cost, I want to figure out whether a certain position is enclosed by a pair of $ or lies within \( and \).

For the string cost, a function is_maths(cost, x) would return True for x in [37, 38, 39, 48] and False everywhere else.

The motivation is to find valid LaTeX maths positions; any alternative efficient ways using Python are also welcome.

You'll need to parse the string up to the requested position and, if that position is inside a valid pair of LaTeX environment delimiters, up to the closing delimiter, to be able to answer with True or False. That's because you have to process each relevant metacharacter (backslashes, dollars and parentheses) to determine its effect.

As I understand it, LaTeX's $...$ and \(...\) environment delimiters can't be nested, so you don't have to worry about nested statements here; you only need to find the nearest complete $...$ or \(...\) pair.

You can't just match literal $ or \( or \) characters, however, because each of these could be preceded by an arbitrary number of \ backslashes. Instead, tokenize the input string on backslashes, dollars and parentheses, iterate over the tokens in order, and track what was last matched to determine each token's effect (escaping the next character, or opening and closing a maths environment).
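To illustrate why the number of preceding backslashes matters, here is a minimal sketch that checks the parity of the backslash run in front of a character. The helper name is_escaped is hypothetical and is not part of the parser described here; it only demonstrates the rule that an odd count escapes the character and an even count does not.

```python
def is_escaped(s, i):
    """Return True if s[i] is preceded by an odd number of backslashes."""
    backslashes = 0
    j = i - 1
    # walk left over the unbroken run of backslashes in front of s[i]
    while j >= 0 and s[j] == '\\':
        backslashes += 1
        j -= 1
    return backslashes % 2 == 1


print(is_escaped(r'\$1', 1))    # one backslash: the dollar is escaped
print(is_escaped(r'\\$x$', 2))  # two backslashes: the dollar is a real delimiter
```

Walking backwards like this for every delimiter repeats work on long backslash runs, which is one reason a single left-to-right pass that tracks the escape state is attractive.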

You don't need to continue parsing if you are past the requested position and outside of a maths environment section; you already have your answer then and can return False early.

Here's my implementation of such a parser:

import re

_maths_pairs = {
    # keys are opening characters, values matching closing characters
    # each is a tuple of char (string), escaped (boolean)
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
_tokens = re.compile(r'[\\$()]')

def _tokenize(s):
    """Generator that produces token, pos, prev_pos tuples for s

    * token is a single character: a backslash, dollar or parenthesis
    * pos is the index into s for that token
    * prev_pos is the position of the preceding token, or -1 if there
      was no preceding token

    """
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match[0], match.start()
        yield token, pos, prev_pos
        prev_pos = pos

def is_maths(s, pos):
    """Determines if pos in s is within a LaTeX maths environment"""
    expected_closer = None  # (char, escaped) if within $...$ or \(...\)
    opener_pos = None  # position of last opener character
    escaped = False  # True if the most recent token was an escaping backslash

    for token, token_pos, prev_pos in _tokenize(s):
        if expected_closer is None and token_pos > pos:
            # we are past the desired position, it'll never be within a
            # maths environment.
            return False

        # if there was more text between the current token and the last
        # backslash, then that backslash applied to something else.
        if escaped and token_pos > prev_pos + 1:
            escaped = False

        if token == '\\':
            # toggle the escaped flag; doubled escapes negate
            escaped = not escaped
            continue

        if (token, escaped) == expected_closer:
            # an escaped closer such as \) starts at its backslash, which
            # is not itself part of the maths content.
            closer_start = token_pos - 1 if escaped else token_pos
            if opener_pos < pos < closer_start:
                # position is after the opener, before the closer
                # so within a maths environment.
                return True
            expected_closer = None
        elif expected_closer is None and (token, escaped) in _maths_pairs:
            expected_closer = _maths_pairs[(token, escaped)]
            opener_pos = token_pos

        # the escaping backslash, if any, was used up by this token; without
        # this reset, directly adjacent delimiters such as \(\) are misread.
        escaped = False

    return False

Demo:

>>> cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
>>> is_maths(cost, 0)  # should be False
False
>>> is_maths(cost, 16)  # should be False, preceding $ is escaped
False
>>> is_maths(cost, 37)  # should be True, within $...$
True
>>> is_maths(cost, 48)  # should be True, within \(...\)
True
>>> is_maths(cost, 57)  # should be False, within unescaped (...)
False

and additional tests to show that escapes are handled correctly:

>>> is_maths(r'Doubled escapes negate: \\$x^2$', 27)  # should be true
True
>>> is_maths(r'Doubled escapes negate: \\(x\\)', 27)  # no longer escaped, so false
False
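If many positions have to be checked against the same string, one possible alternative is to scan the string once, collect the maths spans, and answer each query with a binary search. The sketch below is an assumption-laden variant rather than the parser above: the names maths_spans and in_spans are made up for illustration, and it uses a different tokenizing regex in which a backslash always consumes the character that follows it.

```python
import re
from bisect import bisect_right

# "\\." swallows a backslash together with the character it escapes
# (including a second backslash), so a bare "$" only matches when unescaped.
_DELIM = re.compile(r'\\.|\$', re.DOTALL)
_CLOSERS = {'$': '$', '\\(': '\\)'}


def maths_spans(s):
    """Return sorted (start, end) pairs; s[start:end] is maths content."""
    spans = []
    closer = None       # delimiter text that would close the open environment
    content_start = 0   # index of the first character after the opener
    for m in _DELIM.finditer(s):
        text = m.group()
        if closer is None:
            if text in _CLOSERS:
                closer = _CLOSERS[text]
                content_start = m.end()
        elif text == closer:
            spans.append((content_start, m.start()))
            closer = None
    return spans


def in_spans(spans, pos):
    """Check pos against precomputed spans with a binary search."""
    i = bisect_right(spans, (pos, float('inf'))) - 1
    return i >= 0 and spans[i][0] <= pos < spans[i][1]


cost = r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."
spans = maths_spans(cost)
print([x for x in range(len(cost)) if in_spans(spans, x)])  # [37, 38, 39, 48]
```

The scan is linear in the length of the string and each lookup is logarithmic in the number of spans. Like the parser above, this sketch silently ignores malformed input such as an opener that is never closed.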

My implementation studiously ignores malformed LaTeX issues; unescaped $ characters within \(...\) or escaped \( and \) characters within $...$ are ignored, as are further \( openers inside \(...\) sequences, or \) closers without a matching \( opener preceding. This makes sure the function continues to work even when given input that LaTeX itself would not render. The parser can be altered to throw an exception or return False in those cases, however. In that case you need to add a global set created from _maths_pairs.keys() | _maths_pairs.values() and test (token, escaped) against that set when expected_closer is not None and (token, escaped) != expected_closer (detecting nested environment delimiters), and test for token == ')' and escaped and expected_closer is None to detect a \) closer without an opener.
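As a rough sketch of that stricter behaviour, the standalone validator below raises ValueError for the two malformed cases just described. The function name check_delimiters, the set name _delimiters and the error messages are assumptions made for illustration; the same two checks could instead be folded into is_maths.

```python
import re

_maths_pairs = {('$', False): ('$', False), ('(', True): (')', True)}
# every (char, escaped) combination that acts as a maths delimiter
_delimiters = _maths_pairs.keys() | _maths_pairs.values()
_tokens = re.compile(r'[\\$()]')


def check_delimiters(s):
    """Raise ValueError on nested delimiters or a \\) without an opener."""
    expected_closer = None
    escaped = False
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match.group(), match.start()
        # a backslash followed by ordinary text escaped that text instead
        if escaped and pos > prev_pos + 1:
            escaped = False
        prev_pos = pos
        if token == '\\':
            escaped = not escaped
            continue
        key = (token, escaped)
        escaped = False  # the backslash, if any, is used up by this token
        if key == expected_closer:
            expected_closer = None
        elif expected_closer is not None:
            if key in _delimiters:
                raise ValueError(f'nested delimiter at index {pos}')
        elif key in _maths_pairs:
            expected_closer = _maths_pairs[key]
        elif key == (')', True):
            raise ValueError(f'closer without opener at index {pos}')


for text in (r'$x^2$ and \(x\)', r'\(a $b$ \)', r'x \) y'):
    try:
        check_delimiters(text)
        print(text, '-> ok')
    except ValueError as exc:
        print(text, '->', exc)
```

Raising an exception keeps the malformed-input policy separate from the position test; returning False instead would only require replacing the two raise statements.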
