
Check if a position in string is within a pair of certain characters

In Python, what is the most efficient way to determine whether a position in a string is within a pair of certain character sequences?

       0--------------16-------------------37---------48--------57
       |               |                    |          |        |
cost=r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."

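In the string cost, I want to find out whether a given position is enclosed by a pair of $ characters or lies within \( and \).

For the string cost, a function is_maths(cost, x) would return True for x in [37, 38, 39, 48] and evaluate to False everywhere else.

The motivation is to find the valid LaTeX maths positions; any alternative efficient approach in Python is also welcome.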

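You'll need to parse the string up to the requested position and, if that position is inside a pair of valid LaTeX environment delimiters, on to the closing delimiter, before you can answer with True or False. That's because you have to process each relevant metacharacter (backslashes, dollars and parentheses) to determine its effect.

As I understand it, LaTeX's $...$ and \(...\) environment delimiters can't be nested, so you don't have to worry about nested statements here; you only need to find the nearest complete $...$ or \(...\) pair.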

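However, you can't just match the literal $, \( or \) characters, because each of them could be preceded by an arbitrary number of \ backslashes. Instead, tokenize the input string on backslashes, dollars and parentheses, then iterate over the tokens in order, tracking what was last matched to determine each token's effect (escaping the next character, or opening and closing a maths environment).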
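To illustrate why a literal match is not enough, here is a minimal sketch that counts the backslashes directly in front of a character. The helper name is_escaped is hypothetical and is not part of the parser in this answer; it only shows the rule that an odd number of preceding backslashes escapes the character and an even number does not.

```python
def is_escaped(s, i):
    """Return True if s[i] is preceded by an odd number of backslashes."""
    count = 0
    j = i - 1
    # walk backwards over the run of backslashes that ends just before s[i]
    while j >= 0 and s[j] == '\\':
        count += 1
        j -= 1
    return count % 2 == 1


print(is_escaped(r'\$1', 1))      # True: a single backslash escapes the $
print(is_escaped(r'\\$x^2$', 2))  # False: the two backslashes cancel out
```

Calling such a helper for every candidate delimiter rescans the same backslashes repeatedly, which is why a single pass that tracks the escape state is the better fit here.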

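There is no need to continue parsing once you are past the requested position and outside of a maths environment; at that point you already have your answer and can return False early.

Here is my implementation of such a parser: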

import re

_maths_pairs = {
    # keys are opening characters, values matching closing characters
    # each is a tuple of char (string), escaped (boolean)
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
_tokens = re.compile(r'[\\$()]')

def _tokenize(s):
    """Generator that produces token, pos, prev_pos tuples for s

    * token is a single character: a backslash, dollar or parenthesis
    * pos is the index into s for that token
    * prev_pos is the position of the preceding token, or -1 if there
      was no preceding token

    """
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match[0], match.start()
        yield token, pos, prev_pos
        prev_pos = pos

def is_maths(s, pos):
    """Determines if pos in s is within a LaTeX maths environment"""
    expected_closer = None  # (char, escaped) if within $...$ or \(...\)
    opener_pos = None  # position of last opener character
    escaped = False  # True if the most recent token was an escaping backslash

    for token, token_pos, prev_pos in _tokenize(s):
        if expected_closer is None and token_pos > pos:
            # we are past the desired position, it'll never be within a
            # maths environment.
            return False

        # if there was more text between the current token and the last
        # backslash, then that backslash applied to something else.
        if escaped and token_pos > prev_pos + 1:
            escaped = False

        if token == '\\':
            # toggle the escaped flag; doubled escapes negate
            escaped = not escaped
            continue

        if (token, escaped) == expected_closer:
            # an escaped closer starts at its backslash, one position earlier
            closer_pos = token_pos - 1 if escaped else token_pos
            if opener_pos < pos < closer_pos:
                # position is after the opener, before the closer
                # so within a maths environment.
                return True
            expected_closer = None
        elif expected_closer is None and (token, escaped) in _maths_pairs:
            expected_closer = _maths_pairs[(token, escaped)]
            opener_pos = token_pos

        # a backslash only escapes the single token that follows it
        escaped = False

    return False

Demo:

>>> cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
>>> is_maths(cost, 0)  # should be False
False
>>> is_maths(cost, 16)  # should be False, preceding $ is escaped
False
>>> is_maths(cost, 37)  # should be True, within $...$
True
>>> is_maths(cost, 48)  # should be True, within \(...\)
True
>>> is_maths(cost, 57)  # should be False, within unescaped (...)
False

And additional tests to show that escapes are handled correctly:

>>> is_maths(r'Doubled escapes negate: \\$x^2$', 27)  # should be true
True
>>> is_maths(r'Doubled escapes negate: \\(x\\)', 27)  # no longer escaped, so false
False
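If many positions in the same string need to be checked, one possible alternative is to scan the string once, record every maths span, and answer each query with a binary search. The sketch below is an assumption-laden illustration, not part of the parser above: the names maths_spans and make_is_maths are hypothetical, malformed input is silently ignored, and the backslash of a closing \) is treated as part of the delimiter so that the result matches the positions listed in the question.

```python
from bisect import bisect_left


def maths_spans(s):
    """Return sorted (opener_pos, closer_start) pairs for $...$ and \\(...\\)."""
    spans = []
    closer = None      # (char, escaped) we are waiting for, or None
    opener_pos = None  # index of the opening $ or ( character
    escaped = False    # True if the previous character was an escaping backslash
    for i, ch in enumerate(s):
        if ch == '\\':
            escaped = not escaped  # doubled backslashes cancel out
            continue
        key = (ch, escaped)
        escaped = False  # a backslash only affects the next character
        if closer is None:
            if key == ('$', False):
                closer, opener_pos = ('$', False), i
            elif key == ('(', True):
                closer, opener_pos = (')', True), i
        elif key == closer:
            # an escaped closer starts at its backslash
            spans.append((opener_pos, i - 1 if key[1] else i))
            closer = None
    return spans


def make_is_maths(s):
    """Precompute the spans of s and return a fast per-position check."""
    spans = maths_spans(s)
    starts = [start for start, _ in spans]

    def is_maths_at(pos):
        idx = bisect_left(starts, pos) - 1  # last span opened before pos
        return idx >= 0 and pos < spans[idx][1]

    return is_maths_at


cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
check = make_is_maths(cost)
print([p for p in range(len(cost)) if check(p)])  # [37, 38, 39, 48]
```

The single scan is linear in the length of the string, and each later lookup costs only a binary search over the recorded spans.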

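My implementation studiously ignores malformed LaTeX issues: unescaped $ characters within \(...\) and escaped \( or \) characters within $...$ are ignored, as are further \( openers within a \(...\) sequence and \) closers without a matching preceding \( opener. This ensures that the function keeps working even when given input that LaTeX itself wouldn't render.

The parser could be altered to raise an exception or return False in those cases, however. You would then need to add a global set created from _maths_pairs.keys() | _maths_pairs.values(), test (token, escaped) against that set whenever expected_closer is not None and (token, escaped) != expected_closer is true (to detect nested environment delimiters), and test for token == ')' and escaped and expected_closer is None to detect a \) closer without an opener.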
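As a rough sketch of what such a strict check could look like, the standalone function below raises ValueError for nested delimiters and for a \) without an opener. The names check_maths_delimiters, _PAIRS and _DELIMS are hypothetical, and the function only validates a whole string; it is one way to read the description, not a drop-in change to is_maths.

```python
_PAIRS = {
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
# every (char, escaped) combination that acts as a maths delimiter
_DELIMS = set(_PAIRS) | set(_PAIRS.values())


def check_maths_delimiters(s):
    """Raise ValueError if s has nested or unmatched maths delimiters."""
    closer = None    # (char, escaped) we are waiting for, or None
    escaped = False  # True if the previous character was an escaping backslash
    for i, ch in enumerate(s):
        if ch == '\\':
            escaped = not escaped
            continue
        key = (ch, escaped)
        escaped = False  # a backslash only affects the next character
        if closer is not None:
            if key == closer:
                closer = None
            elif key in _DELIMS:
                # any other delimiter inside an open environment is an error
                raise ValueError(f'unexpected maths delimiter at index {i}')
        elif key in _PAIRS:
            closer = _PAIRS[key]
        elif key == (')', True):
            raise ValueError(f'closing delimiter without opener at index {i}')


check_maths_delimiters(r'price $x^2$ for \(x\) item(s)')  # passes silently
try:
    check_maths_delimiters(r'\(a $ b\)')
except ValueError as exc:
    print('rejected:', exc)
```

Whether to raise or to return False in these cases depends on whether the caller wants to hear about input that LaTeX would refuse to render.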

