Check if a position in a string is within a pair of certain characters
In Python, what is the most efficient way to determine whether a position in a string lies within a pair of certain character sequences?
       0---------------16-------------------37---------48-------57
       |               |                    |          |        |
cost=r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."
In the string `cost`, I want to figure out whether a given position is enclosed by a pair of `$` characters, or lies between `\(` and `\)`.
For the string `cost`, a function `is_maths(cost, x)` would return `True` for `x` in `[37, 38, 39, 48]` and `False` everywhere else.
The motivation is to identify valid LaTeX maths positions; alternative efficient approaches using Python are also welcome.
You'll need to parse the string up to the requested position and, if that position is inside a valid pair of LaTeX environment delimiters, on to the closing delimiter, before you can answer `True` or `False`. That's because you have to process each relevant metacharacter (backslashes, dollars and parentheses) to determine its effect.
As I understand it, LaTeX's `$...$` and `\(...\)` environment delimiters can't be nested, so you don't have to worry about nested statements here; you only need to find the nearest complete `$...$` or `\(...\)` pair.
You can't just match literal `$`, `\(` or `\)` characters, however, because each of these could be preceded by an arbitrary number of `\` backslashes. Instead, tokenize the input string on backslashes, dollars and parentheses, iterate over the tokens in order, and track what was last matched to determine each token's effect (escaping the next character, or opening or closing a maths environment).
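To illustrate why the number of preceding backslashes matters, here is a minimal sketch. The helper name `is_escaped` is hypothetical and not part of the parser in this answer; it only assumes that an odd run of backslashes escapes the character that follows, while an even run cancels out.

```python
def is_escaped(s, i):
    """Return True if s[i] is preceded by an odd number of consecutive
    backslashes (hypothetical helper, for illustration only)."""
    count = 0
    j = i - 1
    # walk left over the run of backslashes directly before index i
    while j >= 0 and s[j] == '\\':
        count += 1
        j -= 1
    return count % 2 == 1


cost = r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."
print(is_escaped(cost, 15))        # True: the $ in \$1 is escaped
print(is_escaped(cost, 36))        # False: the $ that opens $x^2$
print(is_escaped(r'\\$x^2$', 2))   # False: the doubled backslash cancels out
```

Scanning backwards like this for every candidate character repeats work, which is one reason to track the escape state in a single left-to-right pass instead.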
You don't need to continue parsing once you are past the requested position and outside a maths environment; you already have your answer at that point and can return `False` early.
Here's my implementation of such a parser:
import re

_maths_pairs = {
    # keys are opening characters, values matching closing characters
    # each is a tuple of char (string), escaped (boolean)
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
_tokens = re.compile(r'[\\$()]')


def _tokenize(s):
    """Generator that produces token, pos, prev_pos tuples for s

    * token is a single character: a backslash, dollar or parenthesis
    * pos is the index into s for that token
    * prev_pos is the position of the preceding token, or -1 if there
      was no preceding token

    """
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match[0], match.start()
        yield token, pos, prev_pos
        prev_pos = pos


def is_maths(s, pos):
    """Determines if pos in s is within a LaTeX maths environment"""
    expected_closer = None  # (char, escaped) if within $...$ or \(...\)
    opener_pos = None  # position of last opener character
    escaped = False  # True if the most recent token was an escaping backslash

    for token, token_pos, prev_pos in _tokenize(s):
        if expected_closer is None and token_pos > pos:
            # we are past the desired position, it'll never be within a
            # maths environment.
            return False

        # if there was more text between the current token and the last
        # backslash, then that backslash applied to something else.
        if escaped and token_pos > prev_pos + 1:
            escaped = False

        if token == '\\':
            # toggle the escaped flag; doubled escapes negate
            escaped = not escaped
        elif (token, escaped) == expected_closer:
            # an escaped closer such as \) starts at its backslash, so the
            # backslash itself is not inside the maths environment
            closer_start = token_pos - 1 if escaped else token_pos
            if opener_pos < pos < closer_start:
                # position is after the opener, before the closer
                # so within a maths environment.
                return True
            expected_closer = None
        elif expected_closer is None and (token, escaped) in _maths_pairs:
            expected_closer = _maths_pairs[(token, escaped)]
            opener_pos = token_pos

        if token != '\\':
            # any non-backslash token consumes a pending escape, so that
            # adjacent sequences such as \$$x$ or \(x\)\(y\) parse correctly
            escaped = False

    return False
Demo:
>>> cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
>>> is_maths(cost, 0) # should be False
False
>>> is_maths(cost, 16) # should be False, preceding $ is escaped
False
>>> is_maths(cost, 37) # should be True, within $...$
True
>>> is_maths(cost, 48) # should be True, within \(...\)
True
>>> is_maths(cost, 57) # should be False, within unescaped (...)
False
Additional tests show that escapes are handled correctly:
>>> is_maths(r'Doubled escapes negate: \\$x^2$', 27) # should be true
True
>>> is_maths(r'Doubled escapes negate: \\(x\\)', 27) # no longer escaped, so false
False
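If many positions of the same string need checking, one option is to run the same token scan once and collect every complete maths span up front. The following is only a sketch of that idea: `maths_spans` and `positions_in_maths` are hypothetical names, and it assumes that the backslash of a closing `\)` counts as part of the delimiter rather than as maths content.

```python
import re

_TOKENS = re.compile(r'[\\$()]')
# (char, escaped) of each opener mapped to its matching closer
_PAIRS = {('$', False): ('$', False), ('(', True): (')', True)}


def maths_spans(s):
    """Return (opener_pos, closer_start) pairs for each complete maths section."""
    spans = []
    expected_closer = opener_pos = None
    escaped = False
    prev_pos = -1
    for match in _TOKENS.finditer(s):
        token, pos = match.group(), match.start()
        # a backslash followed by ordinary text escaped that text instead
        if escaped and pos > prev_pos + 1:
            escaped = False
        prev_pos = pos
        if token == '\\':
            escaped = not escaped  # doubled backslashes cancel out
            continue
        key, escaped = (token, escaped), False  # the escape is consumed here
        if key == expected_closer:
            # an escaped closer starts at its backslash
            closer_start = pos - 1 if key[1] else pos
            spans.append((opener_pos, closer_start))
            expected_closer = None
        elif expected_closer is None and key in _PAIRS:
            expected_closer, opener_pos = _PAIRS[key], pos
    return spans


def positions_in_maths(s):
    """List every index that lies strictly inside a maths section."""
    return [i for start, end in maths_spans(s) for i in range(start + 1, end)]


cost = r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."
print(maths_spans(cost))         # [(36, 40), (47, 49)]
print(positions_in_maths(cost))  # [37, 38, 39, 48]
```

A single position can then be answered from the precomputed spans, at the cost of always scanning the whole string rather than stopping early.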
My implementation studiously ignores malformed LaTeX: unescaped `$` characters within `\(...\)`, escaped `\(` and `\)` characters within `$...$`, further `\(` openers inside `\(...\)` sequences, and `\)` closers without a preceding matching `\(` opener are all ignored. This makes sure the function continues to work even when given input that LaTeX itself would not render. The parser can be altered to throw an exception or return `False` in those cases, however. To do so, add a global set created from `_maths_pairs.keys() | _maths_pairs.values()` and test `(token, escaped)` against that set whenever `expected_closer is not None and (token, escaped) != expected_closer` is true (detecting nested environment delimiters), and test for `token == ')' and escaped and expected_closer is None` to detect a `\)` closer without an opener.
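A strict variant along those lines might look like the sketch below. It is one possible reading of the two checks described above, not a definitive implementation: the exception class `MalformedMaths` and the function `check_delimiters` are hypothetical names, and an opener that is never closed is deliberately left unreported because the text above does not cover that case.

```python
import re

_maths_pairs = {('$', False): ('$', False), ('(', True): (')', True)}
# every (char, escaped) combination that acts as a maths delimiter
_delimiters = set(_maths_pairs) | set(_maths_pairs.values())
_tokens = re.compile(r'[\\$()]')


class MalformedMaths(ValueError):
    """Hypothetical error raised for misplaced maths delimiters."""


def check_delimiters(s):
    """Raise MalformedMaths on nested delimiters or a stray escaped closer."""
    expected_closer = None
    escaped = False
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match.group(), match.start()
        # a backslash followed by ordinary text escaped that text instead
        if escaped and pos > prev_pos + 1:
            escaped = False
        prev_pos = pos
        if token == '\\':
            escaped = not escaped
            continue
        key, escaped = (token, escaped), False
        if key == expected_closer:
            expected_closer = None
        elif expected_closer is not None and key in _delimiters:
            # another delimiter showed up inside an open maths section
            raise MalformedMaths(f'unexpected delimiter at position {pos}')
        elif expected_closer is None and key == (')', True):
            # an escaped closing parenthesis with no opener before it
            raise MalformedMaths(f'closer without opener at position {pos}')
        elif expected_closer is None and key in _maths_pairs:
            expected_closer = _maths_pairs[key]


cost = r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."
check_delimiters(cost)  # well-formed, so nothing is raised
try:
    check_delimiters(r'\(a $b\)')
except MalformedMaths as exc:
    print(exc)  # unexpected delimiter at position 4
```

Returning `False` instead of raising would only require replacing the two `raise` statements with a `return False` and ending the function with `return True`.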