Why is searching with a regular expression in Python "not completely equivalent to slicing the string"?

Question

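As the documentation states, using regex.search(string, pos, endpos) is not completely equivalent to slicing the string, i.e. regex.search(string[pos:endpos]). It does not perform the regex match as if the string started at pos, so ^ does not match the beginning of the substring, only the real beginning of the whole string. However, $ matches either the end of the substring or the end of the whole string.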

    >>> re.compile('^am').findall('I am falling in code', 2, 12)
    []        # am is not at the beginning
    >>> re.compile('^am').findall('I am falling in code'[2:12])
    ['am']    # am is the beginning
    >>> re.compile('ing$').findall('I am falling in code', 2, 12)
    ['ing']   # ing is the ending
    >>> re.compile('ing$').findall('I am falling in code'[2:12])
    ['ing']   # ing is the ending

    >>> re.compile('(?<= )am').findall('I am falling in code', 2, 12)
    ['am']    # before am there is a space
    >>> re.compile('(?<= )am').findall('I am falling in code'[2:12])
    []        # before am there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code', 2, 12)
    []        # after ing there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code'[2:12])
    []        # after ing there is no space

    >>> re.compile(r'\bm.....').findall('I am falling in code', 3, 11)
    []
    >>> re.compile(r'\bm.....').findall('I am falling in code'[3:11])
    ['m fall']
    >>> re.compile(r'.....n\b').findall('I am falling in code', 3, 11)
    ['fallin']
    >>> re.compile(r'.....n\b').findall('I am falling in code'[3:11])
    ['fallin']

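My question is: why is there an inconsistency between matching at the beginning and matching at the end? Why does using pos and endpos treat the end as the real end, while the start is not treated as the real start?

Is there any way to make pos and endpos imitate slicing? Python copies the string when slicing instead of just referencing the old one, so when working with a large string many times it would be more efficient to use pos and endpos than to slice.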

Answer 1

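The starting position argument pos is especially useful for writing lexical analysers, for example. The performance difference between slicing a string with [pos:] and passing the pos parameter might seem insignificant, but it is not; see, for example, this bug report in the JsLex lexer.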
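To make the cost concrete, here is a rough sketch (not taken from the bug report) that scans a string token by token in two ways: once by re-slicing the remainder, and once by advancing pos. The names scan_by_slicing, scan_by_pos and the sample input are made up for illustration, and timings depend on the machine, so none are quoted here.

```python
import re
import timeit

WORD = re.compile(r'\w+\s*')

def scan_by_slicing(text):
    """Consume tokens by re-slicing the remaining text (copies it each time)."""
    count = 0
    while text:
        m = WORD.match(text)
        if m is None:
            break
        text = text[m.end():]   # builds a new string holding the rest
        count += 1
    return count

def scan_by_pos(text):
    """Consume tokens by advancing pos; the string itself is never copied."""
    count, pos = 0, 0
    while pos < len(text):
        m = WORD.match(text, pos)
        if m is None:
            break
        pos = m.end()
        count += 1
    return count

sample = 'word ' * 2000

if __name__ == '__main__':
    print('slicing:', timeit.timeit(lambda: scan_by_slicing(sample), number=5))
    print('pos:    ', timeit.timeit(lambda: scan_by_pos(sample), number=5))
```

Both functions find the same tokens; only the amount of copying differs, and that grows with the length of the input.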

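Indeed, ^ matches at the real beginning of the string or, if MULTILINE is specified, also at the beginning of a line. This is by design, so that a regex-based scanner can easily distinguish the real beginning of a line or of the input from any other point on a line or within the input.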
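A small illustration of the MULTILINE case, using a variant of the question's string with a newline inserted (the variable names are mine): with the flag set, ^ matches at pos when pos happens to be a real line start, because the engine still looks at the character before pos.

```python
import re

text = 'I\nam falling in code'   # 'am' starts the second line, at index 2

plain = re.compile('^am')
multi = re.compile('^am', re.MULTILINE)

# Without MULTILINE, ^ only matches at index 0, whatever pos is.
no_flag = plain.findall(text, 2)
# With MULTILINE, ^ also matches right after the newline at index 1.
with_flag = multi.findall(text, 2)

print(no_flag, with_flag)
```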

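Note that you can also use the regex.match(string[, pos[, endpos]]) method to anchor the match at the beginning of the string or at the position specified by pos. So instead of doing

    >>> re.compile('^am').findall('I am falling in code', 2, 12)
    []

you would typically implement a scanner as

    >>> match = re.compile('am').match('I am falling in code', 2, 12)
    >>> match
    <_sre.SRE_Match object; span=(2, 4), match='am'>

and then set pos to match.end() (which returns 4 in this case) for the successive matching operations.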
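The scanner loop described here could look roughly like the following. This is only a sketch: the token set and the tokenize function are hypothetical, invented for a toy arithmetic language, and a real lexer would need better error handling.

```python
import re

# Hypothetical token set for a toy arithmetic language.
TOKEN = re.compile(
    r'(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[-+*/=()])|(?P<SKIP>\s+)'
)

def tokenize(text, pos=0, endpos=None):
    """Yield (kind, value) pairs, matching exactly at pos each time."""
    if endpos is None:
        endpos = len(text)
    while pos < endpos:
        m = TOKEN.match(text, pos, endpos)
        if m is None:
            raise ValueError('unexpected character %r at index %d' % (text[pos], pos))
        if m.lastgroup != 'SKIP':
            yield m.lastgroup, m.group()
        pos = m.end()   # continue right where this token ended

tokens = list(tokenize('x = 40 + 2'))
print(tokens)
```

Because match is anchored at pos, the loop never skips over input it cannot tokenize, and the original string is never sliced.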

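The match must start exactly at pos:

    >>> re.compile('am').match('I am falling in code', 1, 12)
    >>>

(Notice how .match is anchored at the start of the searched range as if by an implicit ^, but is not anchored at the end. This is often a source of errors, because people believe match has both an implicit ^ and an implicit $. Python 3.4 added regex.fullmatch, which does exactly that.)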
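For comparison, a short sketch of match versus fullmatch on the question's string, assuming Python 3.4 or later (the variable names are only for illustration):

```python
import re

s = 'I am falling in code'
am = re.compile('am')

prefix = am.match(s, 2, 12)      # anchored at pos only; trailing text is allowed
whole = am.fullmatch(s, 2, 12)   # must cover s[2:12] entirely, so there is no match
exact = am.fullmatch(s, 2, 4)    # the range 2..4 is exactly 'am'

print(prefix.span(), whole, exact.span())
```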

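As for why the endpos parameter is not consistent with pos, I do not know exactly, but it makes some sense to me: Python 2 has no fullmatch, so anchoring with $ is the only way there to ensure that the entire span is matched.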

Answer 2

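This sounds like a bug in Python, but if you want to slice by reference instead of copying the string, you can use the Python built-in buffer.

For example:

    # buffer() exists in Python 2 only
    s = "long string" * 100
    buf = buffer(s)
    substr = buf[5:15]

This creates a substring without copying the data, so it should allow large strings to be split efficiently.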
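Note that buffer was removed in Python 3. The closest counterpart there is memoryview, which only works on bytes-like data, not on str. The following is a sketch under that assumption (the data is bytes and the patterns are bytes patterns); a sliced memoryview is a window onto the original data, and re treats the start of the view as the start of the input, just as it would with a real slice.

```python
import re

data = b'I am falling in code'
view = memoryview(data)[2:12]     # a zero-copy window onto data

# A bytes pattern can be applied to the view directly.
starts_with_am = re.compile(rb'^am').search(view)
# The space before 'am' lies outside the view, so the lookbehind fails.
after_space = re.compile(rb'(?<= )am').search(view)

print(starts_with_am is not None, after_space is None)
```

This gives slice semantics without the copy, but it does not help when the text has to stay a str.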

Answer 1
1 2015-07-22 12:21:30

Answer 2
0 2015-06-27 02:51:02
