Python re.sub returns binary characters

Question

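I am trying to repair a JSON feed using re.sub() regular expressions in Python. (I am also working with the feed provider to get it fixed at the source.) There are two patterns I need to repair: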

1.

      "milepost":       "
      "milepost":       "723.46

which are missing the closing quote, and

2.

},

}

which should not contain the comma. Note that there is no blank line between them; it is just "},\n}" (this editor is misbehaving...)

I have a short excerpt of the feed at: http://hardhat.ahmct.ucdavis.edu/tmp/test.txt

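Sample code is below. In it, I first test that the patterns are found and then do the substitutions. The match for #2 gives a somewhat odd result, and I can't see why: Brace matches found: [('}', '\r\n}')]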

The match for #1 looks fine.

The main problem is that when I run re.sub, the resulting string contains "\x01\x02". I have no idea where this comes from. Any suggestions would be much appreciated.

Sample code:

import urllib2
import json
import re

if __name__ == "__main__":
    # wget version of real feed:
    # url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.json"
    # Short text, for milepost and brace substitution test:
    url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.txt"
    request = urllib2.urlopen(url)
    rawResponse = request.read()
    # print("Raw response:")
    # print(rawResponse)

    # Find extra comma after end of records:
    p1 = re.compile('(}),(\r?\n *})')
    l1 = p1.findall(rawResponse)
    print("Brace matches found:")
    print(l1)

    # Check milepost:
    #p2 = re.compile('( *\"milepost\": *\")')
    p2 = re.compile('( *\"milepost\": *\")([0-9]*\.?[0-9]*)\r?\n')
    l2 = p2.findall(rawResponse)
    print("Milepost matches found:")
    print(l2)

    # Do brace substitutions:
    subst = "\1\2"
    response = re.sub(p1, subst, rawResponse)

    # Do milepost substitutions:
    subst = "\1\2\""
    response = re.sub(p2, subst, response)
    print(response)

Answer 1

You need to use raw strings; otherwise the Python string processor interprets "\1\2" as ASCII 01 ASCII 02 rather than backslash 1 backslash 2.

Instead of

subst = "\1\2"

use

subst = r"\1\2" # or subst = "\\1\\2"

The second substitution is a bit trickier, because it also contains a quote character:

subst = "\1\2\""

needs to become

subst = r'\1\2"' # or subst = "\\1\\2\""
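
To make the difference concrete, here is a minimal sketch in Python 3 syntax (the string-escape rules are the same in Python 2). The sample text and the name repair_feed are made up for illustration and are not taken from the real feed; the two patterns are the ones from the question, written as raw strings. It also shows why the "odd" findall result is expected: when a pattern has more than one group, findall returns a list of tuples of the groups rather than the whole match.

```python
import json
import re

# Hypothetical sample imitating both defects from the question:
# a "milepost" value with no closing quote, and a "}," directly before "}".
sample = (
    '{\r\n'
    '  "a": {\r\n'
    '    "milepost": "\r\n'
    '  },\r\n'
    '  "b": {\r\n'
    '    "milepost": "723.46\r\n'
    '  },\r\n'
    '}'
)

# Patterns from the question, as raw strings.
brace_pattern = re.compile(r'(}),(\r?\n *})')
milepost_pattern = re.compile(r'( *"milepost": *")([0-9]*\.?[0-9]*)\r?\n')

# Without the r prefix, "\1\2" is two control characters, not two backreferences.
print(repr("\1\2"))    # the bytes 0x01 and 0x02
print(repr(r"\1\2"))   # backslash, 1, backslash, 2

# With two groups in the pattern, findall returns tuples of (group 1, group 2).
print(brace_pattern.findall(sample))

# Non-raw replacement: the control characters are inserted literally.
broken = brace_pattern.sub("\1\2", sample)


def repair_feed(text):
    """Apply both repairs using raw-string replacement templates."""
    text = brace_pattern.sub(r"\1\2", text)       # drop the stray comma
    return milepost_pattern.sub(r'\1\2"', text)   # add the closing quote


fixed = repair_feed(sample)
print(json.loads(fixed))
```

Note that the milepost pattern consumes the line ending after the value and the replacement does not put it back. That is harmless for JSON, since the whitespace is insignificant there, but it does change the layout of the text.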


Answer 1: score 2, accepted, 2014-11-18 17:46:49
