Python re.sub returning binary characters

I'm trying to repair a JSON feed using re.sub() in Python (I'm also working with the feed provider to get it fixed). There are two kinds of broken text I need to fix:

1.

      "milepost":       "
      "milepost":       "723.46

which are missing an end quote, and

2.

    },

}

which shouldn't have the comma. Note that there is no blank line between them; it's just "},\n }" (trouble with this editor...)

I have a short snippet of the feed, located at: http://hardhat.ahmct.ucdavis.edu/tmp/test.txt

Sample code below. Here, I have tests for finding the patterns, and then for doing the replacements. The match for #2 gives some odd results, but I can't see why: Brace matches found: [('}', '\r\n }')]

The match for #1 seems good.

The main problem is that when I do the re.sub, my resulting string has "\x01\x02" in it. I have no clue where this is coming from. Any advice greatly appreciated.

Sample code:

import urllib2
import json
import re

if __name__ == "__main__":
    # wget version of real feed:
    # url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.json"
    # Short text, for milepost and brace substitution test:
    url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.txt"
    request = urllib2.urlopen(url)
    rawResponse = request.read()
    # print("Raw response:")
    # print(rawResponse)

    # Find extra comma after end of records:
    p1 = re.compile('(}),(\r?\n *})')
    l1 = p1.findall(rawResponse)
    print("Brace matches found:")
    print(l1)

    # Check milepost:
    #p2 = re.compile('( *\"milepost\": *\")')
    p2 = re.compile('( *\"milepost\": *\")([0-9]*\.?[0-9]*)\r?\n')
    l2 = p2.findall(rawResponse)
    print("Milepost matches found:")
    print(l2)

    # Do brace substitutions:
    subst = "\1\2"
    response = re.sub(p1, subst, rawResponse)

    # Do milepost substitutions:
    subst = "\1\2\""
    response = re.sub(p2, subst, response)
    print(response)

You need to use raw strings, or "\1\2" will be interpreted by the Python string processor as ASCII 01 ASCII 02 (the characters "\x01\x02") instead of backslash 1 backslash 2, which is what re.sub needs to see as group references.

Instead of

subst = "\1\2"

use

subst = r"\1\2" # or subst = "\\1\\2"
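To see the difference concretely, here is a minimal sketch. It uses Python 3 syntax (the question's code is Python 2, but the escaping rules are the same), and the `sample` string is a made-up stand-in for the feed, not the real data:

```python
import re

plain = "\1\2"   # octal escapes: the two control characters chr(1) and chr(2)
raw = r"\1\2"    # four characters: backslash, '1', backslash, '2'

# Brace pattern from the question, written as a raw string.
pattern = re.compile(r"(}),(\r?\n *})")
sample = "  },\r\n }"  # assumed stand-in for the end of a record

broken = pattern.sub(plain, sample)  # inserts the control characters literally
fixed = pattern.sub(raw, sample)     # re.sub expands \1 and \2 as group references

print(repr(plain), len(plain))  # '\x01\x02' 2
print(repr(raw), len(raw))      # '\\1\\2' 4
print(repr(broken))             # '  \x01\x02'
print(repr(fixed))              # '  }\r\n }'
```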

Things get a bit trickier with the second replacement:

subst = "\1\2\""

needs to become

subst = r'\1\2"' # or subst = "\\1\\2\""
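Putting both fixes together, a rough end-to-end sketch might look like the following. It is written for Python 3, reuses the two patterns from the question as raw strings, and runs on a made-up `sample` string rather than the real feed. The names `repair_feed`, `BRACE_RE`, `MILEPOST_RE` and the keys in `sample` are hypothetical.

```python
import json
import re

# Patterns from the question, written as raw strings.
BRACE_RE = re.compile(r'(}),(\r?\n *})')
MILEPOST_RE = re.compile(r'( *"milepost": *")([0-9]*\.?[0-9]*)\r?\n')


def repair_feed(text):
    """Drop the comma before a closing brace and close unterminated mileposts."""
    text = BRACE_RE.sub(r'\1\2', text)       # "},\n }"  ->  "}\n }"
    text = MILEPOST_RE.sub(r'\1\2"', text)   # add the missing end quote
    return text


# Assumed sample that imitates both defects described in the question.
sample = (
    '{\r\n'
    '  "feed": {\r\n'
    '    "record": {\r\n'
    '      "milepost": "723.46\r\n'
    '    },\r\n'
    '    "other": {\r\n'
    '      "milepost": "\r\n'
    '    },\r\n'
    '  }\r\n'
    '}\r\n'
)

repaired = repair_feed(sample)
data = json.loads(repaired)  # parses only if both repairs worked
print(data)
```

Note that the milepost pattern consumes the line break after the value, so the repaired text loses that newline. JSON does not care about whitespace there, so it still parses.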
