I'm trying to repair a JSON feed using regular expressions with re.sub() in Python. (I'm also working with the feed provider to fix it.) I have two patterns to fix:
1. Milepost values that are missing the closing quote:
"milepost": "
"milepost": "723.46
2. A closing brace followed by a comma that shouldn't be there:
},
}
Note that there is no blank line between the two braces; it's just "},\n }" (I had trouble showing this in the editor).
I have a short snippet of the feed, located at: http://hardhat.ahmct.ucdavis.edu/tmp/test.txt
Sample code is below. It first tests finding the patterns, and then does the replacements. The match for #2 gives some odd results, but I can't see why: Brace matches found: [('}', '\r\n }')]
The match for #1 seems good.
The main problem is that when I do the re.sub, my resulting string has "\x01\x02" in it. I have no clue where this is coming from. Any advice is greatly appreciated.
Sample code:
import urllib2
import json
import re

if __name__ == "__main__":
    # wget version of real feed:
    # url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.json"
    # Short text, for milepost and brace substitution test:
    url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.txt"
    request = urllib2.urlopen(url)
    rawResponse = request.read()
    # print("Raw response:")
    # print(rawResponse)

    # Find extra comma after end of records:
    p1 = re.compile('(}),(\r?\n *})')
    l1 = p1.findall(rawResponse)
    print("Brace matches found:")
    print(l1)

    # Check milepost:
    #p2 = re.compile('( *\"milepost\": *\")')
    p2 = re.compile('( *\"milepost\": *\")([0-9]*\.?[0-9]*)\r?\n')
    l2 = p2.findall(rawResponse)
    print("Milepost matches found:")
    print(l2)

    # Do brace substitutions:
    subst = "\1\2"
    response = re.sub(p1, subst, rawResponse)

    # Do milepost substitutions:
    subst = "\1\2\""
    response = re.sub(p2, subst, response)
    print(response)
You need to use raw strings; otherwise "\1\2" will be interpreted by the Python string processor as ASCII 01 ASCII 02 instead of backslash 1 backslash 2.
Instead of
subst = "\1\2"
use
subst = r"\1\2" # or subst = "\\1\\2"
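To see the difference concretely, here is a small sketch that compares the two spellings and applies each to the brace pattern from the question. The names plain, raw, sample, broken and fixed are made up for illustration, and the sample string is a stand-in for the real feed.

```python
import re

plain = "\1\2"    # octal escapes: the two control characters \x01 and \x02
raw = r"\1\2"     # four characters: backslash, 1, backslash, 2

print(repr(plain))            # '\x01\x02'
print(len(plain), len(raw))   # 2 4

# Brace pattern from the question, written as a raw string.
pattern = re.compile(r"(}),(\r?\n *})")
sample = "},\n }"  # stand-in for the trailing-comma case in the feed

# With the plain string, re.sub inserts the control characters literally.
broken = pattern.sub(plain, sample)
# With the raw string, re.sub sees group references and drops only the comma.
fixed = pattern.sub(raw, sample)

print(repr(broken))
print(repr(fixed))
```

This also explains the "\x01\x02" in the output: the whole match is replaced by those two characters because the replacement string never contained a backslash by the time re.sub received it.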
Things get a bit trickier with the second replacement:
subst = "\1\2\""
needs to become
subst = r'\1\2"' # or subst = "\\1\\2\""
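Putting both replacements together, a minimal sketch might look like the following. The repair_feed function and the raw_feed sample are hypothetical; the sample only imitates the two defects described in the question, since the real feed is not reproduced here. The patterns are the ones from the question, written as raw strings.

```python
import json
import re

# Patterns from the question, as raw strings.
BRACE_RE = re.compile(r"(}),(\r?\n *})")
MILEPOST_RE = re.compile(r'( *"milepost": *")([0-9]*\.?[0-9]*)\r?\n')


def repair_feed(text):
    """Hypothetical helper: remove the stray comma, then close the open quotes."""
    text = BRACE_RE.sub(r"\1\2", text)
    # The milepost pattern consumes the line break, so the repaired value
    # ends up on the same line as whatever followed it.
    text = MILEPOST_RE.sub(r'\1\2"', text)
    return text


# Made-up sample with both defects (not the real feed).
raw_feed = (
    '{\r\n'
    ' "a": {\r\n'
    '  "milepost": "723.46\r\n'
    ' },\r\n'
    ' "b": {\r\n'
    '  "milepost": "\r\n'
    ' },\r\n'
    '}\r\n'
)

repaired = repair_feed(raw_feed)
data = json.loads(repaired)
print(data["a"]["milepost"])
print(repr(data["b"]["milepost"]))
```

Parsing the result with json.loads is a quick way to check that no control characters or stray commas remain, because either would raise an error.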