Trying to correct an improperly formatted JSON string using python

I'm trying to use any combination of the Python "re" library and Python slicing to correct this improperly formatted JSON string that Kafka is giving us on HDFS, using Cloudera's Hadoop distribution.

Incorrect JSON:

{"json_data":"{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":"            99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":"  ","COL29":"PBU67H   ","COL30":"            20000","COL31":2,"COL32":null}}"}

NOTE: the double quote near the beginning, in "json_data":"{, and the double quote near the end, in null}}"}, are actually the only things wrong and need to be removed (I've tested it without the extra quotes).

Valid and correct JSON:

{"json_data":{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":"            99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":"  ","COL29":"PBU67H   ","COL30":"            20000","COL31":2,"COL32":null}}}

I have between 40,000 and 60,000 records I would need to read through per hour using Pyspark, and the Infrastructure team says it's on me to fix.

Is there a quick and dirty way, using Python, to read all the strings and remove the double quotes near the beginning and near the end?
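For illustration, here is a quick-and-dirty sketch of the plain string-replacement route the question hints at. It assumes the stray quotes always sit in exactly the two spots described in the NOTE above; the helper name strip_wrapper_quotes is made up for this example.

import json

def strip_wrapper_quotes(record):
    # Drop the stray quote that opens the nested object right after "json_data":
    fixed = record.replace('"json_data":"{', '"json_data":{', 1)
    # Drop the stray quote that sits just before the record's final closing brace.
    if fixed.endswith('"}'):
        fixed = fixed[:-2] + '}'
    return fixed

# Usage: json.loads(strip_wrapper_quotes(bad_record)) should now parse cleanly.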

For the string offered, I do suggest you stick with re and a regex such as:

'(?<=:|\})(")(?=\}|\{)'

That should do the trick, since the double quotes that are not needed follow a closing bracket or a colon, and precede an opening or closing bracket.

import re
import json

string = '{"json_data":"{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":"            99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":"  ","COL29":"PBU67H   ","COL30":"            20000","COL31":2,"COL32":null}"}}'

# Remove any double quote that follows a colon or closing brace and precedes
# an opening or closing brace; a raw string avoids escape-sequence warnings.
trimmed_string = re.sub(r'(?<=:|\})(")(?=\}|\{)', '', string)

data = json.loads(trimmed_string)

Results:

{'json_data': {'table': 'TEST.FUBAR', 'op_type': 'I', 'op_ts': '2019-03-14 15:33:50.031848', 'current_ts': '2019-03-14T15:33:57.479002', 'pos': '1111', 'after': {'COL1': 949494949494949494, 'COL2': 99, 'COL3': 2, 'COL4': '            99999', 'COL5': 9999999, 'COL6': 90, 'COL7': 42478, 'COL8': 'I', 'COL9': None, 'COL10': '2019-03-14 15:33:49', 'COL11': None, 'COL12': None, 'COL13': None, 'COL14': 'x222263 ', 'COL15': '2019-03-14 15:33:49', 'COL16': 'x222263 ', 'COL17': '2019-03-14 15:33:49', 'COL18': '2020-09-10 00:00:00', 'COL19': 'A', 'COL20': 'A', 'COL21': 0, 'COL22': None, 'COL23': '2019-03-14 15:33:47', 'COL24': 2, 'COL25': 2, 'COL26': 'R', 'COL27': '2019-03-14 15:33:49', 'COL28': '  ', 'COL29': 'PBU67H   ', 'COL30': '            20000', 'COL31': 2, 'COL32': None}}}
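Since the question mentions churning through 40,000 to 60,000 records per hour in Pyspark, here is a minimal sketch of applying the same substitution to every record with a UDF. The DataFrame df, the column name json_str, and the HDFS path are assumptions for illustration only.

import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def fix_record(raw):
    # Same substitution as above, guarded against null rows.
    if raw is None:
        return None
    return re.sub(r'(?<=:|\})(")(?=\}|\{)', '', raw)

fix_record_udf = udf(fix_record, StringType())

# Assumed layout: one malformed JSON record per line on HDFS.
# df = spark.read.text("/path/on/hdfs").withColumnRenamed("value", "json_str")
# fixed_df = df.withColumn("json_str", fix_record_udf("json_str"))
# The repaired column can then be parsed, e.g. with from_json and an explicit schema.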
