Python REGEX: remove $ and Curly Brackets from string elements if string element is prefixed by $
In a Python notebook I have a string that I would like to parse in a particular way, and I just can't figure out the necessary regex. This is not important, but the string was previously a complex nested dictionary, derived from transforming an Oozie workflow XML into a Python dictionary via the json.dump() method.
'{"workflow-app": {"@xmlns": "uri:oozie:workflow:0.4", "@name": "simple-Workflow", "start": {"@to": "Create_External_Table"}, "action": [{"@name": "Create_External_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "${xyz.com:8088}", "name-node": "${hdfs://rootname}", "script": "${hdfs_path_of_script/external.hive}"}, "ok": {"@to": "Create_orc_Table"}, "error": {"@to": "kill_job"}}, {"@name": "Create_orc_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "${xyz.com:8088}", "name-node": "${hdfs://rootname}", "script": "${hdfs_path_of_script/orc.hive}"}, "ok": {"@to": "Insert_into_Table"}, "error": {"@to": "kill_job"}}, {"@name": "Insert_into_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "${xyz.com:8088}", "name-node": "${hdfs://rootname}", "script": "${hdfs_path_of_script/Copydata.hive}", "param": "${database_name}"}, "ok": {"@to": "end"}, "error": {"@to": "kill_job"}}], "kill": {"@name": "kill_job", "message": "Job failed"}, "end": {"@name": "end"}}}'
In either case, you'll notice that some of the elements in the string are prefixed by a dollar sign, for example "${xyz.com:8088}", "${hdfs_path_of_script/external.hive}" and a couple more.
Other elements are wrapped in curly braces as well, but for those and only those elements that are prefixed by a dollar sign, I want to remove the dollar sign prefix and the curly braces that immediately wrap the element.
In the above two examples, I would like to obtain "xyz.com:8088" and "hdfs_path_of_script/external.hive". This is what the string would ultimately look like:
'{"workflow-app": {"@xmlns": "uri:oozie:workflow:0.4", "@name": "simple-Workflow", "start": {"@to": "Create_External_Table"}, "action": [{"@name": "Create_External_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "xyz.com:8088", "name-node": "hdfs://rootname", "script": "hdfs_path_of_script/external.hive"}, "ok": {"@to": "Create_orc_Table"}, "error": {"@to": "kill_job"}}, {"@name": "Create_orc_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "xyz.com:8088", "name-node": "hdfs://rootname", "script": "hdfs_path_of_script/orc.hive"}, "ok": {"@to": "Insert_into_Table"}, "error": {"@to": "kill_job"}}, {"@name": "Insert_into_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "xyz.com:8088", "name-node": "hdfs://rootname", "script": "hdfs_path_of_script/Copydata.hive", "param": "database_name"}, "ok": {"@to": "end"}, "error": {"@to": "kill_job"}}], "kill": {"@name": "kill_job", "message": "Job failed"}, "end": {"@name": "end"}}}'
Would someone please help me parse this thing? I am using Python 3.7, if it matters.
You can use recursion to traverse the dictionary and change the appropriate values:
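import re
import json

# Match "${...}" and capture the text between the braces (non-greedy,
# so several placeholders in one string are handled separately)
pat = re.compile(r"\$\{(.*?)\}")

def transform(d):
    # Walk dicts and lists recursively and rewrite string values in place
    if isinstance(d, dict):
        for k, v in d.items():
            if isinstance(v, str):
                d[k] = pat.sub(r"\1", v)
            else:
                transform(v)
    elif isinstance(d, list):
        for i, v in enumerate(d):
            if isinstance(v, str):
                d[i] = pat.sub(r"\1", v)
            else:
                transform(v)
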
s = '{"workflow-app": {"@xmlns": "uri:oozie:workflow:0.4", "@name": "simple-Workflow", "start": {"@to": "Create_External_Table"}, "action": [{"@name": "Create_External_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "${xyz.com:8088}", "name-node": "${hdfs://rootname}", "script": "${hdfs_path_of_script/external.hive}"}, "ok": {"@to": "Create_orc_Table"}, "error": {"@to": "kill_job"}}, {"@name": "Create_orc_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "${xyz.com:8088}", "name-node": "${hdfs://rootname}", "script": "${hdfs_path_of_script/orc.hive}"}, "ok": {"@to": "Insert_into_Table"}, "error": {"@to": "kill_job"}}, {"@name": "Insert_into_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "${xyz.com:8088}", "name-node": "${hdfs://rootname}", "script": "${hdfs_path_of_script/Copydata.hive}", "param": "${database_name}"}, "ok": {"@to": "end"}, "error": {"@to": "kill_job"}}], "kill": {"@name": "kill_job", "message": "Job failed"}, "end": {"@name": "end"}}}'
data = json.loads(s)
transform(data)
print(json.dumps(data, indent=4))
Prints:
{
    "workflow-app": {
        "@xmlns": "uri:oozie:workflow:0.4",
        "@name": "simple-Workflow",
        "start": {
            "@to": "Create_External_Table"
        },
        "action": [
            {
                "@name": "Create_External_Table",
                "hive": {
                    "@xmlns": "uri:oozie:hive-action:0.4",
                    "job-tracker": "xyz.com:8088",
                    "name-node": "hdfs://rootname",
                    "script": "hdfs_path_of_script/external.hive"
                },
                "ok": {
                    "@to": "Create_orc_Table"
                },
                "error": {
                    "@to": "kill_job"
                }
            },
            {
                "@name": "Create_orc_Table",
                "hive": {
                    "@xmlns": "uri:oozie:hive-action:0.4",
                    "job-tracker": "xyz.com:8088",
                    "name-node": "hdfs://rootname",
                    "script": "hdfs_path_of_script/orc.hive"
                },
                "ok": {
                    "@to": "Insert_into_Table"
                },
                "error": {
                    "@to": "kill_job"
                }
            },
            {
                "@name": "Insert_into_Table",
                "hive": {
                    "@xmlns": "uri:oozie:hive-action:0.4",
                    "job-tracker": "xyz.com:8088",
                    "name-node": "hdfs://rootname",
                    "script": "hdfs_path_of_script/Copydata.hive",
                    "param": "database_name"
                },
                "ok": {
                    "@to": "end"
                },
                "error": {
                    "@to": "kill_job"
                }
            }
        ],
        "kill": {
            "@name": "kill_job",
            "message": "Job failed"
        },
        "end": {
            "@name": "end"
        }
    }
}
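As a variation on the recursive idea, here is a minimal sketch that builds and returns a new structure rather than modifying the data in place; strings that sit directly inside lists are covered too. The function name `strip_placeholders` and the small sample dictionary are made up for illustration and are not part of the original workflow. Only values are rewritten, keys are left untouched.

```python
import re

# Matches "${...}" and captures the text between the braces (non-greedy).
PLACEHOLDER = re.compile(r"\$\{(.*?)\}")


def strip_placeholders(obj):
    """Return a copy of obj with the ${...} wrapper removed from string values."""
    if isinstance(obj, dict):
        return {k: strip_placeholders(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [strip_placeholders(v) for v in obj]
    if isinstance(obj, str):
        return PLACEHOLDER.sub(r"\1", obj)
    return obj  # numbers, booleans and None are returned unchanged


# Hypothetical, shortened sample data
sample = {
    "hive": {"job-tracker": "${xyz.com:8088}", "param": ["${database_name}", 1]},
    "ok": {"@to": "end"},
}
cleaned = strip_placeholders(sample)
print(cleaned)
```

Because the function returns a new object, the original `sample` stays available for comparison, which can be handy when checking the result in a notebook.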
I'd probably load with json and process the data, but this regex does what you want:
import re
# your original JSON
ins = '{"workflow-app": {"@xmlns": "uri:oozie:workflow:0.4", "@name": "simple-Workflow", "start": {"@to": "Create_External_Table"}, "action": [{"@name": "Create_External_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "${xyz.com:8088}", "name-node": "${hdfs://rootname}", "script": "${hdfs_path_of_script/external.hive}"}, "ok": {"@to": "Create_orc_Table"}, "error": {"@to": "kill_job"}}, {"@name": "Create_orc_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "${xyz.com:8088}", "name-node": "${hdfs://rootname}", "script": "${hdfs_path_of_script/orc.hive}"}, "ok": {"@to": "Insert_into_Table"}, "error": {"@to": "kill_job"}}, {"@name": "Insert_into_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "${xyz.com:8088}", "name-node": "${hdfs://rootname}", "script": "${hdfs_path_of_script/Copydata.hive}", "param": "${database_name}"}, "ok": {"@to": "end"}, "error": {"@to": "kill_job"}}], "kill": {"@name": "kill_job", "message": "Job failed"}, "end": {"@name": "end"}}}'
# this is your expected output string
outs = '{"workflow-app": {"@xmlns": "uri:oozie:workflow:0.4", "@name": "simple-Workflow", "start": {"@to": "Create_External_Table"}, "action": [{"@name": "Create_External_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "xyz.com:8088", "name-node": "hdfs://rootname", "script": "hdfs_path_of_script/external.hive"}, "ok": {"@to": "Create_orc_Table"}, "error": {"@to": "kill_job"}}, {"@name": "Create_orc_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "xyz.com:8088", "name-node": "hdfs://rootname", "script": "hdfs_path_of_script/orc.hive"}, "ok": {"@to": "Insert_into_Table"}, "error": {"@to": "kill_job"}}, {"@name": "Insert_into_Table", "hive": {"@xmlns": "uri:oozie:hive-action:0.4", "job-tracker": "xyz.com:8088", "name-node": "hdfs://rootname", "script": "hdfs_path_of_script/Copydata.hive", "param": "database_name"}, "ok": {"@to": "end"}, "error": {"@to": "kill_job"}}], "kill": {"@name": "kill_job", "message": "Job failed"}, "end": {"@name": "end"}}}'
# replace strings that...
# * start with a "
# * then has '${'
# * capture non-greedy arbitrary number of characters with (.*?)
# * then has '}'
# * then ends with "
# Replace it with the capture in \1 and surround with quotes
subbed = re.sub(r'"\${(.*?)}"', r'"\1"', ins)
print(subbed == outs)
# this outputs True
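To illustrate why the non-greedy `(.*?)` matters in the pattern above, here is a small sketch comparing it with a greedy `(.*)`. The input is a shortened, made-up string with two placeholders, not the full workflow string.

```python
import re

# Shortened, hypothetical input with two ${...} values in one string
text = '{"job-tracker": "${xyz.com:8088}", "param": "${database_name}"}'

# Non-greedy: each "${...}" is matched and unwrapped on its own
non_greedy = re.sub(r'"\$\{(.*?)\}"', r'"\1"', text)

# Greedy: one match runs from the first "${ to the last }"
greedy = re.sub(r'"\$\{(.*)\}"', r'"\1"', text)

print(non_greedy)
print(greedy)
```

With the greedy version only the outermost `"${` and `}"` are removed, so everything between the first and last placeholder is left as it was and the result is no longer the intended JSON.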