Transforming XML into JSON loadable structure for BigQuery
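I'm learning Python on the job and need help improving my solution.
I need to load XML data into BigQuery.
I have it working, but I'm not sure I've done it in a sensible way.
I call an API that returns an XML structure. I parse the XML with ElementTree and use tree.iter() to return the tags and text from the XML. Printing my tags and text: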
for node in tree.iter():
    print(f'{node.tag}, {node.text}')
Returns:
Tag Text
Responses None
Response None
ResponseId 393
ResponseText Please respond “Has this loaded”
ResponseType single
ResponseStatus 0
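The Responses tag occurs only once per API call, but Response through ResponseStatus is a repeating group, and ResponseId is the key for each group. Each call returns fewer than 100 repeating groups.
A key, Response_key, is returned in the header; it is the parent of all the ResponseIds. My goal is to take this data, convert it to JSON and stream it to BigQuery.
The table structure I need is:
ResponseKey, ResponseID, Response, ResponseText, ResponseType, ResponseStatus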
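To make the target structure concrete, here is a minimal sketch of building one row per Response group with the parent key attached. The sample XML, the `build_rows` helper and the `"example-key"` value are assumptions for illustration; the real key comes from the API header. The `Response` column is left out because the question does not say what it should hold.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample payload; the real XML comes from the API call.
SAMPLE_XML = (
    "<Responses>"
    "<Response><ResponseId>393</ResponseId><ResponseText>Has this loaded</ResponseText>"
    "<ResponseType>single</ResponseType><ResponseStatus>0</ResponseStatus></Response>"
    "<Response><ResponseId>394</ResponseId><ResponseText>Connection made</ResponseText>"
    "<ResponseType>single</ResponseType><ResponseStatus>0</ResponseStatus></Response>"
    "</Responses>"
)


def build_rows(xml_text, response_key):
    """Return one dict per <Response> element, tagged with the parent key."""
    root = ET.fromstring(xml_text)
    rows = []
    for response in root.findall("Response"):
        rows.append({
            "ResponseKey": response_key,
            # findtext returns None when the child tag is missing
            "ResponseID": response.findtext("ResponseId"),
            "ResponseText": response.findtext("ResponseText"),
            "ResponseType": response.findtext("ResponseType"),
            "ResponseStatus": response.findtext("ResponseStatus"),
        })
    return rows


# "example-key" is a placeholder for the Response_key from the header.
rows = build_rows(SAMPLE_XML, response_key="example-key")
```

Each dict in `rows` then corresponds to one table row, and the list as a whole can be serialized with `json.dumps`.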
The approach I used is:
Create a list using a tree.iter() loop:
node_list = []
for node in tree.iter():
    node_list.append(node.tag)
    node_list.append(node.text)
Group the list using itertools (I found this a difficult step):
r = 'Response'
response_split = [list(y) for x, y in itertools.groupby(node_list, lambda z: z == r) if not x]
Returns:
[['Responses', 'None'], ['None', 'ResponseId', '393', 'ResponseText', 'Please respond “Has this loaded”', 'ResponseType', 'single', 'ResponseStatus', '0'], ['None', 'ResponseId', '394', 'ResponseText', 'Please confirm “Connection made”', 'ResponseType', 'single', 'ResponseStatus', '0']]
It works, but I'm not sure it's sensible.
Any suggested improvements would be much appreciated.
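For reference, the flat lists produced by the groupby step can be paired back into one dict per response. This is a minimal sketch on a shortened, made-up `node_list`; it assumes every run after the first starts with the Response element's own text (None) followed by alternating tag/text values.

```python
import itertools

# Flat tag/text list in the shape produced by the tree.iter() loop (sample values).
node_list = [
    "Responses", None,
    "Response", None, "ResponseId", "393", "ResponseType", "single",
    "Response", None, "ResponseId", "394", "ResponseType", "single",
]

# Split on the "Response" separator, keeping only the non-separator runs.
groups = [
    list(run)
    for is_separator, run in itertools.groupby(node_list, lambda item: item == "Response")
    if not is_separator
]

# groups[0] is the root's ["Responses", None]. Each later run starts with the
# <Response> element's own text (None), followed by tag/text pairs.
records = [dict(zip(run[1::2], run[2::2])) for run in groups[1:]]
```

One weakness of this approach is that any text value equal to "Response" would be treated as a separator, so iterating over the elements themselves is probably more robust.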
Here is a sample of the XML:
{"GetResponses":"<Responses><Response><ResponseId>393938<\/ResponseId><ResponseText>Please respond to the following statement:\"The assigned task was easy to complete\"<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393939<\/ResponseId><ResponseText>Did you save your datafor later? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393940<\/ResponseId><ResponseText>Did you notice how much it cost to find the item? How much was it?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393941<\/ResponseId><ResponseText>Did you select ‘signature on form’? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393942<\/ResponseId><ResponseText>Was it easy to find thethe new page? Why\/why not?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393943<\/ResponseId><ResponseText>Please enter your email. 
So that we can track your responses, we need you to provide this for each task.<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393944<\/ResponseId><ResponseText>Why didn't you save your datafor later?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393945<\/ResponseId><ResponseText>Why did you save your datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393946<\/ResponseId><ResponseText>Did you save your datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393947<\/ResponseId><ResponseText>Why didn't you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393948<\/ResponseId><ResponseText>Why did you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>4444449<\/ResponseId><ResponseText>Did you select ‘signature on form’?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393950<\/ResponseId><ResponseText>Why wasn't it easy to find thethe new page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393951<\/ResponseId><ResponseText>Was it easy to find thethe new 
page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393952<\/ResponseId><ResponseText>Please enter your email addressSo that we can track your responses, we need you to provide this for each task<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><\/Responses>"}
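Since the sample shows the XML carried as a string inside a JSON object, one way to get at it is to decode the JSON first and then parse the string value. The sketch below uses a shortened stand-in payload; the `GetResponses` key is taken from the sample, everything else is illustrative.

```python
import json
import xml.etree.ElementTree as ET

# Shortened stand-in for the API payload: XML carried as a JSON string value.
payload = (
    r'{"GetResponses":"<Responses><Response><ResponseId>393938<\/ResponseId>'
    r'<ResponseType>single<\/ResponseType><\/Response><\/Responses>"}'
)

# json.loads turns the escaped "<\/" sequences back into "</".
xml_text = json.loads(payload)["GetResponses"]
root = ET.fromstring(xml_text)

# Collect the key of each repeating group.
response_ids = [response.findtext("ResponseId") for response in root.findall("Response")]
```

If the real payload contains raw line breaks inside the string, as the sample appears to, `json.loads(payload, strict=False)` may be needed, because the default strict mode rejects control characters inside strings.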
Sample JSON without all the processing steps:
node_list = []
for node in tree.iter():
    node_list.append(node.tag)
    node_list.append(node.text)
json_format = json.dumps(node_list)
print(json_format)
["Responses", null, "Response", null, "ResponseId", "393938", "ResponseText", "Please respond to the following statement:\"The assigned task was easy to complete\"", "ResponseType", "single", "ResponseStatus", "0", "ExtendedType", "0"]
I'm not sure what the desired output is, but here is one approach:
import xml.etree.ElementTree as ET
import json

p = r"d:\tmp.xml"
tree = ET.parse(p)
root = tree.getroot()

# Top-level dict: the root tag plus a list holding one dict per <Response>
json_dict = {}
json_dict[root.tag] = root.text
json_dict['response_list'] = []
for node in root:
    # Map each child tag of this <Response> to its text
    tmp_dict = {}
    for response_info in node:
        tmp_dict[response_info.tag] = response_info.text
    json_dict['response_list'].append(tmp_dict)

with open(r'd:\out.json', 'w') as of:
    json.dump(json_dict, of)
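If the file is meant for a BigQuery load job, note that, as far as I know, BigQuery expects newline-delimited JSON (one object per line) rather than a single nested document. A minimal sketch of that variant, using a small made-up document with the same element names as the sample:

```python
import json
import xml.etree.ElementTree as ET

# Small stand-in document; element names follow the sample in the question.
root = ET.fromstring(
    "<Responses>"
    "<Response><ResponseId>393938</ResponseId><ResponseStatus>0</ResponseStatus></Response>"
    "<Response><ResponseId>393939</ResponseId><ResponseStatus>1</ResponseStatus></Response>"
    "</Responses>"
)

# One flat dict per <Response>, built from whatever child tags are present.
rows = [{child.tag: child.text for child in response} for response in root]

# Newline-delimited JSON: one serialized object per line.
ndjson = "\n".join(json.dumps(row) for row in rows)
print(ndjson)
```

Because the dicts are built from whatever tags each Response contains, optional tags such as ResponseShortCode will appear only in some rows, so the table schema would need to treat them as nullable.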