Transforming XML into JSON loadable structure for BigQuery
I'm learning Python on the job and need help improving my solution.
I need to load XML data into BigQuery.
I have it working, but I'm not sure whether I have done it in a sensible way.
I call an API that returns an XML structure. I use ElementTree to parse the XML and tree.iter() to return the tags and text from the XML. Printing my tags and text with:
for node in tree.iter():
    print(f'{node.tag}, {node.text}')
Returns:
Tag Text
Responses None
Response None
ResponseId 393
ResponseText Please respond “Has this loaded”
ResponseType single
ResponseStatus 0
The Responses tag appears only once per API call, but Response through ResponseStatus are repeating groups, and ResponseId is the key for each group. Each call returns fewer than 100 repeating groups.
There is a key returned in the header, Response_key, that is the parent of all ResponseIds. My aim is to take this data, convert it to JSON, and stream it to BigQuery.
The table structure I need is:
ResponseKey, ResponseID, Response, ResponseText, ResponseType, ResponseStatus
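One way to get rows shaped like this table is to walk each Response element directly rather than flattening every node into a single list. The following is only a minimal sketch: SAMPLE_XML, build_rows and the key value "abc123" are made up for illustration, the column names are taken straight from the XML tags, and how the real Response_key is read from the header is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the API output described above.
SAMPLE_XML = (
    "<Responses>"
    "<Response><ResponseId>393</ResponseId>"
    "<ResponseText>Has this loaded</ResponseText>"
    "<ResponseType>single</ResponseType>"
    "<ResponseStatus>0</ResponseStatus></Response>"
    "<Response><ResponseId>394</ResponseId>"
    "<ResponseText>Connection made</ResponseText>"
    "<ResponseType>single</ResponseType>"
    "<ResponseStatus>0</ResponseStatus></Response>"
    "</Responses>"
)


def build_rows(xml_text, response_key):
    """Return one dict per <Response> element, tagged with the parent key."""
    root = ET.fromstring(xml_text)
    rows = []
    # iter("Response") matches only elements whose tag is exactly "Response",
    # so the <Responses> root itself is skipped.
    for response in root.iter("Response"):
        row = {"ResponseKey": response_key}
        for child in response:
            row[child.tag] = child.text
        rows.append(row)
    return rows


# "abc123" stands in for the Response_key value returned in the header.
rows = build_rows(SAMPLE_XML, "abc123")
```

Each dict then holds one repeating group, so no separate grouping step is needed.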
The approach I use is:
Use tree.iter() to loop and create a list:
node_list = []
for node in tree.iter():
    node_list.append(node.tag)
    node_list.append(node.text)
Use itertools to group the list (I found this a difficult step):
r = 'Response'
response_split = [list(y) for x, y in itertools.groupby(node_list, lambda z: z == r) if not x]
which returns:
[['Responses', 'None'], ['None', 'ResponseId', '393', 'ResponseText', 'Please respond "Has this loaded"', 'ResponseType', 'single', 'ResponseStatus', '0'], ['None', 'ResponseId', '394', 'ResponseText', 'Please confirm "Connection made"', 'ResponseType', 'single', 'ResponseStatus', '0']]
It works, but I'm not sure whether it is sensible.
Any suggested improvements would be appreciated.
Here is a sample of the XML:
{"GetResponses":"<Responses><Response><ResponseId>393938<\/ResponseId><ResponseText>Please respond to the following statement:\"The assigned task was easy to complete\"<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393939<\/ResponseId><ResponseText>Did you save your datafor later? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393940<\/ResponseId><ResponseText>Did you notice how much it cost to find the item? How much was it?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393941<\/ResponseId><ResponseText>Did you select ‘signature on form’? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393942<\/ResponseId><ResponseText>Was it easy to find thethe new page? Why\/why not?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393943<\/ResponseId><ResponseText>Please enter your email. 
So that we can track your responses, we need you to provide this for each task.<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393944<\/ResponseId><ResponseText>Why didn't you save your datafor later?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393945<\/ResponseId><ResponseText>Why did you save your datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393946<\/ResponseId><ResponseText>Did you save your datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393947<\/ResponseId><ResponseText>Why didn't you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393948<\/ResponseId><ResponseText>Why did you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>4444449<\/ResponseId><ResponseText>Did you select ‘signature on form’?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393950<\/ResponseId><ResponseText>Why wasn't it easy to find thethe new page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393951<\/ResponseId><ResponseText>Was it easy to find thethe new 
page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393952<\/ResponseId><ResponseText>Please enter your email addressSo that we can track your responses, we need you to provide this for each task<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><\/Responses>"}
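Note that the sample is a JSON object whose GetResponses value holds the XML as an escaped string (every closing tag appears as <\/...>). A rough sketch of unwrapping it, assuming the API body always has this shape; the shortened payload below is made up for illustration:

```python
import json
import xml.etree.ElementTree as ET

# Shortened, hypothetical payload in the same shape as the sample above.
payload = (
    r'{"GetResponses":"<Responses><Response>'
    r'<ResponseId>393938<\/ResponseId>'
    r'<ResponseType>single<\/ResponseType>'
    r'<\/Response><\/Responses>"}'
)

# json.loads turns the escaped "\/" sequences back into plain "/".
xml_text = json.loads(payload)["GetResponses"]
root = ET.fromstring(xml_text)

# findtext returns None when a child tag is absent, which matters for
# optional tags such as ResponseShortCode.
ids = [r.findtext("ResponseId") for r in root.findall("Response")]
short_codes = [r.findtext("ResponseShortCode") for r in root.findall("Response")]
```

Decoding the JSON first avoids having to strip the backslashes by hand before parsing.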
A sample JSON without all the processing steps:
node_list = []
for node in tree.iter():
    node_list.append(node.tag)
    node_list.append(node.text)
json_format = json.dumps(node_list)
print(json_format)
["Responses", null, "Response", null, "ResponseId", "393938", "ResponseText", "Please respond to the following statement:\"The assigned task was easy to complete\"", "ResponseType", "single", "ResponseStatus", "0", "ExtendedType", "0"]
I'm not sure what the required output is, but this is one way of doing it:
import xml.etree.ElementTree as ET
import json

p = r"d:\tmp.xml"
tree = ET.parse(p)
root = tree.getroot()

json_dict = {}
json_dict[root.tag] = root.text
json_dict['response_list'] = []

# Each child of the root is one <Response> group.
for node in root:
    tmp_dict = {}
    # Map every tag inside the group to its text.
    for response_info in node:
        tmp_dict[response_info.tag] = response_info.text
    json_dict['response_list'].append(tmp_dict)

with open(r'd:\out.json', 'w') as of:
    json.dump(json_dict, of)
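If the goal is a BigQuery load job rather than a single JSON document, keep in mind that BigQuery expects newline-delimited JSON, with one object per line. A small sketch of that conversion follows; to_ndjson is a hypothetical helper and response_list here is made-up data in the same shape as json_dict['response_list'] above.

```python
import json


def to_ndjson(rows):
    """Serialize a list of dicts as newline-delimited JSON, one row per line."""
    # ensure_ascii=False keeps characters such as curly quotes readable.
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Made-up rows standing in for json_dict['response_list'].
response_list = [
    {"ResponseId": "393938", "ResponseType": "single"},
    {"ResponseId": "393939", "ResponseType": "text"},
]

ndjson = to_ndjson(response_list)
```

For streaming instead of a file load, the same list of dicts could be passed to the google-cloud-bigquery client's insert_rows_json method, assuming that client library is what you are using.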