Transforming XML into JSON loadable structure for BigQuery

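I'm learning Python on the job and need help improving my solution.

I need to load XML data into BigQuery.

I have it working, but I'm not sure I did it in a sensible way.

I call an API that returns an XML structure. I parse the XML with ElementTree and use tree.iter() to return the tags and text from the XML. Printing my tags and text: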

for node in tree.iter():
    print(f'{node.tag}, {node.text}')

Returns:

Tag              Text
Responses        None
Response         None
ResponseId       393
ResponseText     Please respond “Has this loaded” 
ResponseType     single
ResponseStatus   0

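The Responses tag appears only once per API call, but Response through ResponseStatus form a repeating group, and ResponseId is the key of each group. Each call returns fewer than 100 repeating groups.

A key, Response_key, is returned in the header; it is the parent of all the ResponseIds. My goal is to take this data, convert it to JSON and stream it into BigQuery.

The table structure I need is:

ResponseKey, ResponseID, Response, ResponseText, ResponseType, ResponseStatus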
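
Rows of roughly that shape can be built straight from the element tree, without flattening it into a list first. The following is only a sketch under assumptions: the function name `responses_to_rows`, the shortened sample XML and the key value `"abc123"` are made up for illustration, and the `Response` column is left out because the `Response` element has no text of its own.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up sample in the same shape as the API output.
SAMPLE_XML = (
    "<Responses>"
    "<Response><ResponseId>393</ResponseId>"
    "<ResponseText>Please respond</ResponseText>"
    "<ResponseType>single</ResponseType>"
    "<ResponseStatus>0</ResponseStatus></Response>"
    "<Response><ResponseId>394</ResponseId>"
    "<ResponseText>Please confirm</ResponseText>"
    "<ResponseType>single</ResponseType>"
    "<ResponseStatus>0</ResponseStatus></Response>"
    "</Responses>"
)


def responses_to_rows(xml_text, response_key):
    """Return one dict per <Response> element, tagged with the parent key."""
    root = ET.fromstring(xml_text)
    rows = []
    for response in root.findall("Response"):
        # Every row carries the header key (hypothetical value passed in).
        row = {"ResponseKey": response_key}
        for child in response:
            row[child.tag] = child.text
        rows.append(row)
    return rows


rows = responses_to_rows(SAMPLE_XML, "abc123")
```

Each `Response` element is already a group in the tree, so iterating over the elements keeps the grouping intact.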

The approach I used is:

  1. Create a list with a tree.iter() loop

    node_list = []
    for node in tree.iter():
        node_list.append(node.tag)
        node_list.append(node.text)

  2. Group the list with itertools (I found this a difficult step)

    r = 'Response'
    response_split = [list(y) for x, y in itertools.groupby(node_list, lambda z: z == r) if not x]

Returns:

[['Responses', 'None'], ['None', 'ResponseId', '393', 'ResponseText', 'Please respond “Has this loaded”', 'ResponseType', 'single', 'ResponseStatus', '0'], ['None', 'ResponseId', '394', 'ResponseText', 'Please confirm “Connection made”', 'ResponseType', 'single', 'ResponseStatus', '0']]

  3. Load into a Pandas dataframe and remove all double quotes, in case they cause any problems for BigQuery.
  4. Add ResponseKey to the dataframe as a column.
  5. Convert the dataframe to JSON and pass it to load_table_from_json
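
The last three steps might not need Pandas at all, since `load_table_from_json` in the `google-cloud-bigquery` client takes a list of dictionaries. Below is a hedged sketch, not a tested pipeline: `normalise_rows`, `load_rows`, the `COLUMNS` list and the table ID in the comment are hypothetical names, and schema autodetection is an assumption that may not suit an existing table.

```python
# Columns assumed from the table structure in the question.
COLUMNS = ["ResponseKey", "ResponseId", "ResponseText",
           "ResponseType", "ResponseStatus"]


def normalise_rows(rows, columns=COLUMNS):
    """Keep only the wanted columns and fill missing ones with None."""
    return [{col: row.get(col) for col in columns} for row in rows]


def load_rows(rows, table_id):
    """Load a list of row dicts into BigQuery with a load job."""
    # Imported here so the helper above runs without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        autodetect=True,  # assumption: let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_json(rows, table_id, job_config=job_config)
    job.result()  # wait for the job to finish; raises on failure
    return job.output_rows


cleaned = normalise_rows([{"ResponseId": "393", "ExtendedType": "0"}])
# Hypothetical call, needs credentials and a real table:
# load_rows(cleaned, "my-project.my_dataset.responses")
```

Normalising the rows first gives every row the same keys, which matters here because some elements in the sample (for example `ResponseShortCode`) appear only in some groups.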

It works, but I'm not sure whether it is sensible.

Any suggested improvements would be appreciated.

Here is a sample of the XML:

{"GetResponses":"<Responses><Response><ResponseId>393938<\/ResponseId><ResponseText>Please respond to the following statement:\"The assigned task was easy to complete\"<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393939<\/ResponseId><ResponseText>Did you save your  datafor later? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393940<\/ResponseId><ResponseText>Did you notice how much it cost to find the item? How much was it?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393941<\/ResponseId><ResponseText>Did you select ‘signature on form’? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393942<\/ResponseId><ResponseText>Was it easy to find thethe new page? Why\/why not?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393943<\/ResponseId><ResponseText>Please enter your email. 
So that we can track your responses, we need you to provide this for each task.<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393944<\/ResponseId><ResponseText>Why didn't you save your  datafor later?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393945<\/ResponseId><ResponseText>Why did you save your  datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393946<\/ResponseId><ResponseText>Did you save your  datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393947<\/ResponseId><ResponseText>Why didn't you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393948<\/ResponseId><ResponseText>Why did you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>4444449<\/ResponseId><ResponseText>Did you select ‘signature on form’?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393950<\/ResponseId><ResponseText>Why wasn't it easy to find thethe new page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393951<\/ResponseId><ResponseText>Was it easy to find thethe new 
page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393952<\/ResponseId><ResponseText>Please enter your email addressSo that we can track your responses, we need you to provide this for each task<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><\/Responses>"}
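
The sample is a JSON object whose `GetResponses` value holds the XML as an escaped string, so it may be simplest to decode the JSON first and parse the XML afterwards. A minimal sketch, assuming the payload is valid JSON; the shortened `payload` string below is made up in the same envelope format.

```python
import json
import xml.etree.ElementTree as ET

# Made-up, shortened payload using the same envelope as the sample.
payload = (
    '{"GetResponses":"<Responses><Response>'
    '<ResponseId>393938<\\/ResponseId>'
    '<ResponseType>single<\\/ResponseType>'
    '<\\/Response><\\/Responses>"}'
)

# json.loads turns the escaped "<\/" sequences back into plain "</".
xml_text = json.loads(payload)["GetResponses"]

root = ET.fromstring(xml_text)
response_ids = [r.findtext("ResponseId") for r in root.iter("Response")]
```

Decoding through `json.loads` avoids unescaping the slashes and quotes by hand.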

Sample JSON without all the processing steps:

node_list = []
for node in tree.iter():
    node_list.append(node.tag)
    node_list.append(node.text)

json_format = json.dumps(node_list)
print(json_format)


["Responses", null, "Response", null, "ResponseId", "393938", "ResponseText", "Please respond to the following statement:\"The assigned task was easy to complete\"", "ResponseType", "single", "ResponseStatus", "0", "ExtendedType", "0"]

I'm not sure what the desired output is, but here is one approach:

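import xml.etree.ElementTree as ET
import json

# Path to the input XML file
p = r"d:\tmp.xml"
tree = ET.parse(p)

root = tree.getroot()

json_dict = {}

# Store the root element's tag and text
json_dict[root.tag] = root.text

json_dict['response_list'] = []

# Build one dict per child of the root, mapping each sub-tag to its text
for node in root:
    tmp_dict = {}
    for response_info in node:
        tmp_dict[response_info.tag] = response_info.text
    json_dict['response_list'].append(tmp_dict)

# Write the whole structure out as a single JSON document
with open(r'd:\out.json', 'w') as of:
    json.dump(json_dict, of)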
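
If the file is meant to be loaded into BigQuery, note that its JSON load format is newline-delimited JSON: one object per line rather than one nested document. A small sketch of that serialisation follows; `to_ndjson` and `sample_rows` are hypothetical names and data, not part of the code above.

```python
import json


def to_ndjson(rows):
    """Serialise row dicts as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Made-up rows in the shape of the per-response dictionaries.
sample_rows = [
    {"ResponseId": "393", "ResponseText": 'Please respond "Has this loaded"'},
    {"ResponseId": "394", "ResponseText": "Please confirm"},
]

ndjson = to_ndjson(sample_rows)
```

Because `json.dumps` escapes double quotes and newlines inside values, stripping quotes from the text beforehand is probably unnecessary.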

