
Transforming XML into JSON loadable structure for BigQuery

I'm learning Python on the job and need help improving my solution.

I need to load XML data into BigQuery.

I have it working, but I'm not sure I have done it in a sensible way.

I call an API that returns an XML structure. I use ElementTree to parse the XML and tree.iter() to return the tags and text. Printing my tags and text with:

for node in tree.iter():
    print(f'{node.tag}, {node.text}')

Returns:

Tag              Text
Responses        None
Response         None
ResponseId       393
ResponseText     Please respond “Has this loaded” 
ResponseType     single
ResponseStatus   0

The Responses tag appears only once per API call, but Response through to ResponseStatus form a repeating group, with ResponseId as the key for each group. Each call returns fewer than 100 repeating groups.
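Because each repeating group is a Response element with its fields as direct children, the grouping can be read straight from the tree rather than rebuilt from a flat list. The following is only a sketch: the two-group XML is a made-up sample in the shape described above, and `group_responses` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-group sample in the shape described above.
SAMPLE_XML = (
    "<Responses>"
    "<Response><ResponseId>393</ResponseId><ResponseText>Has this loaded</ResponseText>"
    "<ResponseType>single</ResponseType><ResponseStatus>0</ResponseStatus></Response>"
    "<Response><ResponseId>394</ResponseId><ResponseText>Connection made</ResponseText>"
    "<ResponseType>single</ResponseType><ResponseStatus>0</ResponseStatus></Response>"
    "</Responses>"
)


def group_responses(xml_text):
    """Return {ResponseId: {tag: text, ...}} with one entry per <Response>."""
    root = ET.fromstring(xml_text)
    groups = {}
    for response in root.iter("Response"):
        # Each direct child of <Response> is one field of the group.
        fields = {child.tag: child.text for child in response}
        groups[fields["ResponseId"]] = fields
    return groups


groups = group_responses(SAMPLE_XML)
print(sorted(groups))
```

Keying the result by ResponseId mirrors the description of ResponseId as the key for each group; a plain list of dicts would work just as well if the order of the groups matters.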

A key returned in the header, Response_key, is the parent of all ResponseIds. My aim is to take this data, convert it to JSON and stream it to BigQuery.

The table structure I need is:

ResponseKey, ResponseID, Response, ResponseText, ResponseType, ResponseStatus
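One way to get from the XML to rows in that shape is to build one dict per Response and hand the list to the google-cloud-bigquery client. This is a rough sketch under several assumptions: `build_rows` and `load_rows` are hypothetical helper names, the key from the header is simply passed in as `response_key`, the sample XML is invented, and the Response column is left out because the question does not say what it should hold. `load_rows` is defined but not called here, since it needs credentials and an existing table ID.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-group sample.
SAMPLE_XML = (
    "<Responses><Response><ResponseId>393</ResponseId>"
    "<ResponseText>Has this loaded</ResponseText>"
    "<ResponseType>single</ResponseType>"
    "<ResponseStatus>0</ResponseStatus></Response></Responses>"
)


def build_rows(xml_text, response_key):
    """Build one dict per <Response>, using the table's column names."""
    root = ET.fromstring(xml_text)
    rows = []
    for response in root.findall("Response"):
        rows.append({
            "ResponseKey": response_key,  # parent key taken from the header
            "ResponseID": response.findtext("ResponseId"),
            "ResponseText": response.findtext("ResponseText"),
            "ResponseType": response.findtext("ResponseType"),
            "ResponseStatus": response.findtext("ResponseStatus"),
        })
    return rows


def load_rows(rows, table_id):
    """Load the rows into BigQuery (assumes credentials are configured)."""
    # Imported here so build_rows works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job = client.load_table_from_json(rows, table_id)
    job.result()  # block until the load job finishes


rows = build_rows(SAMPLE_XML, response_key="abc123")
print(rows)
```

A list of flat dicts is the input `load_table_from_json` expects, so no intermediate data frame is needed for this step.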

The approach I use is:

  1. Use tree.iter() to loop and create a list

    node_list = []
    for node in tree.iter():
        node_list.append(node.tag)
        node_list.append(node.text)

  2. Use itertools to group the list (I found this a difficult step)

    r = 'Response'
    response_split = [list(y) for x, y in itertools.groupby(node_list, lambda z: z == r) if not x]

which returns:

[['Responses', 'None'], ['None', 'ResponseId', '393', 'ResponseText', 'Please respond “Has this loaded”', 'ResponseType', 'single', 'ResponseStatus', '0'], ['None', 'ResponseId', '394', 'ResponseText', 'Please confirm “Connection made”', 'ResponseType', 'single', 'ResponseStatus', '0']]
  3. Load into a Pandas data frame and remove any double quotes in case they cause BigQuery any issues.
  4. Add ResponseKey as a column to the data frame.
  5. Convert the data frame to JSON and pass it to load_table_from_json.
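If the flat-list approach is kept, each group produced by groupby can be turned into a dict by pairing tags with texts, which may remove the need for the data frame. The sketch below reproduces steps 1 and 2 on an invented two-group sample; the slicing assumes every group starts with the text of the Response element followed by alternating tag and text entries, and that no text value is exactly 'Response'.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical two-group sample.
xml_text = (
    "<Responses>"
    "<Response><ResponseId>393</ResponseId><ResponseType>single</ResponseType></Response>"
    "<Response><ResponseId>394</ResponseId><ResponseType>text</ResponseType></Response>"
    "</Responses>"
)
tree = ET.ElementTree(ET.fromstring(xml_text))

# Step 1: flat list of tag, text, tag, text, ...
node_list = []
for node in tree.iter():
    node_list.append(node.tag)
    node_list.append(node.text)

# Step 2: split the flat list wherever the 'Response' tag occurs.
r = "Response"
response_split = [
    list(group)
    for is_marker, group in itertools.groupby(node_list, lambda item: item == r)
    if not is_marker
]

# The first group belongs to the root; the rest are one group per Response.
# Each of those starts with the Response element's own text, so the tags sit
# at odd positions and the texts at even positions from index 2 onward.
rows = [dict(zip(group[1::2], group[2::2])) for group in response_split[1:]]
print(rows)
```

The resulting list of dicts already has the shape of a set of JSON rows, so adding ResponseKey is a matter of setting one more key on each dict.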

It works, but I'm not sure it is sensible.

Any suggested improvements would be appreciated.

Here is a sample of the XML:

{"GetResponses":"<Responses><Response><ResponseId>393938<\/ResponseId><ResponseText>Please respond to the following statement:\"The assigned task was easy to complete\"<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393939<\/ResponseId><ResponseText>Did you save your  datafor later? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393940<\/ResponseId><ResponseText>Did you notice how much it cost to find the item? How much was it?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393941<\/ResponseId><ResponseText>Did you select ‘signature on form’? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393942<\/ResponseId><ResponseText>Was it easy to find thethe new page? Why\/why not?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393943<\/ResponseId><ResponseText>Please enter your email. 
So that we can track your responses, we need you to provide this for each task.<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393944<\/ResponseId><ResponseText>Why didn't you save your  datafor later?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393945<\/ResponseId><ResponseText>Why did you save your  datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393946<\/ResponseId><ResponseText>Did you save your  datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393947<\/ResponseId><ResponseText>Why didn't you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393948<\/ResponseId><ResponseText>Why did you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>4444449<\/ResponseId><ResponseText>Did you select ‘signature on form’?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393950<\/ResponseId><ResponseText>Why wasn't it easy to find thethe new page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393951<\/ResponseId><ResponseText>Was it easy to find thethe new 
page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393952<\/ResponseId><ResponseText>Please enter your email addressSo that we can track your responses, we need you to provide this for each task<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><\/Responses>"}
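The sample above is a JSON object whose GetResponses value holds the XML as an escaped string, so it has to be decoded as JSON before ElementTree can parse it. A minimal sketch, using a shortened, made-up payload in the same shape:

```python
import json
import xml.etree.ElementTree as ET

# Shortened, hypothetical payload in the same shape as the sample above.
payload = (
    r'{"GetResponses":"<Responses><Response>'
    r'<ResponseId>393938<\/ResponseId>'
    r'<ResponseText>Say \"hi\". Why\/why not?<\/ResponseText>'
    r'<\/Response><\/Responses>"}'
)

# json.loads undoes the \/ and \" escapes and returns the raw XML string.
xml_text = json.loads(payload)["GetResponses"]

# The decoded string is ordinary XML and can be parsed directly.
root = ET.fromstring(xml_text)
print(root.find("Response/ResponseText").text)
```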

A sample JSON without all the processing steps:

node_list = []
for node in tree.iter():
    node_list.append(node.tag)
    node_list.append(node.text)

json_format = json.dumps(node_list )
print(json_format)


["Responses", null, "Response", null, "ResponseId", "393938", "ResponseText", "Please respond to the following statement:\"The assigned task was easy to complete\"", "ResponseType", "single", "ResponseStatus", "0", "ExtendedType", "0"]
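A flat array like this has no row boundaries, whereas BigQuery's JSON load format is newline-delimited: one JSON object per line. The sketch below serializes a couple of invented rows that way. It also shows that json.dumps escapes embedded double quotes, so they should not need to be stripped from the text beforehand.

```python
import json

# Hypothetical rows; the first text contains embedded double quotes.
rows = [
    {"ResponseId": "393938", "ResponseText": 'Please respond: "The task was easy"'},
    {"ResponseId": "393939", "ResponseText": "Why/why not?"},
]

# One JSON object per line; json.dumps escapes quotes and newlines itself.
ndjson = "\n".join(json.dumps(row) for row in rows)
print(ndjson)

# Reading the lines back recovers the original rows, quotes included.
decoded = [json.loads(line) for line in ndjson.splitlines()]
```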

I'm not sure what the required output is, but this is one way of doing it:

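import xml.etree.ElementTree as ET
import json

# Path to the XML saved from the API response
p = r"d:\tmp.xml"
tree = ET.parse(p)

root = tree.getroot()

json_dict = {}

# The root tag (Responses) has no text of its own, so this stores None
json_dict[root.tag] = root.text

json_dict['response_list'] = []

# Each child of the root is one Response group
for node in root:
    tmp_dict = {}
    # Map every tag in the group to its text
    for response_info in node:
        tmp_dict[response_info.tag] = response_info.text
    json_dict['response_list'].append(tmp_dict)

# Write the result out as JSON
with open(r'd:\out.json', 'w') as of:
    json.dump(json_dict, of)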
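One thing to watch with this approach is that the dicts only contain the tags each Response happens to have; in the sample XML, ResponseShortCode appears in only some groups. A possible variation, sketched here with an invented two-group sample and a hypothetical `response_to_row` helper, is to read a fixed list of columns with findtext, which returns None when a tag is missing:

```python
import xml.etree.ElementTree as ET

# Column names taken from the tags that appear in the sample XML.
COLUMNS = [
    "ResponseId", "ResponseText", "ResponseShortCode",
    "ResponseType", "ResponseStatus", "ExtendedType",
]

# Hypothetical sample: only the first group has a ResponseShortCode.
xml_text = (
    "<Responses>"
    "<Response><ResponseId>1</ResponseId>"
    "<ResponseShortCode>email</ResponseShortCode></Response>"
    "<Response><ResponseId>2</ResponseId></Response>"
    "</Responses>"
)


def response_to_row(response):
    """Return a dict with every column present; missing tags become None."""
    return {column: response.findtext(column) for column in COLUMNS}


rows = [response_to_row(r) for r in ET.fromstring(xml_text).findall("Response")]
print(rows)
```

Every row then has the same keys, which keeps the JSON consistent with a fixed table schema.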
