How to extract data from XML/SOAP in Python

The UK National Gas system publishes a large amount of data that can be accessed from a SOAP server; an example of the returned data (for LNG) is shown below. I've written the code to generate the request and to handle the response, but I'm tripping up on how to extract the returned information. The aim is to load the data into a backend database or a Pandas DataFrame.

In previous code, I've simply traversed the XML using XPath, iterated over the matching tags, and extracted the child data. Thus, I was hoping to extract:

GetPublicationDataWMResult, ApplicableAt, ApplicableFor, Value, ...
LNG Stock Level,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 7050.42286, ...
LNG Capacity,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 6515042480, ...

I tried to use XPath to traverse the children (/Envelope/Body/GetPublicationDataWMResponse/GetPublicationDataWMResult/), but it fails.

The logic works if I sanitize the response with a series of string removals, but that's sub-optimal and bound to break in the future.

EXAMPLE CODE:

import requests
from lxml import objectify

def getXML():

    toDate = "2016-03-12"
    fromDate = "2016-03-12"
    dateType = "gasday"

    url="http://marketinformation.natgrid.co.uk/MIPIws-public/public/publicwebservice.asmx"
    headers = {'content-type': 'application/soap+xml; charset=utf-8'}

    body ="""<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
        <soap12:Body>
            <GetPublicationDataWM xmlns="http://www.NationalGrid.com/MIPI/">
                <reqObject>
                    <LatestFlag>Y</LatestFlag>
                    <ApplicableForFlag>Y</ApplicableForFlag>
                    <ToDate>%s</ToDate>
                    <FromDate>%s</FromDate>
                    <DateType>%s</DateType>
                    <PublicationObjectNameList>
                        <string>LNG Stock Level</string>
                        <string>LNG, Daily Aggregated Available Capacity, D+1</string>
                    </PublicationObjectNameList>
                </reqObject>
            </GetPublicationDataWM>
        </soap12:Body>
    </soap12:Envelope>""" % (toDate, fromDate,dateType)


    response = requests.post(url,data=body,headers=headers)

    return response.content

root = objectify.fromstring(getXML())

Returned XML:

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
    xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <soap:Body>
        <GetPublicationDataWMResponse
            xmlns="http://www.NationalGrid.com/MIPI/">
            <GetPublicationDataWMResult>
                <CLSMIPIPublicationObjectBE>
                    <PublicationObjectName>LNG Stock Level</PublicationObjectName>
                    <PublicationObjectData>
                        <CLSPublicationObjectDataBE>
                            <ApplicableAt>2016-03-13T15:00:07Z</ApplicableAt>
                            <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
                            <Value>7050.42286</Value>
                            <GeneratedTimeStamp>2016-03-13T15:56:00Z</GeneratedTimeStamp>
                            <QualityIndicator></QualityIndicator>
                            <Substituted>N</Substituted>
                            <CreatedDate>2016-03-13T15:56:28Z</CreatedDate>
                        </CLSPublicationObjectDataBE>
                    </PublicationObjectData>
                </CLSMIPIPublicationObjectBE>
                <CLSMIPIPublicationObjectBE>
                    <PublicationObjectName>LNG Capacity</PublicationObjectName>
                    <PublicationObjectData>
                        <CLSPublicationObjectDataBE>
                            <ApplicableAt>2016-03-12T15:30:00Z</ApplicableAt>
                            <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
                            <Value>6515042480</Value>
                            <GeneratedTimeStamp>2016-03-12T16:00:00Z</GeneratedTimeStamp>
                            <QualityIndicator></QualityIndicator>
                            <Substituted>N</Substituted>
                            <CreatedDate>2016-03-12T16:00:20Z</CreatedDate>
                        </CLSPublicationObjectDataBE>
                    </PublicationObjectData>
                </CLSMIPIPublicationObjectBE>
            </GetPublicationDataWMResult>
        </GetPublicationDataWMResponse>
    </soap:Body>
</soap:Envelope>

Using your existing code, I just added this:

from bs4 import BeautifulSoup

res = getXML()
soup = BeautifulSoup(res, 'html.parser')

searchTerms = ['PublicationObjectName', 'ApplicableAt', 'ApplicableFor', 'Value']
# LNG Stock Level,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 7050.42286, ...

for st in searchTerms:
    # html.parser lowercases tag names, so search for the lowercased term;
    # find() returns only the first matching element
    print(st, soup.find(st.lower()).contents[0], sep='\t')

Output:

PublicationObjectName   LNG Stock Level
ApplicableAt    2016-03-13T15:00:07Z
ApplicableFor   2016-03-12T00:00:00Z
Value   7050.42286
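
Because `find` returns only the first match, the snippet above prints a single record. The following is a minimal sketch of the same namespace-ignoring idea applied to every record. It swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` and relies on the `{*}` namespace wildcard, which requires Python 3.8 or later. `SAMPLE` (a trimmed copy of the returned XML) and the helper name `iter_records` are assumptions made for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the returned XML, used here instead of a live request.
SAMPLE = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <GetPublicationDataWMResponse xmlns="http://www.NationalGrid.com/MIPI/">
      <GetPublicationDataWMResult>
        <CLSMIPIPublicationObjectBE>
          <PublicationObjectName>LNG Stock Level</PublicationObjectName>
          <PublicationObjectData>
            <CLSPublicationObjectDataBE>
              <ApplicableAt>2016-03-13T15:00:07Z</ApplicableAt>
              <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
              <Value>7050.42286</Value>
            </CLSPublicationObjectDataBE>
          </PublicationObjectData>
        </CLSMIPIPublicationObjectBE>
        <CLSMIPIPublicationObjectBE>
          <PublicationObjectName>LNG Capacity</PublicationObjectName>
          <PublicationObjectData>
            <CLSPublicationObjectDataBE>
              <ApplicableAt>2016-03-12T15:30:00Z</ApplicableAt>
              <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
              <Value>6515042480</Value>
            </CLSPublicationObjectDataBE>
          </PublicationObjectData>
        </CLSMIPIPublicationObjectBE>
      </GetPublicationDataWMResult>
    </GetPublicationDataWMResponse>
  </soap:Body>
</soap:Envelope>"""


def iter_records(xml_text):
    """Yield (name, applicable_at, applicable_for, value) for every data entry."""
    root = ET.fromstring(xml_text)
    # "{*}" matches the tag in any namespace, so no prefix mapping is needed.
    for obj in root.findall(".//{*}CLSMIPIPublicationObjectBE"):
        name = obj.findtext("{*}PublicationObjectName")
        for data in obj.findall(".//{*}CLSPublicationObjectDataBE"):
            yield (
                name,
                data.findtext("{*}ApplicableAt"),
                data.findtext("{*}ApplicableFor"),
                data.findtext("{*}Value"),
            )


for record in iter_records(SAMPLE):
    print("\t".join(record))
```

Ignoring namespaces is convenient for a quick look at the data, but it would also match same-named elements from any other namespace, so it is best treated as a shortcut.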

This is a FAQ in the XML+XPath topic: it involves XML with a default namespace.

An XML element on which a default namespace is declared, and its descendant elements without a prefix, are implicitly in that default namespace. In an XPath expression, to reference an element in a namespace you need to use a prefix that has been mapped to the corresponding namespace URI. Using lxml, the code would look roughly like the following:

from lxml import etree

root = etree.fromstring(getXML())

# map prefix 'd' to the default namespace URI
ns = { 'd': 'http://www.NationalGrid.com/MIPI/'}

publication_objects = root.xpath('//d:CLSMIPIPublicationObjectBE', namespaces=ns)
for obj in publication_objects:
    name = obj.find('d:PublicationObjectName', ns).text

    data = obj.find('d:PublicationObjectData/d:CLSPublicationObjectDataBE', ns)
    applicable_at = data.find('d:ApplicableAt', ns).text
    applicable_for = data.find('d:ApplicableFor', ns).text
    # todo: extract other relevant data and process as needed
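
To round off the todo above, here is a minimal sketch that collects every record into a list of dicts shaped like the table the question asks for. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra packages; the prefix-to-URI mapping works the same way. `SAMPLE_XML`, `FIELDS`, and `extract_rows` are hypothetical names introduced for this illustration, and the sample is a trimmed copy of the response shown in the question.

```python
import xml.etree.ElementTree as ET

# Map the prefix 'd' to the default namespace URI used in the response.
NS = {"d": "http://www.NationalGrid.com/MIPI/"}

# Child elements to copy from each CLSPublicationObjectDataBE (illustrative choice).
FIELDS = ("ApplicableAt", "ApplicableFor", "Value")

# Trimmed copy of the returned XML, standing in for getXML().
SAMPLE_XML = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <GetPublicationDataWMResponse xmlns="http://www.NationalGrid.com/MIPI/">
      <GetPublicationDataWMResult>
        <CLSMIPIPublicationObjectBE>
          <PublicationObjectName>LNG Stock Level</PublicationObjectName>
          <PublicationObjectData>
            <CLSPublicationObjectDataBE>
              <ApplicableAt>2016-03-13T15:00:07Z</ApplicableAt>
              <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
              <Value>7050.42286</Value>
            </CLSPublicationObjectDataBE>
          </PublicationObjectData>
        </CLSMIPIPublicationObjectBE>
        <CLSMIPIPublicationObjectBE>
          <PublicationObjectName>LNG Capacity</PublicationObjectName>
          <PublicationObjectData>
            <CLSPublicationObjectDataBE>
              <ApplicableAt>2016-03-12T15:30:00Z</ApplicableAt>
              <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
              <Value>6515042480</Value>
            </CLSPublicationObjectDataBE>
          </PublicationObjectData>
        </CLSMIPIPublicationObjectBE>
      </GetPublicationDataWMResult>
    </GetPublicationDataWMResponse>
  </soap:Body>
</soap:Envelope>"""


def extract_rows(xml_text):
    """Return one dict per data entry, keyed by publication name and FIELDS."""
    root = ET.fromstring(xml_text)
    rows = []
    for obj in root.iterfind(".//d:CLSMIPIPublicationObjectBE", NS):
        name = obj.findtext("d:PublicationObjectName", namespaces=NS)
        path = "d:PublicationObjectData/d:CLSPublicationObjectDataBE"
        for data in obj.iterfind(path, NS):
            row = {"PublicationObjectName": name}
            for field in FIELDS:
                # Values stay as strings; convert types downstream as needed.
                row[field] = data.findtext("d:" + field, namespaces=NS)
            rows.append(row)
    return rows


rows = extract_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

A list of dicts like this can be passed directly to `pandas.DataFrame(rows)` or inserted row by row into a database table, which covers both targets mentioned in the question.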
