
XPath with LXML Element

I am trying to parse an XML document using lxml etree. The XML document I am parsing looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/">
    <codeBook version="2.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ddi:codebook:2_5" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd">
        <docDscr>
            <citation>
                <titlStmt>
                    <titl>Test Title</titl>
                </titlStmt>
                <prodStmt>
                    <prodDate/>
                </prodStmt>
            </citation>
        </docDscr>
        <stdyDscr>
            <citation>
                <titlStmt>
                    <titl>Test Title 2</titl>
                    <IDNo agency="UKDA">101</IDNo>
                </titlStmt>
                <rspStmt>
                    <AuthEnty>TestAuthEntry</AuthEnty>
                </rspStmt>
                <prodStmt>
                    <copyright>Yes</copyright>
                </prodStmt>
                <distStmt/>
                <verStmt>
                    <version date="">1</version>
                </verStmt>
            </citation>
            <stdyInfo>
                <subject>
                    <keyword>2009</keyword>
                    <keyword>2010</keyword>
                    <topcClas>CLASS</topcClas>
                    <topcClas>ffdsf</topcClas>
                </subject>
                <abstract>This is an abstract piece of text.</abstract>
                <sumDscr>
                    <timePrd event="single">2020</timePrd>
                    <nation>UK</nation>
                    <anlyUnit>Test</anlyUnit>
                    <universe>test</universe>
                    <universe>hello</universe>
                    <dataKind>fdsfdsf</dataKind>
                </sumDscr>
            </stdyInfo>
            <method>
                <dataColl>
                    <timeMeth>test timemeth</timeMeth>
                    <dataCollector>test data collector</dataCollector>
                    <sampProc>test sampprocess</sampProc>
                    <deviat>test deviat</deviat>
                    <collMode>test collMode</collMode>
                    <sources/>
                </dataColl>
            </method>
            <dataAccs>
                <setAvail>
                    <accsPlac>Test accsPlac</accsPlac>
                </setAvail>
                <useStmt>
                    <restrctn>NONE</restrctn>
                </useStmt>
            </dataAccs>
            <othrStdyMat>
                <relPubl>122</relPubl>
                <relPubl>12332</relPubl>
            </othrStdyMat>
        </stdyDscr>
    </codeBook>
</metadata>

I wrote the following code to try to process it:

from lxml import etree

# Read bytes so lxml can honor the encoding declaration in the XML prolog
with open('/vagrant/out2.xml', 'rb') as f:
    xml_str = f.read()

xml_doc = etree.fromstring(xml_str)

From what I understand of the lxml XPath docs, I should be able to get the text from a specific element as follows:

xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')

However, when I run this it returns an empty list.

The only XPath expression I can get to return anything is a wildcard:

xml_doc.xpath('*')

That returns [<Element {ddi:codebook:2_5}codeBook at 0x7f8da8a413f8>].

I've read through the docs and I don't understand what is going wrong here. Any help is appreciated.

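You need to take the default namespaces into account. Both metadata and codeBook declare one with xmlns="...", and an unprefixed name in an XPath 1.0 expression only matches elements that are in no namespace. So instead of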

xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')

use

xml_doc.xpath(
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()',
    namespaces={
        'oai': 'http://www.openarchives.org/OAI/2.0/', 
        'ddi': 'ddi:codebook:2_5'
    }
)
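
To try this without the original file, here is a minimal, self-contained sketch. It assumes a trimmed copy of the document from the question, inlined as bytes; the names XML, NS, and doc_titles are made up for the illustration and are not part of lxml. The prefixes oai and ddi are arbitrary labels as well: only the namespace URIs have to match the ones declared in the document.

```python
from lxml import etree

# Trimmed copy of the question's document (an assumption for this sketch).
# It is a bytes literal because the prolog carries an encoding declaration.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://www.openarchives.org/OAI/2.0/">
    <codeBook version="2.5" xmlns="ddi:codebook:2_5">
        <docDscr>
            <citation>
                <titlStmt>
                    <titl>Test Title</titl>
                </titlStmt>
            </citation>
        </docDscr>
    </codeBook>
</metadata>"""

# Prefix-to-URI mapping used only by the XPath expressions below.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "ddi": "ddi:codebook:2_5",
}


def doc_titles(element):
    """Return the text of every docDscr title, using namespace-qualified steps."""
    return element.xpath(
        "/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation"
        "/ddi:titlStmt/ddi:titl/text()",
        namespaces=NS,
    )


root = etree.fromstring(XML)

# Namespace-qualified path: finds the title.
titles = doc_titles(root)

# Unprefixed path from the question: matches nothing, because every
# element in the document lives in a namespace.
unprefixed = root.xpath("/metadata/codeBook/docDscr/citation/titlStmt/titl/text()")

print(titles)
print(unprefixed)
```

Keeping the mapping in one dictionary means it can be reused for every query against the same document instead of being repeated in each call.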
