XPath with LXML Element

I am trying to parse an XML document using lxml etree. The XML doc I am parsing looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/">
    <codeBook version="2.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ddi:codebook:2_5" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd">
        <docDscr>
            <citation>
                <titlStmt>
                    <titl>Test Title</titl>
                </titlStmt>
                <prodStmt>
                    <prodDate/>
                </prodStmt>
            </citation>
        </docDscr>
        <stdyDscr>
            <citation>
                <titlStmt>
                    <titl>Test Title 2</titl>
                    <IDNo agency="UKDA">101</IDNo>
                </titlStmt>
                <rspStmt>
                    <AuthEnty>TestAuthEntry</AuthEnty>
                </rspStmt>
                <prodStmt>
                    <copyright>Yes</copyright>
                </prodStmt>
                <distStmt/>
                <verStmt>
                    <version date="">1</version>
                </verStmt>
            </citation>
            <stdyInfo>
                <subject>
                    <keyword>2009</keyword>
                    <keyword>2010</keyword>
                    <topcClas>CLASS</topcClas>
                    <topcClas>ffdsf</topcClas>
                </subject>
                <abstract>This is an abstract piece of text.</abstract>
                <sumDscr>
                    <timePrd event="single">2020</timePrd>
                    <nation>UK</nation>
                    <anlyUnit>Test</anlyUnit>
                    <universe>test</universe>
                    <universe>hello</universe>
                    <dataKind>fdsfdsf</dataKind>
                </sumDscr>
            </stdyInfo>
            <method>
                <dataColl>
                    <timeMeth>test timemeth</timeMeth>
                    <dataCollector>test data collector</dataCollector>
                    <sampProc>test sampprocess</sampProc>
                    <deviat>test deviat</deviat>
                    <collMode>test collMode</collMode>
                    <sources/>
                </dataColl>
            </method>
            <dataAccs>
                <setAvail>
                    <accsPlac>Test accsPlac</accsPlac>
                </setAvail>
                <useStmt>
                    <restrctn>NONE</restrctn>
                </useStmt>
            </dataAccs>
            <othrStdyMat>
                <relPubl>122</relPubl>
                <relPubl>12332</relPubl>
            </othrStdyMat>
        </stdyDscr>
    </codeBook>
</metadata>

I wrote the following code to try and process it:

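from lxml import etree

# Read as bytes: on Python 3, lxml rejects a str that carries an encoding declaration.
with open('/vagrant/out2.xml', 'rb') as f:
    xml_str = f.read()

# fromstring() returns the root element, here <metadata>.
xml_doc = etree.fromstring(xml_str)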

From what I understand of the lxml XPath docs, I should be able to get the text of a specific element as follows:

xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')

However, when I run this it returns an empty list.

The only XPath expression I can get to return anything is a wildcard:

xml_doc.xpath('*')

which returns [<Element {ddi:codebook:2_5}codeBook at 0x7f8da8a413f8>].

I've read through the docs and I'm not understanding what is going wrong with this. Any help is appreciated.

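You need to take the default namespaces into account. Both metadata and codeBook declare one (xmlns="..."), so every element in the document is namespaced even though none carries a prefix. XPath 1.0 has no concept of a default namespace: an unprefixed name in a path only matches elements that are in no namespace. So instead of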

xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')

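bind prefixes to the two namespace URIs and use them in the path (the prefix names are arbitrary; only the URIs must match the document):

xml_doc.xpath(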
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()',
    namespaces={
        'oai': 'http://www.openarchives.org/OAI/2.0/', 
        'ddi': 'ddi:codebook:2_5'
    }
)
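
To see the difference in isolation, here is a minimal, self-contained sketch. It assumes a trimmed-down copy of the document from the question embedded as a bytes literal rather than read from /vagrant/out2.xml; the variable names (xml_bytes, root, ns, unprefixed, prefixed, clark) are made up for illustration. The last query uses the {uri}tag form that lxml itself prints in element reprs, as an alternative to a prefix map when you go through findtext() instead of xpath().

```python
from lxml import etree

# Trimmed-down copy of the question's document (illustrative only).
# Bytes input, because lxml refuses a str that has an encoding declaration.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://www.openarchives.org/OAI/2.0/">
    <codeBook version="2.5" xmlns="ddi:codebook:2_5">
        <docDscr>
            <citation>
                <titlStmt>
                    <titl>Test Title</titl>
                </titlStmt>
            </citation>
        </docDscr>
    </codeBook>
</metadata>"""

root = etree.fromstring(xml_bytes)

# Prefix names are arbitrary; the URIs must match the xmlns values exactly.
ns = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "ddi": "ddi:codebook:2_5",
}

# Unprefixed names only match elements in no namespace, so nothing is found.
unprefixed = root.xpath("/metadata/codeBook/docDscr/citation/titlStmt/titl/text()")

# Prefixed names are resolved through the namespaces mapping.
prefixed = root.xpath(
    "/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()",
    namespaces=ns,
)

# Alternative: the {namespace-uri}tag form with findtext(), no prefix map needed.
clark = root.findtext(".//{ddi:codebook:2_5}titl")

print(unprefixed)  # []
print(prefixed)    # ['Test Title']
print(clark)       # Test Title
```

Note that metadata and codeBook live in different namespaces, which is why the path needs two prefixes: oai for the outer wrapper and ddi for everything from codeBook down.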
