
Python XML Parsing Child Tag

I am trying to get the contents of a child tag using lxml. The XML file I am parsing is valid, but when I try to parse the child element, lxml reports invalid XML. Other posts suggest that this error usually appears when a closing tag is missing, yet the file parses fine in a browser. Any ideas why this is happening?

Contents of XML file (test.xml):

<?xml version="1.0" encoding="UTF-8"?>
<Group id="RHEL-07-010010">
    <title>SRG-OS-000257-GPOS-00098</title>
    <description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt;   </description>
    <Rule id="RHEL-07-010010_rule" severity="high" weight="10.0">
      <version>RHEL-07-010010</version>
      <title>The file permissions, ownership, and group membership of system files and commands must match the vendor values.</title>
      <description>&lt;VulnDiscussion&gt;Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.

Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-  GPOS-00108&lt;/VulnDiscussion&gt;&lt;FalsePositives&gt;&lt; /FalsePositives&gt;&lt;FalseNegatives&gt;&lt; /FalseNegatives&gt;&lt;Documentable&gt;false&lt; /Documentable&gt;&lt;Mitigations&gt;&lt; /Mitigations&gt;&lt;SecurityOverrideGuidance&gt;&lt; /SecurityOverrideGuidance&gt;&lt;PotentialImpacts&gt;&lt; /PotentialImpacts&gt;&lt;ThirdPartyTools&gt;&lt; /ThirdPartyTools&gt;&lt;MitigationControl&gt;&lt; /MitigationControl&gt;&lt;Responsibility&gt;&lt; /Responsibility&gt;&lt;IAControls&gt;&lt;/IAControls&gt;</description>
      <ident system="http://iase.disa.mil/cci">CCI-001494</ident>
      <ident system="http://iase.disa.mil/cci">CCI-001496</ident>
      <fixtext fixref="F-RHEL-07-010010_fix">Run the following command to  determine which package owns the file:

# rpm -qf &lt;filename&gt;

Reset the permissions of files within a package with the following command:

#rpm --setperms &lt;packagename&gt;

Reset the user and group ownership of files within a package with the following command:

#rpm --setugids &lt;packagename&gt;</fixtext>
      <fix id="F-RHEL-07-010010_fix" />
      <check system="C-RHEL-07-010010_chk">
        <check-content-ref name="M" href="VMS_XCCDF_Benchmark_SRG.xml" />
            <check-content>Verify the file permissions, ownership, and group  membership of system files and commands match the vendor values.
Check the file permissions, ownership, and group membership of system files and commands with the following command:

# rpm -Va | grep '^.M'

If there is any output from the command, this is a finding.</check-content>
      </check>
    </Rule>
  </Group>

I am trying to get the contents of the VulnDiscussion tag. I can get the contents of its parent tag, description, like this:

from lxml import etree as ET

xml = ET.parse("test.xml")
for description in xml.xpath('//description/text()'):
    print(description)

This produces the following output:

<GroupDescription></GroupDescription>
<VulnDiscussion>Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.

Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-GPOS-00108</VulnDiscussion>   <FalsePositives></FalsePositives><FalseNegatives> </FalseNegatives><Documentable>false</Documentable><Mitigations></Mitigations> <SecurityOverrideGuidance></SecurityOverrideGuidance><PotentialImpacts> </PotentialImpacts><ThirdPartyTools></ThirdPartyTools><MitigationControl> </MitigationControl><Responsibility></Responsibility><IAControls></IAControls>

So far so good. Next I try to extract the contents of VulnDiscussion with this code:

for description in xml.xpath('//description/text()'):
    vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
    print(vulnDiscussion)

This raises the following error:

 vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
  File "src/lxml/lxml.etree.pyx", line 3192, in lxml.etree.XML (src/lxml/lxml.etree.c:78763)
  File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118341)
  File "src/lxml/parser.pxi", line 1736, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:117021)
  File "src/lxml/parser.pxi", line 1102, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:111265)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
  File "<string>", line 3
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 3,  column 79

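An XML document can have only one root element. Each string returned by xml.xpath('//description/text()') can hold several sibling elements (VulnDiscussion, FalsePositives, and so on), so passing it straight to ET.XML fails with "Extra content at the end of the document". Wrap the text in a single element so that the parsed document has exactly one root.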

Also note that the escaped text in the original XML has a space before the slash in each closing tag (for example < /FalsePositives>). That is not well-formed, so remove the space before parsing:
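To see the two problems in isolation, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages; the parses helper and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if text is a well-formed XML document, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Two top-level elements and no single root (hypothetical sample data).
fragment = "<A>one</A><B>two</B>"

print(parses(fragment))                          # fragment with two roots
print(parses("<Root>" + fragment + "</Root>"))   # same fragment under one root
print(parses("<Root><A>one< /A></Root>"))        # space between '<' and '/'
```

The first and third inputs are rejected and the second is accepted, which matches the two fixes described above.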

from lxml import etree as ET

xml = ET.parse("test.xml")

for description in xml.xpath('//description/text()'):
    # Add a root tag and remove the space before each closing tag
    x = ET.XML('<Testroot>' + description.replace('< /', '</') + '</Testroot>')
    vulnDiscussion = next(iter(x.xpath('//VulnDiscussion/text()')), None)
    if vulnDiscussion:
        print(vulnDiscussion)

Output

    Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.

    Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-  GPOS-00108
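If the same extraction is needed for several child tags, the wrap-and-repair step could be pulled into a small helper. The following is only a sketch under assumptions: child_text, the Wrapper root name, and the sample string are hypothetical, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it is self-contained. The same idea should carry over to lxml's ET.XML.

```python
import re
import xml.etree.ElementTree as ET

def child_text(fragment, tag, root="Wrapper"):
    """Return the text of the first <tag> in a multi-root fragment, or None."""
    # Repair closing tags written as "< /Name>" (whitespace after "<").
    repaired = re.sub(r"<\s+/", "</", fragment)
    try:
        # Wrap the fragment so the parser sees exactly one root element.
        wrapped = ET.fromstring("<{0}>{1}</{0}>".format(root, repaired))
    except ET.ParseError:
        return None  # fragment is still not well-formed after the repair
    return wrapped.findtext(".//" + tag)

# Hypothetical sample shaped like the unescaped description text.
sample = (
    "<VulnDiscussion>Some text</VulnDiscussion>"
    "<FalsePositives>< /FalsePositives>"
    "<Documentable>false< /Documentable>"
)

print(child_text(sample, "VulnDiscussion"))
print(child_text(sample, "Documentable"))
print(child_text(sample, "Missing"))
```

Returning None both for a missing tag and for an unparseable fragment keeps the calling loop simple. A caller that needs to tell those cases apart could let the ParseError propagate instead.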


 