
Python XML Parsing Child Tag

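I'm trying to get the contents of a child tag using lxml. The XML file I'm parsing is valid, but for some reason, when I try to parse the child element, lxml seems to think my XML is invalid. From other posts I've seen that this error is usually generated when a closing tag is missing, but the XML parses fine in a browser. Any ideas why this is happening?

Contents of the XML file (test.xml):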

<?xml version="1.0" encoding="UTF-8"?>
<Group id="RHEL-07-010010">
    <title>SRG-OS-000257-GPOS-00098</title>
    <description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt;   </description>
    <Rule id="RHEL-07-010010_rule" severity="high" weight="10.0">
      <version>RHEL-07-010010</version>
      <title>The file permissions, ownership, and group membership of system files and commands must match the vendor values.</title>
      <description>&lt;VulnDiscussion&gt;Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.

Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-  GPOS-00108&lt;/VulnDiscussion&gt;&lt;FalsePositives&gt;&lt; /FalsePositives&gt;&lt;FalseNegatives&gt;&lt; /FalseNegatives&gt;&lt;Documentable&gt;false&lt; /Documentable&gt;&lt;Mitigations&gt;&lt; /Mitigations&gt;&lt;SecurityOverrideGuidance&gt;&lt; /SecurityOverrideGuidance&gt;&lt;PotentialImpacts&gt;&lt; /PotentialImpacts&gt;&lt;ThirdPartyTools&gt;&lt; /ThirdPartyTools&gt;&lt;MitigationControl&gt;&lt; /MitigationControl&gt;&lt;Responsibility&gt;&lt; /Responsibility&gt;&lt;IAControls&gt;&lt;/IAControls&gt;</description>
      <ident system="http://iase.disa.mil/cci">CCI-001494</ident>
      <ident system="http://iase.disa.mil/cci">CCI-001496</ident>
      <fixtext fixref="F-RHEL-07-010010_fix">Run the following command to  determine which package owns the file:

# rpm -qf &lt;filename&gt;

Reset the permissions of files within a package with the following command:

#rpm --setperms &lt;packagename&gt;

Reset the user and group ownership of files within a package with the following command:

#rpm --setugids &lt;packagename&gt;</fixtext>
      <fix id="F-RHEL-07-010010_fix" />
      <check system="C-RHEL-07-010010_chk">
        <check-content-ref name="M" href="VMS_XCCDF_Benchmark_SRG.xml" />
            <check-content>Verify the file permissions, ownership, and group  membership of system files and commands match the vendor values.
Check the file permissions, ownership, and group membership of system files and commands with the following command:

# rpm -Va | grep '^.M'

If there is any output from the command, this is a finding.</check-content>
      </check>
    </Rule>
  </Group>

I'm trying to get the contents of the VulnDiscussion tag. I can get the contents of its parent tag, description, as follows:

from lxml import etree as ET

xml = ET.parse("test.xml")
for description in xml.xpath('//description/text()'):
    print(description)

This produces the following output:

<GroupDescription></GroupDescription>
<VulnDiscussion>Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.

Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-GPOS-00108</VulnDiscussion>   <FalsePositives></FalsePositives><FalseNegatives> </FalseNegatives><Documentable>false</Documentable><Mitigations></Mitigations> <SecurityOverrideGuidance></SecurityOverrideGuidance><PotentialImpacts> </PotentialImpacts><ThirdPartyTools></ThirdPartyTools><MitigationControl> </MitigationControl><Responsibility></Responsibility><IAControls></IAControls>

So far so good. Now I try to extract the contents of VulnDiscussion with the following code:

for description in xml.xpath('//description/text()'):
    vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
    print(vulnDiscussion)

And I get the following error:

 vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
  File "src/lxml/lxml.etree.pyx", line 3192, in lxml.etree.XML (src/lxml/lxml.etree.c:78763)
  File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118341)
  File "src/lxml/parser.pxi", line 1736, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:117021)
  File "src/lxml/parser.pxi", line 1102, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:111265)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
  File "<string>", line 3
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 3,  column 79

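An XML document can have only one root element, but a string returned by xml.xpath('//description/text()') can contain several top-level elements (VulnDiscussion, FalsePositives, and so on). Wrap them all in a single element so that the text you parse has exactly one root.

Also note that in the original XML, the escaped text has a space before the slash in many of the closing tags (for example '< /FalsePositives>'), which you should remove.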

from lxml import etree as ET

xml = ET.parse("test.xml")

for description in xml.xpath('//description/text()'):
    # Add a root tag and remove the space before the slash in closing tags.
    x = ET.XML('<Testroot>' + description.replace('< /', '</') + '</Testroot>')
    vulnDiscussion = next(iter(x.xpath('//VulnDiscussion/text()')), None)
    if vulnDiscussion:
        print(vulnDiscussion)

Output

    Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.

    Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-  GPOS-00108
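The same idea can be tried without test.xml. The following is only a minimal sketch: SAMPLE is a made-up string that imitates the unescaped description text, and extract_child_text and the wrapper tag name are hypothetical names introduced here, not part of lxml. It falls back to the standard library's ElementTree if lxml is not installed, on the assumption that both expose ParseError, XML, and findtext.

```python
try:
    from lxml import etree as ET
except ImportError:  # assumption: the stdlib parser is close enough for this demo
    import xml.etree.ElementTree as ET

# Hypothetical sample imitating the unescaped <description> text:
# several sibling elements, no common root, and "< /" in closing tags.
SAMPLE = (
    "<VulnDiscussion>Access control is weakened.</VulnDiscussion>"
    "<FalsePositives>< /FalsePositives>"
    "<Documentable>false< /Documentable>"
)


def extract_child_text(fragment, tag):
    """Return the text of the first <tag> child in a rootless fragment, or None."""
    # Repair the "< /Tag>" closing tags, then wrap everything in one root.
    repaired = fragment.replace("< /", "</")
    root = ET.XML("<wrapper>" + repaired + "</wrapper>")
    return root.findtext(tag)


# Parsing the raw fragment fails because it is not a single-rooted document.
try:
    ET.XML(SAMPLE)
    raw_parse_failed = False
except ET.ParseError:
    raw_parse_failed = True

discussion = extract_child_text(SAMPLE, "VulnDiscussion")
print(raw_parse_failed)
print(discussion)
```

One caveat with this approach: the blanket replace('< /', '</') would also rewrite any genuine "< /" sequence inside the text, so it is only reasonable when the input is known to follow the pattern shown in the question.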
