
How to search for a specific tag in an XML file with Python

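I have a very large and complex XML file, and I want to extract the text_body elements from it. I need to skip the other trees and branches and get only the specific parts that look like this: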

<req id="1">
    <text_body>
        Upon the USB being plugged in the system shall be able to be deployed and operational in less than 1 minute.
    </text_body>
</req>
<req id="2">
    <text_body>
    The system shall be able to handle 1000 customers logged in concurrently at the same time.
    </text_body>
</req>
<req id="CO-1">
    <text_body>
        Must use a SQL based database. SQL standard is the most widely used database format. Restricting to SQL allows easy of use and compatibility for Web Store.
    </text_body>
</req>
<req id="CO-2">
    <text_body>
        Compatibility is only tested and verified for Microsoft Internet Explorer version 6 and 7, Netscape Communicator Version 4 and 5. Other versions may not be 100&#37; compatible. Also other browsers such as Mozilla or Firefox may not be 100&#37; compatible.
    </text_body>
</req>
<req id="3">
    <text_body>
The system shall adhere to the following hardware requirements:
    <itemize>
        <item>4GB Flash ram chip</item>
        <item>128MB SDRAM</item>
        <item>Intel XScale PXA270 520-MHz chipset</item>
        <item>OS: Apache web server</item>
        <item>Database: MySQL</item>
    </itemize>
    </text_body>
</req>

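I need to get the string inside text_body, but how do I write code that says, in effect, "return the string for any id"? As you can see, the ids differ. In the last one, text_body also contains an itemize element that I don't need. There are similar questions, such as Q1 and Q2; I tried to get help from them, but they did not return what I need. How can I do this?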

Update: I need output like this:
Requirement 1: first text_body
Requirement 2: second text_body

Is this what you are looking for?

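from bs4 import BeautifulSoup

# Read the file and close it when done.
with open('test.xml') as f:
    soup = BeautifulSoup(f.read(), features='lxml')
# [:2] limits the output to the first two text_body elements.
for text_body in soup.find_all('text_body')[:2]:
    print(text_body.get_text().strip())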

Output

Upon the USB being plugged in the system shall be able to be deployed and operational in less than 1 minute.
The system shall be able to handle 1000 customers logged in concurrently at the same time.

You can use Python's built-in library for handling XML files:

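import xml.etree.ElementTree as ET

# Parse the file; it must be well-formed, with a single root element.
tree = ET.parse('your/xml_file.xml')
root = tree.getroot()
# iter() finds <req> at any depth; findtext() returns only the text before
# the first child element, so a nested <itemize> is skipped.
text_body_strings = [req.findtext('text_body', default='') for req in root.iter('req')]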

You may find that text_body_strings needs some text cleanup, but that is another topic.
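For the cleanup and the id-labelled output requested in the update, something along these lines might work. This is only a sketch: the `<reqs>` wrapper element, the `clean` and `labelled_text_bodies` helper names, and the shortened sample are assumptions made for illustration, since the question shows only a fragment of the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: part of the question's fragment, wrapped in an
# assumed <reqs> root so that it is well-formed XML.
SAMPLE = """<reqs>
<req id="1">
    <text_body>
        Upon the USB being plugged in the system shall be able to be
        deployed and operational in less than 1 minute.
    </text_body>
</req>
<req id="3">
    <text_body>
The system shall adhere to the following hardware requirements:
    <itemize>
        <item>4GB Flash ram chip</item>
    </itemize>
    </text_body>
</req>
</reqs>"""


def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join((text or "").split())


def labelled_text_bodies(root):
    """Return 'Requirement <id>: <text>' for every <req> at any depth."""
    lines = []
    for req in root.iter("req"):
        body = req.find("text_body")
        if body is None:
            continue  # skip requirements that have no text_body
        # body.text is only the text before the first child element,
        # so the nested <itemize> content is left out.
        lines.append(f"Requirement {req.get('id')}: {clean(body.text)}")
    return lines


root = ET.fromstring(SAMPLE)
result = labelled_text_bodies(root)
for line in result:
    print(line)
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`. If the real file has no single root element, it would need to be wrapped in one before parsing.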

The documentation for this package, xml.etree.ElementTree, is part of the Python standard library documentation.
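Because the question describes the file as very large, a streaming variant may be worth considering. The sketch below uses `ET.iterparse` from the same standard library module so that each `req` subtree can be dropped once it has been read. The `<reqs>` root, the abridged sample, and the `iter_text_bodies` name are assumptions for illustration, not part of the original file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, abridged sample wrapped in an assumed <reqs> root.
SAMPLE = b"""<reqs>
<req id="2">
    <text_body>
    The system shall be able to handle 1000 customers logged in concurrently at the same time.
    </text_body>
</req>
<req id="CO-1">
    <text_body>
        Must use a SQL based database.
    </text_body>
</req>
</reqs>"""


def iter_text_bodies(source):
    """Yield (id, text) for each <req>, reading the XML incrementally.

    `source` is a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "req":
            continue
        body = elem.find("text_body")
        text = body.text if body is not None and body.text else ""
        # Read the id before clear(), which also resets attributes.
        yield elem.get("id"), " ".join(text.split())
        elem.clear()  # drop the processed subtree to limit memory use


pairs = list(iter_text_bodies(io.BytesIO(SAMPLE)))
for req_id, text in pairs:
    print(f"Requirement {req_id}: {text}")
```

Whether this is needed depends on the file size. For files that fit comfortably in memory, `ET.parse` is simpler.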

