
Python: Extract XML texts except under certain tags

I have this example XML file:

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
  <author>John Smith</author>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
 <author>John Doe</author>
</page>

This XML may have multiple levels (i.e. more than 2) and may contain other tags. I want to extract all texts except those under the tag "content", so that I get a list of strings as follows:

['Chapter 1', 'John Smith', 'Chapter 2', 'John Doe']

I'm implementing this task using ElementTree. Is there an elegant, clean solution?

import bs4

xml = '''<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
  <author>John Smith</author>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
 <author>John Doe</author>
</page>'''

# Parse with the lxml parser, then collect the title and author text of each <page>
soup = bs4.BeautifulSoup(xml, 'lxml')
[(page.title.text, page.author.text) for page in soup('page')]

Output:

[('Chapter 1', 'John Smith'), ('Chapter 2', 'John Doe')]

Use BeautifulSoup as the XML parser; see its documentation for reference.
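Since the question asks about ElementTree and about nesting of arbitrary depth, here is a minimal sketch that uses only the standard library. It makes two assumptions that are not in the original posts: the fragment is wrapped in a synthetic <root> element (the sample has several top-level <page> elements, so it is not a well-formed document on its own), and the helper name texts_except is purely illustrative. It walks the tree recursively and skips any subtree whose tag is in the excluded set.

```python
import xml.etree.ElementTree as ET

xml = '''<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
  <author>John Smith</author>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
 <author>John Doe</author>
</page>'''


def texts_except(elem, skip_tags):
    """Yield non-blank text under elem, skipping subtrees whose tag is in skip_tags.

    Hypothetical helper for illustration; not part of ElementTree.
    """
    if elem.tag in skip_tags:
        return
    # Text that appears before the first child element
    if elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        yield from texts_except(child, skip_tags)
        # Tail text follows the child's closing tag, so it belongs to elem
        # and is kept even when the child itself was skipped
        if child.tail and child.tail.strip():
            yield child.tail.strip()


# Assumption: wrap the fragment in a synthetic root so it parses as one document
root = ET.fromstring(f"<root>{xml}</root>")
result = list(texts_except(root, {"content"}))
print(result)
```

Unlike the BeautifulSoup snippet, this does not hard-code the title and author tags, so it should also cope with deeper nesting and other tag names, and it yields a flat list rather than a list of tuples. If several tags need to be excluded, pass them all in the set, e.g. {"content", "summary"}.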
