Python ElementTree XML parsing

Question

I am using Python's ElementTree to parse an XML file.

Let's say I have an XML file like this:

<html>
<head>
    <title>Example page</title>
</head>
<body>
    <p>hello this is first paragraph </p>
    <p> hello this is second paragraph</p>
</body>
</html>

Is there any way I can extract the contents of the body with the p tags intact, like this?

desired= "<p>hello this is first paragraph </p> <p> hello this is second paragraph</p>"

Answer 1

The following code does the trick.

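import xml.etree.ElementTree as ET

root = ET.fromstring(doc)  # doc is a string containing the example file
body = root.find('body')
# Iterate over the element directly (getchildren() was removed in Python 3.9);
# encoding='unicode' makes tostring() return str instead of bytes.
desired = ' '.join(ET.tostring(c, encoding='unicode').strip() for c in body)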

Now:

>>> desired
'<p>hello this is first paragraph </p> <p> hello this is second paragraph</p>'
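
To try this end to end, here is a minimal self-contained sketch, assuming Python 3 and the example document held in a string. The names `DOC` and `children_markup` are made up for illustration and are not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# The example document from the question, held in a string (illustrative name).
DOC = """<html>
<head>
    <title>Example page</title>
</head>
<body>
    <p>hello this is first paragraph </p>
    <p> hello this is second paragraph</p>
</body>
</html>"""


def children_markup(parent, sep=" "):
    """Serialize each direct child of `parent`, tags included, joined by `sep`."""
    # encoding="unicode" makes tostring() return str rather than bytes.
    # strip() removes surrounding whitespace, such as the newline and
    # indentation that follow each child in the source document.
    return sep.join(
        ET.tostring(child, encoding="unicode").strip() for child in parent
    )


body = ET.fromstring(DOC).find("body")
desired = children_markup(body)
print(desired)
```

Passing a different `sep` (for example `"\n"`) puts each child on its own line instead.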

Answer 2

You can use the lxml library.

This code should help:

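import lxml.html

# fromstring() parses markup held in a string;
# parse() expects a filename, URL, or file object instead.
htmltree = lxml.html.fromstring('''
<html>
<head>
<title>Example page</title>
</head>
<body>
<p>hello this is first paragraph </p>
<p> hello this is second paragraph</p>
</body>
</html>''')
p_tags = htmltree.xpath('//p')
# text_content() returns each paragraph's text, without the tags
p_content = [p.text_content() for p in p_tags]

print(p_content)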
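
For comparison with what the question asks for, the difference between an element's text and its serialized markup can be sketched with the standard library. This swaps in `xml.etree.ElementTree` for lxml, so it is not lxml's API; the sample markup (including the nested `<b>` tag) and the variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample markup with a nested tag, to make the difference visible.
body = ET.fromstring(
    "<body><p>hello this is <b>first</b> paragraph</p><p>second</p></body>"
)

# itertext() walks all text inside an element, dropping the tags.
texts = ["".join(p.itertext()) for p in body.iter("p")]

# tostring() serializes the element itself, keeping the tags.
markup = [ET.tostring(p, encoding="unicode") for p in body.iter("p")]

print(texts)
print(markup)
```

The first list holds plain text only, while the second keeps the `<p>` and `<b>` tags, which is the form the question is after.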

Answer 3

A slightly different approach from @DavidAlber's, where the children can easily be selected:

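from xml.etree import ElementTree

tree = ElementTree.parse("example.xml")  # file containing the example document
body = tree.findall("./body/p")  # path is relative to the root <html> element

result = []
for elem in body:
    # encoding="unicode" makes tostring() return str instead of bytes
    result.append(ElementTree.tostring(elem, encoding="unicode").strip())

print(" ".join(result))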

Answer 1
1, accepted, 2012-11-16 07:54:25

Answer 2
0, 2012-11-16 07:54:12

Answer 3
0, 2012-11-16 08:09:49
