How can one replace an element in lxml?
I get text (data entered by CRM users) from a web service, which returns it in a "terrifying format". I filter it with Python before using the data, but when I remove the line breaks (br), the text that follows them is removed as well.
The code is as follows:
description = '''
<div id="highlight" class="section">
<p>
text...............
</p>
<br>
<h1>TITLE</h1>
<p>Multiple text
<br>
</p>
<ul>
<li>bad layer....</li>
</ul>
<p>
<br>subTitle
</p>
<p> </p>
<p style="text-align: center;">
<br>Text1
<br>Text2
<br>Text3
<br>Text4
<br>Text5
<br>Text6
</p>
<p style="text-align: center;">
<strong>small title</strong>
<br>Text small</p>
<p style="text-align: center;">
<strong>highlighted text</strong>
<br>
<br><strong>Text1</strong>
<br>Text2
<br>Text3
<br>Text4
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>Text1
<br>Text2
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>description
</p>
<p style="text-align: center;">
<br> </p>
<p><strong>description two</strong></p>
<p>
<br> </p>
</div>
'''
from lxml import html

tree = html.fragment_fromstring(description)
for element in tree.xpath('//br'):
    # Removing the br this way also removes the text that follows it:
    # element.getparent().remove(element)
    print(element.text)
    print(element.getparent().getchildren())
    # print(element)
    # print(element.getparent())
    # print(element.getchildren())
    # print(element.getnext())
    # print('--------------------------------')
I tried to remove each br with element.getparent().remove(element), but that also deletes the text. I ran tests to see whether the text belongs to any node, but it does not appear to.
I've thought about replacing each br with an li and turning each styled p into a ul, but I can't work out how to do it. The result would be something like this (for the text above):
..........
..........
<ul>
<li>Text1</li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
<li>Text5</li>
<li>Text6</li>
</ul>
<ul>
<li><strong>small title</strong></li>
<li>Text small</li></ul>
<ul>
<li><strong>highlighted text</strong></li>
<li><strong>Text1</strong></li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>Text1</li>
<li>Text2</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>description</li>
</ul>
<ul>
<li> </li></ul>
........
I can't work out how to get at the text. My idea was to select the styled p nodes with XPath, create li children under a new parent ul from their contents, and then remove the p.
Is this possible?
Thanks and regards.
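The conversion described above could be sketched roughly as follows. This is only an illustration under some assumptions that are not in the original post: the helper names `p_to_ul` and `is_blank` are made up, the `sample` string is a shortened stand-in for `description`, items that end up empty are dropped, and the style attribute is not copied to the ul.

```python
import lxml.etree
import lxml.html

# Shortened stand-in for the description string (assumption, not the real data).
sample = (
    '<div>'
    '<p style="text-align: center;"><strong>small title</strong><br>Text small</p>'
    '<p style="text-align: center;"><br>Text1<br>Text2</p>'
    '<p>untouched</p>'
    '</div>'
)


def is_blank(li):
    """True if the li has no child elements and only whitespace text."""
    return len(li) == 0 and not (li.text or '').strip()


def p_to_ul(p):
    """Hypothetical helper: build a ul whose li items are the chunks of p split at br."""
    ul = p.makeelement('ul')
    li = lxml.etree.SubElement(ul, 'li')
    li.text = p.text  # text before the first child element
    for child in list(p):  # copy the child list, children are moved below
        if child.tag == 'br':
            # A br starts a new item; the text after it lives in br.tail.
            li = lxml.etree.SubElement(ul, 'li')
            li.text = child.tail
        else:
            # Other elements (e.g. strong) move into the current item,
            # taking their tail text with them.
            li.append(child)
    for item in list(ul):
        if is_blank(item):
            ul.remove(item)
    return ul


root = lxml.html.fragment_fromstring(sample)
# Only styled paragraphs that actually contain a br are converted.
for p in root.xpath('.//p[@style][br]'):
    ul = p_to_ul(p)
    if len(ul) == 0:
        continue  # nothing useful in this paragraph, leave it alone
    ul.tail = p.tail  # keep any text that followed the paragraph
    p.getparent().replace(p, ul)

result = lxml.etree.tostring(root, encoding='unicode')
print(result)
```

Paragraphs that contain nothing but a br and whitespace yield no items here and are left untouched; whether to delete them or emit an empty li instead depends on what the output should look like.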
You can use lxml.etree.strip_elements, like so:
import lxml.etree
import lxml.html
tree = lxml.html.fragment_fromstring(description)
lxml.etree.strip_elements(tree, 'br', with_tail=False)
# encoding='unicode' makes tostring return str instead of bytes on Python 3
print(lxml.etree.tostring(tree, pretty_print=True, encoding='unicode'))
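As for why remove() also deletes the text: in lxml the text that follows an element is not a separate node but is stored in that element's tail attribute, so removing the br takes its tail along. The small comparison below illustrates this; the `fragment` string is a shortened stand-in rather than the full description, and the variable names are only for illustration.

```python
import lxml.etree
import lxml.html

# Shortened stand-in for one paragraph of the description (assumption).
fragment = '<p><strong>small title</strong><br>Text small</p>'

tree = lxml.html.fragment_fromstring(fragment)
br = tree.find('br')
br_text = br.text  # a br has no text of its own
br_tail = br.tail  # the text after the br is stored here

# remove() takes the element out together with its tail text.
removed = lxml.html.fragment_fromstring(fragment)
removed.remove(removed.find('br'))
after_remove = lxml.etree.tostring(removed, encoding='unicode')

# strip_elements(..., with_tail=False) deletes the br but keeps the tail text.
stripped = lxml.html.fragment_fromstring(fragment)
lxml.etree.strip_elements(stripped, 'br', with_tail=False)
after_strip = lxml.etree.tostring(stripped, encoding='unicode')

print(repr(br_text), repr(br_tail))
print(after_remove)
print(after_strip)
```

This is also why element.text printed nothing useful in the question's loop: the interesting value is element.tail.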