XPATH - how to get inner text data littered with <br> tags?
I have HTML text like this:
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
data
</othertag>
<moretag>
data
</moretag>
I'm trying to query it with the following XPath:
//p//text() | //othertag//text() | //moretag//text()
which gives me text that is broken at each <br> tag, like this:
('This is some important data','Even this is data','this is useful too','othetag text data','moretag text data')
I'd like it as a complete string,
('This is some important data Even this is data this is useful too')
because I'll be querying other elements using the | union XPath operator, and it's very important that this text content is properly divided.
How can I do this?
If this is impossible, can I at least get the inner HTML of <p> somehow, so that I can store it textually as
This is some important data<br>Even this is data<br>this is useful too
I'm using lxml.html in Python 2.7.
Update
Based on your edit, maybe you can use the XPath string() function. For example:
>>> doc.xpath('string(//p)')
'\n This is some important data\n \n Even this is data\n \n this is useful too\n '
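Since the question also needs the text of each element kept separate, one possible variation is to select the elements themselves with the union and then evaluate the standard XPath 1.0 function normalize-space() on each one. This is only a sketch: the inline raw_doc string and the html/body wrapper are assumptions made for illustration, and the code is written so that it also runs on Python 3.

```python
import lxml.html

# Sample markup from the question, wrapped in html/body for illustration.
raw_doc = (
    '<html><body>'
    '<p>\nThis is some important data\n<br>\n'
    'Even this is data\n<br>\nthis is useful too\n</p>'
    '<othertag>\ndata\n</othertag>'
    '<moretag>\ndata\n</moretag>'
    '</body></html>'
)
doc = lxml.html.fromstring(raw_doc)

# Select the elements with the union, not their individual text nodes.
elements = doc.xpath('//p | //othertag | //moretag')

# normalize-space(.) takes the element's full string value, trims it,
# and collapses every run of whitespace into a single space.
texts = [el.xpath('normalize-space(.)') for el in elements]
print(texts)
```

Each entry in texts then corresponds to exactly one matched element, so the <br> tags no longer split the paragraph into several results.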
(original answer follows)
If you're getting back the text you want in multiple pieces:
('This is some important data','Even this is data','this is useful too')
Why not just join those pieces?
>>> doc.xpath('//p/text()')
['\n This is some important data\n ', '\n Even this is data\n ', '\n this is useful too\n ']
>>> ' '.join(doc.xpath('//p/text()'))
'\n This is some important data\n  \n Even this is data\n  \n this is useful too\n '
You can even get rid of the line breaks:
>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'
If you wanted the "inner html" of the p element, you could call lxml.etree.tostring on all of its children:
>>> ''.join(etree.tostring(x) for x in doc.xpath('//p')[0].getchildren())
'<br/>\n Even this is data\n <br/>\n this is useful too\n '
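Note that the output above starts at the first <br/>: the text before the first child lives in the element's .text attribute and is not part of any child. A minimal sketch that includes it follows. The helper name inner_html is hypothetical, and the sample markup is written on one line so the result is easy to read; method='html' and encoding='unicode' are passed to lxml.etree.tostring so that <br> is serialized HTML-style and a text string is returned.

```python
import lxml.html
from lxml import etree

raw_doc = ('<p>This is some important data<br>'
           'Even this is data<br>this is useful too</p>')
p = lxml.html.fromstring(raw_doc)

def inner_html(element):
    """Hypothetical helper: serialize everything inside element."""
    # Text before the first child is stored on the element itself.
    parts = [element.text or '']
    for child in element:
        # tostring() includes each child's tail text by default.
        parts.append(etree.tostring(child, method='html', encoding='unicode'))
    return ''.join(parts)

html = inner_html(p)
print(html)
```

This should give the form asked for in the question, with the <br> tags kept inline.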
NB: All of these examples assume:
>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
... parser=etree.HTMLParser())
You can also expose your own functions in XPath:
import lxml.html, lxml.etree
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''
doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)
def cat(context, a):
    return [''.join(a)]
ns['cat'] = cat
print repr(doc.xpath('cat(//p/text())'))
which prints
'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'
You can perform the transformations however you like using this method.
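As one illustration of such a transformation, the sketch below registers a function that strips each text node and joins the pieces with single spaces, returning a plain string rather than a list. The function name joined is hypothetical, and the code is written so that it also runs on Python 3.

```python
import lxml.etree
import lxml.html

doc = lxml.html.fromstring(
    '<p>\nThis is some important data\n<br>\n'
    'Even this is data\n<br>\nthis is useful too\n</p>'
)

def joined(context, nodes):
    # nodes is the list of text nodes matched by the XPath argument.
    # Strip each piece, drop empty ones, and join with single spaces.
    return ' '.join(s.strip() for s in nodes if s.strip())

# Register the function in the default (empty) function namespace.
ns = lxml.etree.FunctionNamespace(None)
ns['joined'] = joined

result = doc.xpath('joined(//p/text())')
print(result)
```

Because the function returns a string, the XPath call evaluates to a single string value.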