How to parse text from HTML using lxml?
<p>
Glassware veteran
<strong>Corning </strong>
(
<span class="ticker">
NYSE:
<a class="qsAdd qs-source-isssitthv0000001" href="http://caps.fool.com/Ticker/GLW.aspx?source=isssitthv0000001" data-id="203758">GLW</a>
</span>
<a class="addToWatchListIcon qsAdd qs-source-iwlsitbut0000010" href="http://my.fool.com/watchlist/add?ticker=&source=iwlsitbut0000010" title="Add to My Watchlist"> </a>
) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?
</p>
I want to get "Glassware veteran" and "has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?"
Using this code:
tnode = root.xpath("/p")[0]  # xpath() returns a list; take the first match
content = tnode.text
I can only get "Glassware veteran". Why?
Something like this might get you what you want:
>>> tnode = root.xpath('/p')[0]
>>> content = tnode.xpath('text()')
>>> print(''.join(content))
Glassware veteran
(
) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?
>>>
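The reason `.text` returns only the first piece is that lxml follows the ElementTree model: an element's `.text` holds just the text before its first child, and any text that follows a child element is stored on that child as `.tail`. The sketch below illustrates this on a shortened copy of the paragraph (the shortened `SNIPPET` and the variable names are assumptions made for illustration, not part of the original question).

```python
from lxml import html

# Shortened version of the paragraph from the question (assumed for brevity).
SNIPPET = (
    '<p>Glassware veteran <strong>Corning </strong>('
    '<span class="ticker">NYSE: <a href="#">GLW</a></span>'
    ') has fallen on hard times lately.</p>'
)

p = html.fromstring(SNIPPET)  # for a single-element fragment, this is the <p> itself

first_text = p.text                   # only the text before the first child element
tails = [child.tail for child in p]   # text that follows each direct child of <p>
direct_text = p.xpath('text()')       # every text node that is a direct child of <p>

print(repr(first_text))
print(tails)
print(direct_text)
```

In other words, `text()` in XPath gathers what `.text` plus the children's `.tail` values hold, which is why it returns the text after the closing parenthesis as well.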
If you want all of the text nodes, just use //text() instead of text(). Note that a leading // searches the whole document; write .//text() to restrict the search to the descendants of the current element:
>>> print(' '.join(x.strip() for x in tnode.xpath('//text()') if x.strip()))
Glassware veteran Corning ( NYSE: GLW ) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?
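As a self-contained sketch of the same idea, the snippet below collects every descendant text node and skips whitespace-only pieces so the joined result has single spaces. The helper name `clean_text` and the trimmed `SNIPPET` are hypothetical, introduced only for this example; `text_content()` is the method lxml.html elements provide for getting all nested text in one call.

```python
from lxml import html

# Trimmed copy of the question's markup (attributes simplified; an assumption).
SNIPPET = """<p>
Glassware veteran
<strong>Corning </strong>
(
<span class="ticker">
NYSE:
<a href="#">GLW</a>
</span>
<a href="#" title="Add to My Watchlist"> </a>
) has fallen on hard times lately.
</p>"""

p = html.fromstring(SNIPPET)


def clean_text(element):
    """Join all descendant text nodes, dropping whitespace-only pieces."""
    pieces = (t.strip() for t in element.xpath('.//text()'))
    return ' '.join(piece for piece in pieces if piece)


all_text = clean_text(p)

# Alternative: concatenate all nested text, then collapse runs of whitespace.
same_via_text_content = ' '.join(p.text_content().split())

print(all_text)
print(same_via_text_content)
```

The two approaches agree here because every element boundary in the snippet is separated by whitespace; when markup is packed tightly, `text_content()` concatenates adjacent text without inserting a space, so the results can differ.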