myList = tree.xpath('//div[@id="RM1127"]/div[@class="moreInfo"]/text()')
I'm scraping a website for the text inside this div. It works fine, but in this one div there's a <br> tag, so myList returns the text of that div as two separate elements.
<div class="moreInfo" style="display:none;font-weight:normal; font-size:14px; margin-top:6px; padding:0px 0 0 30px;">
Over ½ lb. of jumbo shrimp fried golden crisp in a…
<br></br>
coleslaw, cocktail & Tartar sauce. …
</div>
The html looks like this. Instead of having 'Over ½ lb. of jumbo shrimp fried golden crisp in a' and 'coleslaw, cocktail & Tartar sauce' together as one element, I'm getting them both as separate elements in an array.
With Python and lxml, just call HtmlElement.text_content() on the element. Here is a full example:
import lxml.html
html ="""<!DOCTYPE html>
<html>
<body>
<div id="RM1127">
<div class="moreInfo" style="">
Over 1/2 lb. of jumbo shrimp fried golden crisp in a...
<br>
coleslaw, cocktail & Tartar sauce
</div>
</div>
</body>
</html>"""
dom = lxml.html.fromstring(html)
tags = dom.xpath("""//div[@id="RM1127"]/div[@class="moreInfo"]""")
for e in tags:
    # text_content() joins the text on both sides of the <br> into one string
    print(e.text_content())
From the lxml documentation for text_content():
Returns the text content of the element, including the text content of its children, with no markup.
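Note that text_content() keeps the line breaks that surround the <br> in the source, so the result is one string but still contains newlines. If a single clean line is wanted, one possible approach (a minimal sketch, not part of the original answer; the variable names `div` and `description` are only illustrative) is to collapse the whitespace with split() and join():

```python
import lxml.html

html = """<!DOCTYPE html>
<html>
<body>
<div id="RM1127">
<div class="moreInfo" style="">
Over 1/2 lb. of jumbo shrimp fried golden crisp in a...
<br>
coleslaw, cocktail &amp; Tartar sauce
</div>
</div>
</body>
</html>"""

dom = lxml.html.fromstring(html)
div = dom.xpath('//div[@id="RM1127"]/div[@class="moreInfo"]')[0]

# split() with no arguments splits on any run of whitespace (including
# newlines), so joining with a single space yields one clean line.
description = " ".join(div.text_content().split())
print(description)
```

This should leave the two halves of the description in one string, separated by a single space.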
Try the following XPath expression:
string(//div[@id="RM1127"]/div[@class="moreInfo"])
When applied to a node-set, the XPath string() function returns the string-value of the node that comes first in document order. The string-value of an element node is the concatenation of the string-values of all its text node descendants, so the result is a single string rather than a list of text nodes.
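As a rough sketch of how this expression could be used from lxml (assuming the same sample markup as in the other answer; the variable names `joined` and `normalized` are only illustrative), xpath() returns a string instead of a list when the expression is wrapped in string(). The standard XPath 1.0 function normalize-space() can be used the same way if the surrounding newlines should be collapsed as well:

```python
import lxml.html

html = """<!DOCTYPE html>
<html>
<body>
<div id="RM1127">
<div class="moreInfo" style="">
Over 1/2 lb. of jumbo shrimp fried golden crisp in a...
<br>
coleslaw, cocktail &amp; Tartar sauce
</div>
</div>
</body>
</html>"""

dom = lxml.html.fromstring(html)

# string() yields one string for the first matching div, newlines included.
joined = dom.xpath('string(//div[@id="RM1127"]/div[@class="moreInfo"])')

# normalize-space() does the same, but also trims the ends and collapses
# every run of whitespace into a single space.
normalized = dom.xpath(
    'normalize-space(//div[@id="RM1127"]/div[@class="moreInfo"])'
)

print(repr(joined))
print(normalized)
```

Because string() only looks at the first node in the node-set, this form fits best when the expression is expected to match exactly one div; for several matches, looping over the elements and calling text_content() on each is probably simpler.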