I am scraping a website, and I need to get the numerical values from this HTML document:
<td>
<span style=" color: red; font-weight: bold;"> 1.950</span>
</td>
<td> 3.400</td>
I need to extract both 1.950 and 3.400, but I can't figure out how to do it when one value sits directly in a <td> while the other is wrapped in a span as well. Is there a general way to get both the parent and the child of the path? I am using the Scrapy framework with the HtmlXPathSelector. I can use the path /td/text() for one and /td/span/text() for the other, but I need to do it in one query. How can this be achieved?
You can try using /td//text() to select every text node that is a descendant of td.
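A minimal sketch of the same idea, written against Python's standard library rather than Scrapy so that it runs on its own. `Element.itertext()` walks every descendant text node of an element, which is the set of nodes that `//text()` selects under it. The wrapping `<tr>` element and the variable names are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The markup from the question, wrapped in a <tr> (an assumption) so that
# it has a single root element and can be parsed as XML.
HTML = """<tr>
<td>
<span style=" color: red; font-weight: bold;"> 1.950</span>
</td>
<td> 3.400</td>
</tr>"""

row = ET.fromstring(HTML)

values = []
for td in row.iter("td"):
    # itertext() yields every descendant text node, like td//text().
    for text in td.itertext():
        text = text.strip()
        if text:  # skip whitespace-only nodes between tags
            values.append(text)

print(values)
```

The whitespace-only nodes are filtered out because the line breaks between tags are text nodes too, and a descendant text query returns them alongside the numbers.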
I think there are two ways to solve this.
One is the XPath axis following-sibling::node(); the other is to iterate over all the td elements (though this could get messy).
Here is an example with XPath:
span = hxs.select("/td/span")
# The text inside the span has no siblings, so step up to the parent td
# and read the text of the td that follows it; this should give 3.400
next_text = span.select("../following-sibling::td/text()")
If you have this XML:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<td>
<span style=" color: red; font-weight: bold;">1.950</span>
</td>
<td>3.400</td>
</root>
and you execute this XPath expression:
//td/following-sibling::node()
you will get the nodes that follow the first td, including the td that contains 3.400
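To see what the sibling axis returns without running Scrapy, here is a rough standard-library sketch. ElementTree's XPath subset does not implement `following-sibling`, so the hypothetical helper `following_siblings` emulates it for element nodes only; the real `following-sibling::node()` would also return the whitespace text nodes between the tags.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<td>
<span style=" color: red; font-weight: bold;">1.950</span>
</td>
<td>3.400</td>
</root>"""

root = ET.fromstring(XML)


def following_siblings(parent, child):
    """Return the element children of `parent` that come after `child`.

    Hypothetical helper: emulates following-sibling::* for elements only.
    """
    children = list(parent)
    return children[children.index(child) + 1:]


first_td = root.find("td")  # the td that wraps the span
siblings = following_siblings(root, first_td)
sibling_texts = [el.text for el in siblings if el.tag == "td"]

print(sibling_texts)
```

This approach only reaches the cells after the first one, so the span value still has to be read separately.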
You can try this:
.select("string()").extract()
It extracts all the text without any HTML tags.
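For comparison, a small standard-library sketch of what `string()` computes for each cell: the concatenation of all descendant text. The helper name `string_value` and the wrapping `<tr>` are assumptions for illustration, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a <tr> (an assumption) to give it a root.
ROW = """<tr>
<td>
<span style=" color: red; font-weight: bold;"> 1.950</span>
</td>
<td> 3.400</td>
</tr>"""


def string_value(element):
    """Concatenate all descendant text, like XPath string(element)."""
    return "".join(element.itertext())


row = ET.fromstring(ROW)

# float() ignores the surrounding whitespace left over from the markup.
numbers = [float(string_value(td)) for td in row.iter("td")]

print(numbers)
```

Taking the string value per cell keeps one result per td whether or not the number is wrapped in a span, which is what the question asks for.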