
Get both parent and child text with Xpath (HtmlXPathSelector)

I am scraping a website, and I need to get the numerical values from this HTML document:

<td>
<span style=" color: red; font-weight: bold;"> 1.950</span>
</td>
<td> 3.400</td>

I need to extract both 1.950 and 3.400, but I can't figure out how to do it when one value sits directly in a td while the other is wrapped in a span as well. Is there a general way to get the text of both the parent and the child of the path? I am using the scrapy framework with the HtmlXPathSelector. I can use the path /td/text() for one and /td/span/text() for the other, but I need to do it in one query. How can this be achieved?

You can try using /td//text() to select every text node that is a descendant of td.
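
To illustrate what selecting every descendant text node of a td returns, here is a minimal sketch that uses only Python's standard library. ElementTree's XPath subset has no text() node test, so itertext() stands in for //text() here. The wrapping root element and the names DOC and td_values are assumptions made for this example. In Scrapy itself the call would presumably look like hxs.select("//td//text()").extract().

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a hypothetical <root> so it parses as XML.
DOC = """<root>
<td>
<span style=" color: red; font-weight: bold;"> 1.950</span>
</td>
<td> 3.400</td>
</root>"""


def td_values(markup):
    """Collect the non-blank descendant text of every <td>, similar to //td//text()."""
    root = ET.fromstring(markup)
    values = []
    for td in root.iter("td"):
        # itertext() walks all text inside the td, including text nested in the span.
        for text in td.itertext():
            text = text.strip()
            if text:  # skip whitespace-only nodes between tags
                values.append(text)
    return values


values = td_values(DOC)
print(values)
```

Note that the descendant query also returns the whitespace-only text nodes around the span, which is why the sketch strips and filters them.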

I think there are two ways to solve this.

With XPath:

following-sibling::node()

The other is to iterate over all the tds (but this could get messy).

Here is an example with XPath:

span_text = hxs.select("//td/span/text()")
# evaluate following-sibling from the td that holds the span; the text node inside the span has no siblings
next_text = hxs.select("//td[span]/following-sibling::td/text()")  # you should get 3.400 (or something along these lines)

If you have this XML:

<?xml version="1.0" encoding="UTF-8"?>

<root>
  <td> 
    <span style=" color: red; font-weight: bold;">1.950</span> 
  </td>
  <td>3.400</td>
</root>

and you evaluate this XPath expression:

//td/following-sibling::node()

you will get the second td, which contains 3.400 (along with the whitespace text nodes between the elements).

this is a good place to test xpath

You can try this on each td selector:

.select("string()").extract()

It extracts all the text inside the node, with the HTML tags stripped.
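
As a rough illustration of why this works: the XPath string value of an element is the concatenation of all its descendant text nodes in document order, so it does not matter whether the number sits directly in the td or inside a span. The sketch below reproduces that behaviour with the standard library only. The tr wrapper and the names ROW and string_value are assumptions for this example, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# The two cells from the question, wrapped in a hypothetical <tr> so they parse as one document.
ROW = ET.fromstring(
    '<tr><td><span style="color: red; font-weight: bold;"> 1.950</span></td>'
    "<td> 3.400</td></tr>"
)


def string_value(element):
    """Mimic XPath string(): join every descendant text node in document order."""
    return "".join(element.itertext())


# Evaluate the string value once per td, then trim the surrounding whitespace.
cells = [string_value(td).strip() for td in ROW.findall("td")]
print(cells)
```

Since string() with no argument works on the context node, it has to be evaluated once per td, which is what the list comprehension does.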
