Get both parent and child text with Xpath (HtmlXPathSelector)

Question

I am scraping a website, and I need to get the numerical values from this HTML document:

<td>
<span style=" color: red; font-weight: bold;"> 1.950</span>
</td>
<td> 3.400</td>

I need to extract both 1.950 and 3.400, but I can't figure out how to do it when one value sits directly in a <td> while the other is wrapped in a span as well. Is there a general way to get the text of both the parent and the child of the path? I am using the Scrapy framework with the HtmlXPathSelector. I can use the path /td/text() for one and /td/span/text() for the other, but I need to do it in one query. How can this be achieved?

Answer 1

You can try using /td//text() to select every text node that is a descendant of td.
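To illustrate what selecting every descendant text node amounts to, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree rather than Scrapy, so it runs without extra dependencies. The wrapping <tr> element and the names SAMPLE and descendant_texts are assumptions added for the example. With the old Scrapy selector from the question, the corresponding call would presumably be hxs.select("//td//text()").extract().

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a <tr> (an assumption) so that it
# parses as a single XML tree.
SAMPLE = """<tr>
<td>
<span style=" color: red; font-weight: bold;"> 1.950</span>
</td>
<td> 3.400</td>
</tr>"""


def descendant_texts(markup):
    """Return the non-blank text found at any depth below each <td>."""
    root = ET.fromstring(markup)
    values = []
    for td in root.iter("td"):
        # itertext() yields the text of td and of all its descendants in
        # document order, which mirrors what //td//text() selects.
        for text in td.itertext():
            text = text.strip()
            if text:  # skip whitespace-only nodes between the tags
                values.append(text)
    return values


values = descendant_texts(SAMPLE)
print(values)
```

Note that the whitespace-only text nodes around the span are also matched by //text(), so some stripping or filtering is usually needed afterwards.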

Answer 2

I think there are two ways to solve this.

One is with the XPath axis

following-sibling::node()

and the other is to iterate over all the tds (but this could get nasty).

Here is an example with XPath:

span = hxs.select("//td/span")
# step up to the enclosing td, then take the text of the td that follows it
next_text = span.select("../following-sibling::td[1]/text()").extract()  # you should get 3.400 (or something along these lines)

If you have this XML:

<?xml version="1.0" encoding="UTF-8"?>

<root>
  <td> 
    <span style=" color: red; font-weight: bold;">1.950</span> 
  </td>
  <td>3.400</td>
</root>

and you execute this XPath expression:

//td/following-sibling::node()

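you will get the second td, whose text is 3.400 (together with the whitespace-only text nodes that sit between the elements).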

this is a good place to test xpath

Answer 3

You can try this:

.select("string()").extract()

It will extract all the text without any HTML tags.
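The difference from selecting individual text nodes is that string() gives one concatenated string per node. A rough standard-library sketch of that behaviour, again using xml.etree.ElementTree instead of Scrapy; the wrapping <tr> and the names SAMPLE and cell_strings are assumptions made for the example:

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a <tr> (an assumption) so that it
# parses as a single XML tree.
SAMPLE = """<tr>
<td>
<span style=" color: red; font-weight: bold;"> 1.950</span>
</td>
<td> 3.400</td>
</tr>"""


def cell_strings(markup):
    """Return one string per <td>: all of its descendant text, joined."""
    root = ET.fromstring(markup)
    # Joining itertext() approximates the XPath string value of each td.
    return ["".join(td.itertext()).strip() for td in root.iter("td")]


strings = cell_strings(SAMPLE)
numbers = [float(s) for s in strings]  # convert the cell text to numbers
print(strings)
print(numbers)
```

Because each cell yields exactly one string, the result lines up one-to-one with the table cells, which makes the conversion to numbers straightforward.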

solution1
4 ACCEPTED 2013-01-12 23:46:43

solution2
2 2013-01-12 23:50:15

solution3
1 2013-01-14 08:33:32
