I am using Scrapy to crawl some webpages. I want to write an XPath query that will, within a parent <div>, prepend a couple of characters of text to any child <a> nodes, while extracting the text of the div's self node normally. Essentially it is like a normal descendant-or-self or // query, just written with | and calling the concat function on the descendants (which, if they exist, will be <a> tags).
These all return a value:
my_div.xpath('div[@class="my_class"]/text()').extract()  # (1)
my_div.xpath('concat(\'@\', div[@class="my_class"]/a/text())').extract()  # (2)
my_div.xpath('div[@class="my_class"]/text() | div[@class="my_class"]/a/text()').extract()  # (3)
However attempting to combine (1) and (2) above in the format of (3):
my_div.xpath('div[@class="my_class"]/text() | concat(\'@\', div[@class="my_class"]/a/text())').extract()
results in the following error:
ValueError: XPath error: Invalid type in div[@class="my_class"]/text() | concat('@', div[@class="my_class"]/a/text())
How do I get XPath to recognize the union of a node with a function called on a node?
I think it doesn't work because concat doesn't actually return a path, and | is used to select multiple paths:
"By using the | operator in an XPath expression you can select several paths."
(as per http://www.w3schools.com/xsl/xpath_syntax.asp)
Why not just split it into two? Generally you use ItemLoaders with your spider. So you can simply add as many paths and/or values as you like.
mil = MyItemLoader(response=response)
mil.add_xpath('name', 'xpath1')
mil.add_xpath('name', 'xpath2')
mil.load_item()
# {'name': ['values_of_xpath1', 'values_of_xpath2']}
If you want to preserve tree order you can try:
nodes = my_div.xpath('div[@class="my_class"]')
text = []
for node in nodes:
    # default='' avoids None entries, which would make join() raise a TypeError
    text.append(node.xpath("text()").extract_first(default=''))
    text.append(node.xpath("a/text()").extract_first(default=''))
text = '@'.join(text)
You can probably simplify it with a list comprehension, but you get the idea: extract the nodes, then iterate through them to pull out both values.
In XPath 1.0, a location path returns a node-set. The concat function returns a string. The union operator | computes the union of its operands, which must be node-sets.
Update: this is what I did:
item['div_text'] = []
div_nodes = definition.xpath('div[@class="my_class"]/a | div[@class="my_class"]/text()')
for n in div_nodes:
    if n.xpath('self::a'):
        item['div_text'].append("@%s" % n.xpath('text()').extract_first())
    else:
        item['div_text'].append(n.extract())