
Union of node and function on node in XPath

I am using Scrapy to crawl some webpages. I want to write an XPath query that, within a parent <div>, prepends a couple of characters of text to any child <a> nodes while extracting the div's own text nodes normally. Essentially it is like a normal descendant-or-self or // query, just written with | and calling the concat function on the descendants (which, if they exist, will be <a> tags).

These all return a value:

  1. my_div.xpath('div[@class="my_class"]/text()').extract()
  2. my_div.xpath('concat(\'@\', div[@class="my_class"]/a/text())').extract()
  3. my_div.xpath('div[@class="my_class"]/text() | div[@class="my_class"]/a/text()').extract()

However attempting to combine (1) and (2) above in the format of (3):

my_div.xpath('div[@class="my_class"]/text() | concat(\'@\', div[@class="my_class"]/a/text())').extract()

results in the following error:

ValueError: XPath error: Invalid type in div[@class="my_class"]/text() | concat('@', div[@class="my_class"]/a/text())

How do I get XPath to recognize the union of a node with a function called on a node?

I think it doesn't work because concat doesn't actually return a path, and | is used to select multiple paths:

By using the | operator in an XPath expression you can select several paths.

as per http://www.w3schools.com/xsl/xpath_syntax.asp

Why not just split it into two? Generally you use ItemLoaders with your spider. So you can simply add as many paths and/or values as you like.

mil = MyItemLoader(response=response)
mil.add_xpath('name', 'xpath1')
mil.add_xpath('name', 'xpath2')
mil.load_item()
# {'name': ['values_of_xpath1', 'values_of_xpath2']}

If you want to preserve tree order you can try:

nodes = my_div.xpath('div[@class="my_class"]')
text = []
for node in nodes:
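    # default="" avoids a None entry (which would break join) when a node is missing
    text.append(node.xpath("text()").extract_first(default=""))
    text.append(node.xpath("a/text()").extract_first(default=""))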
text = '@'.join(text)

You can probably simplify it with a list comprehension, but you get the idea: extract the nodes, then iterate through them for both values.

In XPath 1.0, a location path returns a node-set. The concat function returns a string. The union operator | computes the union of its operands, which must be node-sets.
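The type rule can be checked directly against lxml, the library that Scrapy's selectors are built on. This is only a minimal sketch, assuming lxml is installed; the FRAGMENT markup and the variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical markup standing in for the crawled page.
FRAGMENT = '<root><div class="my_class">Hello <a>alice</a></div></root>'
root = etree.fromstring(FRAGMENT)

# A location path evaluates to a node-set, which lxml returns as a list.
text_nodes = root.xpath('div[@class="my_class"]/text()')

# concat() evaluates to a single string, not a node-set.
prefixed = root.xpath('concat("@", div[@class="my_class"]/a/text())')

# Mixing the two with | is rejected, because | only accepts node-sets.
union_error = None
try:
    root.xpath(
        'div[@class="my_class"]/text() | '
        'concat("@", div[@class="my_class"]/a/text())'
    )
except etree.XPathError as exc:
    union_error = exc

print(type(text_nodes).__name__, repr(prefixed), union_error)
```

Scrapy surfaces the same failure as the ValueError shown in the question, so the prefixing has to happen outside the XPath expression.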

Update: this is what I did:

item['div_text'] = []
div_nodes = definition.xpath('div[@class="my_class"]/a | div[@class="my_class"]/text()')
for n in div_nodes:
    if n.xpath('self::a'):
        item['div_text'].append("@%s" % n.xpath('text()').extract_first())
    else:
        item['div_text'].append(n.extract())
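The same document-order walk can be tried without a Scrapy response. The following is a rough sketch that swaps in the standard library's xml.etree.ElementTree for Scrapy selectors; SAMPLE and div_text_with_prefix are hypothetical names, not part of the original spider. ElementTree exposes the text around child elements as .text and .tail rather than as text nodes, so the loop looks different but visits the pieces in the same order.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the crawled page.
SAMPLE = (
    '<div class="my_class">Hello <a href="#">alice</a>'
    ' and <a href="#">bob</a>!</div>'
)


def div_text_with_prefix(div, prefix="@"):
    """Collect the div's own text and its <a> children in document order,
    prepending `prefix` to the text of each <a>."""
    parts = []
    if div.text:                      # text before the first child
        parts.append(div.text)
    for child in div:
        if child.tag == "a":
            parts.append(prefix + (child.text or ""))
        if child.tail:                # div text that follows this child
            parts.append(child.tail)
    return parts


div = ET.fromstring(SAMPLE)
result = div_text_with_prefix(div)
print(result)
print("".join(result))
```

Keeping the pieces in a list, as the update above does with item['div_text'], leaves the choice of joining or further cleaning to a later step.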

