
How to get all XPaths that match a given regex?

Is there a Python library that returns the XPaths of DOM nodes matching a given regex?

I am trying to fetch question-and-answer pairs from a FAQ page.

These are three different XPaths of questions from this site:

xpath1: /html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[1]/div/div[7]/div[1]/a/span
xpath2: /html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[1]/div/div[10]/div[1]/a/span
xpath3: /html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[3]/div[1]/div[1]/div[1]/a/span

Now let the regex be something like this:

/html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/ * / * / * /div[1]/a/span

Is it possible to get all XPaths that satisfy such a regex through some Python library?

I tried using Scrapy selectors to fetch all the questions, but that fails when fetching the answers. So I want to go through all the questions and then fetch their answers, and for this I need the question XPaths.

You don't need a tool or a regex (or absolute XPath expressions). Try the XPath below to match all questions on the page:

//div[@class="ClsInnerDrop"]/a

If you don't know how to write your own selectors, check this cheatsheet
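
As a minimal sketch of how that relative XPath could be applied with lxml and turned into absolute paths, assuming a page shaped roughly like the miniature `SAMPLE` below (the markup is hypothetical; only the `ClsInnerDrop` class name comes from the answer above):

```python
from lxml import html

# Hypothetical miniature FAQ page, used only for illustration.
SAMPLE = """
<html><body>
  <div class="faq">
    <div class="ClsInnerDrop"><a href="#q1"><span>First question?</span></a></div>
    <div class="answer">First answer.</div>
    <div class="ClsInnerDrop"><a href="#q2"><span>Second question?</span></a></div>
    <div class="answer">Second answer.</div>
  </div>
</body></html>
"""

root = html.fromstring(SAMPLE)
tree = root.getroottree()

# Relative XPath from the answer: every <a> directly inside a ClsInnerDrop div.
links = root.xpath('//div[@class="ClsInnerDrop"]/a')

# getpath() converts each matched element into its absolute XPath.
question_paths = [tree.getpath(a) for a in links]
question_texts = [a.text_content().strip() for a in links]

for path, text in zip(question_paths, question_texts):
    print(path, "->", text)
```

Each absolute path can be fed back into `tree.xpath(path)` later, for example to locate the sibling element holding the answer.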

Finally, I found a solution using a combination of lxml and Scrapy. I used @Andersson's answer to find all the text content with the selector, and then, for each text, iterated over the tree and used tree.getpath() from lxml.

The solution is not regex-based, but it solved my use case, so I am posting it:

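import requests
from lxml import html

def get_xpath_for_text(tree, text):
    """Return the absolute XPath of the first element whose text equals `text`."""
    for tag in tree.iter():
        if tag.text and tag.text == text:
            return tree.getpath(tag)
    # No element with exactly this text was found
    return None

url = "https://example.com/faq"   # placeholder: URL of the FAQ page
text = "Example question text"    # placeholder: a question string returned by the selector

webpage = requests.get(url)
html_content = html.fromstring(webpage.text)
tree = html_content.getroottree()
print(get_xpath_for_text(tree, text))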
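
For a literally regex-based variant, one possible sketch is to compute `getpath()` for every element and keep the paths that fully match a compiled pattern. The helper name `xpaths_matching`, the sample markup, and the pattern are all assumptions made for illustration. Note that `getpath()` omits the index when an element has no same-named siblings, so the pattern has to treat `[n]` as optional.

```python
import re
from lxml import html

def xpaths_matching(tree, pattern):
    """Return absolute XPaths of all elements whose getpath() fully matches `pattern`."""
    regex = re.compile(pattern)
    paths = []
    for element in tree.iter():
        # Skip comments and processing instructions, whose .tag is not a string.
        if not isinstance(element.tag, str):
            continue
        path = tree.getpath(element)
        if regex.fullmatch(path):
            paths.append(path)
    return paths

# Hypothetical miniature FAQ page, used only for illustration.
SAMPLE = """
<html><body>
  <div class="faq">
    <div class="ClsInnerDrop"><a href="#q1"><span>First question?</span></a></div>
    <div class="answer">First answer.</div>
    <div class="ClsInnerDrop"><a href="#q2"><span>Second question?</span></a></div>
    <div class="answer">Second answer.</div>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE).getroottree()

# Regex over path strings: any child div (index optional), then /a/span.
pattern = r"/html/body/div/div(\[\d+\])?/a/span"
matches = xpaths_matching(tree, pattern)

# XPath's own '*' wildcard step expresses the same idea without a regex.
wildcard_matches = [tree.getpath(e) for e in tree.xpath("/html/body/div/*/a/span")]

print(matches)
```

In this sample both approaches select the same two `span` elements, which suggests that a pattern built from `*` steps, like the one in the question, can often be evaluated directly as XPath.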

