Get Element By Xpath Loop Error Selenium Python

I'm trying to build a web scraper for Pinterest. I can get almost all the data, but each pin has a "see more" button that reveals two more fields: 'board name' and 'author name'.

Logic:

  1. Save all the button elements in an array
  2. Loop through them and click each button
  3. Get the total number of pins on the page
  4. Loop over the number of pins and find each 'board name' by incrementing the index in the XPath

Button Click Loop Code:

moreButtons = driver.find_elements_by_xpath('//button[@data-test-id="seemoretoggle"]')
for moreBtn in moreButtons:
    moreBtn.click()

source_data = driver.page_source

Get Board Name Code:

# Pin Length - Total Pins
total_pins = []
total_pins = driver.find_elements_by_class_name("Grid__Item")

# Pin Board Names
i = 1
while i <= len(total_pins):
    temp_xpath = "/html/body/div[1]/div[1]/div[1]/div/div/div/div/div[1]/div/div/div/div[" + str(i) + "]/div/div/div[2]/div[2]/h4/a[1]"
    temp = driver.find_element_by_xpath(temp_xpath)
    #pin_Board_Names.append(temp)
    print(temp.text)
    i += 1

Kind of works, partially:

Just old
Tiny House interior
SimpleLivingMama.com
Traceback (most recent call last):
  File "scrape.py", line 109, in <module>
    main()
  File "scrape.py", line 106, in main
    grab(args.url, args.fname)
  File "scrape.py", line 91, in grab
    temp = driver.find_element_by_xpath(temp_xpath)
  File "C:\Users\da74\AppData\Roaming\Python\Python36\site-packages\selenium\webdriver\remote\webdriver.py", line 393, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "C:\Users\da74\AppData\Roaming\Python\Python36\site-packages\selenium\webdriver\remote\webdriver.py", line 966, in find_element
    'value': value})['value']
  File "C:\Users\da74\AppData\Roaming\Python\Python36\site-packages\selenium\webdriver\remote\webdriver.py", line 320, in execute
    self.error_handler.check_response(response)
  File "C:\Users\da74\AppData\Roaming\Python\Python36\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: {"errorMessage":"Unable to find element with xpath '/html/body/div[1]/div[1]/div[1]/div/div/div/div/div[1]/div/div/div/div[4]/div/div/div[2]/div[2]/h4/a[1]'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"187","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:57743","User-Agent":"selenium/3.13.0 (python windows)"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"xpath\", \"value\": \"/html/body/div[1]/div[1]/div[1]/div/div/div/div/div[1]/div/div/div/div[4]/div/div/div[2]/div[2]/h4/a[1]\", \"sessionId\": \"a8cdaa10-a2d3-11e8-86db-a3b39599a684\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/a8cdaa10-a2d3-11e8-86db-a3b39599a684/element"}}
Screenshot: available via screen

It found 3 board names, but then it ends with an error. I tried editing the loop and the button click, but both seem to work. Does anyone know what is causing this, or have suggestions on what to explore?

Edit 1: The error says it cannot find the element by XPath, but the element is there on the webpage.

Edit 2: Added try/except to check. Here is the code:

try:
    temp = driver.find_element_by_xpath(temp_xpath)
except:
    print('no element at pin number: ' + str(i))

with output:

Just old
Tiny House interior
SimpleLivingMama.com
no element at pin number: 4
SimpleLivingMama.com
Books for Pre-Schoolers
Stuff to Try
Baby & Toddler Milestones
Toys For Boys & Girls
House
OT
Make Extra Money
Shoes
Old photos
Crafts
for baby
There's A Book About That
Geek
Real DIY
Recycle & Repurpose
Crafts
Preschool Activities
Wild West Project
#BossMoms
no element at pin number: 24
#BossMoms
Crazy for DIY
Money Saving Tips
Painting Furniture
The home I want
screen door ideas
DIY Home
Little girl rooms
Container Home Desing
Bentley Joseph Adams
some truth bombs
New house!
Advice and Wisdom-Words
no element at pin number: 37
Advice and Wisdom-Words
House ideas
Houses
no element at pin number: 40
Houses
no element at pin number: 41
Houses
Fine Motor Activities for Kids
crafts
decorating ideas
mama
Barn Homes
For the Home
no element at pin number: 48
For the Home

I checked the pin numbers where nothing was found, but the board name is there on the webpage.

Edit 3: Noticed that after pin number 47 it always reports that no element was found, no matter how long the list is. Also checked that all the button elements are present in moreButtons and that they're valid.

Thanks in advance for your help.

With help from @AnkDasCo in the comments, I found a solution. There were 2 problems here:

  1. Pinterest uses 2 different XPaths for the same element. In some places it creates 2 divs instead of just 1 for the same element.
  2. The webdriver needs some time to locate the element. Although the driver waits for the page to load by default, adding a wait to the webdriver makes it keep trying to find the element for a set time before moving on. This is similar to time.sleep(), except that a webdriver wait stops as soon as the element is found instead of always pausing for the full duration.
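The polling idea behind such a wait can be sketched in plain Python. This is only an illustration, not Selenium's implementation: wait_for and its timeout and poll parameters are hypothetical names, and find stands for any lookup that raises while the element is missing.

```python
import time


def wait_for(find, timeout=10.0, poll=0.5):
    """Call find() until it returns without raising, or until timeout seconds pass.

    Hypothetical helper: unlike a fixed time.sleep(), it returns as soon as
    the lookup succeeds and only waits the full timeout when it keeps failing.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return find()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # give up and surface the last lookup error
            time.sleep(poll)


# Possible use with the question's variables (assumes driver and temp_xpath exist):
# temp = wait_for(lambda: driver.find_element_by_xpath(temp_xpath), timeout=5)
```

Selenium ships its own versions of this: driver.implicitly_wait(seconds) applies it to every lookup, and WebDriverWait(driver, seconds).until(...) applies it to a single condition.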

XPaths: the following are the 2 XPaths for the same item:

  1. /html/body/div[1]/div[1]/div[1]/div/div/div/div/div[1]/div/div/div/div[4]/div/div/div[2]/div/h4/a[1]
  2. /html/body/div[1]/div[1]/div[1]/div/div/div/div/div[1]/div/div/div/div[1]/div/div/div[2]/div[2]/h4/a[1]

Note that the step just before h4 differs: it is div in the first XPath and div[2] in the second.
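To keep the two variants in one place, both candidate XPaths for a given pin can be built by a small helper. This is a sketch under the assumption that only the step before h4 varies between pins; PIN_PREFIX and board_name_xpaths are hypothetical names introduced here.

```python
# Common prefix of both XPaths above; {i} is the 1-based pin index.
PIN_PREFIX = (
    "/html/body/div[1]/div[1]/div[1]/div/div/div/div/div[1]"
    "/div/div/div/div[{i}]/div/div/div[2]"
)


def board_name_xpaths(i):
    """Return both known XPath variants for the board name of pin number i."""
    prefix = PIN_PREFIX.format(i=i)
    return [
        prefix + "/div[2]/h4/a[1]",  # layout with two divs
        prefix + "/div/h4/a[1]",     # layout with a single div
    ]
```

A lookup loop can then try each returned XPath in order and keep the first one that matches.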

Working Code

    driver = webdriver.PhantomJS(executable_path='phantomjs.exe')
    print("Ghost Headless Driver Invoked")
    # driver.implicitly_wait(5) # if element not found, wait for (seconds) before next operation
    driver.get(url) # grab the url

    # Scrolling till the end of page
    print("Started Scrolling ... ")
    match=True # set to False to enable scrolling to the bottom of the page
    lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    while(match==False):
        lastCount = lenOfPage
        lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
        if lastCount==lenOfPage:
            match=True

    source_data = driver.page_source # page source code as html

    # Get all pins , number of pins collected
    total_pins = []
    try:
        total_pins = driver.find_elements_by_class_name("Grid__Item")
    except:
        print("Unable to load pins")
    print("Total Pins: " + str(len(total_pins)))

    # get number of 'see more' buttons collected - for error checking
    moreButtons = driver.find_elements_by_xpath('//button[@data-test-id="seemoretoggle"]')
    print("Dynamic Elements: " + str(len(moreButtons)))
    print("Display: Dynamic Elements ... ")

    # clicking all 'See More' buttons
    i = 0
    while i <= (len(moreButtons) - 1):
        moreButtons[i].click()
        i += 1

    # Pin Board Names
    print("Extracting Board Names ... ")
    i = 1
    while i <= len(total_pins):
        successful = False # reset for every pin
        pin_xpath = "/html/body/div[1]/div[1]/div[1]/div/div/div/div/div[1]/div/div/div/div[" + str(i) + "]/div/div/div[2]"
        try:
            # first layout: board link under div[2]
            temp = driver.find_element_by_xpath(pin_xpath + "/div[2]/h4/a[1]")
            successful = True
        except Exception:
            try:
                # second layout: board link under a single div
                temp = driver.find_element_by_xpath(pin_xpath + "/div/h4/a[1]")
                successful = True
            except Exception:
                pass # neither layout matched, reported below
        if successful:
            pin_Board_Names.append(temp.text)
            # print("Board_No: " + str(i) + " > " + temp.text)
        else:
            print("Board_No: " + str(i) + " not found!")
        i += 1

    # quit driver
    driver.quit()
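An alternative that is not part of the solution above is to search inside each pin element with a relative XPath, so the lookup no longer depends on the absolute path or on how many wrapper divs a pin has. The sketch below assumes the board link is always an a inside an h4 within the pin, as in both XPaths above. board_names and link_xpath are hypothetical names.

```python
def board_names(pins, link_xpath=".//h4/a[1]"):
    """Collect one board name per pin, or None where no link is found.

    pins is a list of Selenium WebElement objects (for example the
    Grid__Item elements). The string "xpath" is the value of By.XPATH.
    """
    names = []
    for pin in pins:
        # find_elements returns an empty list instead of raising
        links = pin.find_elements("xpath", link_xpath)
        names.append(links[0].text if links else None)
    return names


# Possible use with the working code above (assumes total_pins is populated):
# pin_Board_Names = board_names(total_pins)
```

Because find_elements returns an empty list when nothing matches, no try/except is needed, and a missing board name shows up as None at the matching position.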
