
Handling timeout with Selenium and Python

Can anybody help me with this? I have written a script to scrape articles from a Chinese news site using Selenium. Because many of the URLs do not load, I tried to include code to catch timeout exceptions. That works, but then the browser seems to stay on the page that timed out while loading, rather than moving on to try the next URL.

I've tried adding driver.quit() and driver.close() after handling the error, but then the driver no longer works when the loop continues to the next URL.
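For reference, the driver setup is not shown below; something like the following is assumed (Chrome and the 30-second value are placeholders), since a page load timeout is what makes driver.get() raise the TimeoutException the code catches:

from selenium import webdriver

# a page load timeout makes driver.get() raise TimeoutException on
# slow pages instead of blocking indefinitely
driver = webdriver.Chrome()
driver.set_page_load_timeout(30)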

import os
import re

from selenium.common.exceptions import (NoSuchElementException,
                                        TimeoutException,
                                        WebDriverException)

results = []

with open('url_list_XB.txt', 'r') as f:
    url_list = f.readlines()

for idx, url in enumerate(url_list):
    url = url.strip()  # readlines() keeps the trailing newline
    status = str(idx) + " " + url
    print(status)

    try:
        driver.get(url)
        try:
            # find the share widget that holds the discussion-page link
            tblnks = driver.find_elements_by_class_name("post_topshare_wrap")
            for a in tblnks:
                html = a.get_attribute('innerHTML')
                try:
                    # pull the comment-page URL out of the widget's HTML
                    link = re.findall('href="http://comment(.+?)" title', str(html))[0]
                    tb_link = 'http://comment' + link
                    print(tb_link)
                    ID = tb_link.replace("http://comment.tie.163.com/", "").replace(".html", "")
                    print(ID)
                    with open('tb_links.txt', 'a') as p:
                        p.write(tb_link + '\n')
                    try:
                        # grab article text, headline, date, and comment count
                        text = str(driver.find_element_by_class_name("post_text").text)
                        headline = driver.find_element_by_tag_name('h1').text
                        dates = driver.find_elements_by_class_name("post_time_source")
                        for d_el in dates:
                            date = str(d_el.text)
                            dt = date.split(" 来源")[0]
                            dt2 = dt.replace(":", "_").replace("-", "_").replace(" ", "_")

                        count = driver.find_element_by_class_name("post_tie_top").text

                        with open('SCS_DATA/' + dt2 + '_' + ID + '_INT_' + count + '_WY.txt', 'w') as d:
                            d.write(headline)
                            d.write(text + '\n')
                        path = 'SCS_DATA/' + ID
                        os.mkdir(path)

                    except NoSuchElementException as exception:
                        print("Element not found")
                except IndexError as g:
                    print("Index Error")

            node = [url, tb_link]
            results.append(node)

        except NoSuchElementException as exception:
            print("TB link not found")
        continue

    except TimeoutException as ex:
        print("Page load time out")

    except WebDriverException:
        print('WD Exception')

I want the code to move through a list of URLs, calling each one and grabbing the article text as well as a link to the discussion page. It works until a page times out on loading; then the programme will not move on.

I can't exactly understand what your code is doing because I have no context for the page you are automating, but I can provide a general structure for how you would accomplish something like this. Here's a simplified version of how I would handle your scenario:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# iterate URL list
for url in url_list:

    # navigate to a URL
    driver.get(url)

    # check something here to test if a link is 'broken' or not --
    # an explicit wait raises TimeoutException when the element never
    # appears (a plain find_element would raise NoSuchElementException
    # instead, so it would never hit the except below).
    # the post_text class from your code is used as the check here
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "post_text")))

    # if link is broken, go back
    except TimeoutException:
        driver.back()
        # continue so we can return to beginning of loop
        continue

    # if you reach this point, the link is valid, and you can 'do stuff' on the page

This code navigates to the URL and performs some check (that you specify) to see if the link is 'broken' or not. We check for a broken link by catching the TimeoutException that gets thrown. If the exception is thrown, we navigate to the previous page, then call continue to return to the beginning of the loop and start over with the next URL.
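One note on your original symptom of the browser staying stuck on a half-loaded page: when driver.get() itself raises the TimeoutException (because a page load timeout is set), you can stop the in-flight load and keep using the same browser session instead of quitting it. This window.stop() pattern is a suggestion of mine, not part of the structure above:

from selenium.common.exceptions import TimeoutException

for url in url_list:
    try:
        driver.get(url)
    except TimeoutException:
        # halt whatever is still loading so the session stays usable,
        # then move straight on to the next URL
        driver.execute_script("window.stop();")
        continue
    # ... scrape the page here ...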

If we make it through the try / except block, then the URL is valid and we are on the correct page. At this point, you can write your code to scrape the articles or whatever else you need to do.

The code that appears after the try / except will ONLY be hit if TimeoutException is NOT encountered -- meaning the URL is valid.
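Finally, on the driver.quit() / driver.close() attempt mentioned in the question: quit() ends the browser session entirely, so the existing driver object cannot be reused on the next loop iteration. If you really do want a fresh browser after every timeout, the driver has to be recreated inside the loop. A sketch, assuming Chrome and a 30-second timeout:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

for url in url_list:
    try:
        driver.get(url)
    except TimeoutException:
        driver.quit()                    # the session is gone after this
        driver = webdriver.Chrome()      # so build a new one before continuing
        driver.set_page_load_timeout(30)
        continue
    # ... scrape the page here ...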
