
Webscraping using Selenium, BeautifulSoup and Python

I am currently scraping a real estate website that uses JavaScript. My process starts by scraping a list containing many different href links for single listings, appending these links to another list, and then pressing the next button. I do this until the next button is no longer clickable.

My problem is that after collecting all the listings (~13000 links), the scraper doesn't move on to the second part, where it opens the links and gets the info I need. Selenium doesn't even open the first element of the list of links.

Here's my code:

wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html,'html.parser')
        table = soup.find(id = 'search_main_div')
        classtitle =  table.find_all('p', class_= 'title')
        for aaa in classtitle:
            hrefsyo =  aaa.find('a', href = True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        pass

After this I have another simple scraper that goes through the list of listings, opens each one in Selenium, and collects data on that listing.

for links in houselinklist:
    print(links)
    newwebpage = links
    driver.get(newwebpage)
    html = driver.page_source
    soup = bs.BeautifulSoup(html,'html.parser')
    # ... more code here

The problem is that while True: creates a loop that runs forever. Your except clause has a pass statement, which means that once an error occurs (for example, when no clickable 'next' link is found and the wait times out), the loop just keeps running. Instead, it can be written as:

wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html,'html.parser')
        table = soup.find(id = 'search_main_div')
        classtitle =  table.find_all('p', class_= 'title')
        for aaa in classtitle:
            hrefsyo =  aaa.find('a', href = True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        break # change this to exit loop

Once an error occurs, the loop will break and execution moves on to the next line of code.
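A more targeted variant would catch only Selenium's TimeoutException, so that only a missing 'next' link ends the loop and any unexpected error still surfaces. A minimal sketch:

from selenium.common.exceptions import TimeoutException

wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
    except TimeoutException:
        break  # no clickable 'next' link within 10 seconds, so this was the last page
    html = driver.page_source
    soup = bs.BeautifulSoup(html, 'html.parser')
    table = soup.find(id='search_main_div')
    classtitle = table.find_all('p', class_='title')
    for aaa in classtitle:
        hrefsyo = aaa.find('a', href=True)
        houselinklist.append(hrefsyo.get('href'))
    element.click()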

Or you can eliminate the while loop and just loop over your list of href links with a for loop:

wait = WebDriverWait(driver, 10)
hrefLinks = ['link1','link2','link3'.....]
for link in hrefLinks:
    try:
        driver.get(link)
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html,'html.parser')
        table = soup.find(id = 'search_main_div')
        classtitle =  table.find_all('p', class_= 'title')
        for aaa in classtitle:
            hrefsyo =  aaa.find('a', href = True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        pass #pass on error and move on to next hreflink
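For the second part of your script (opening each collected link and pulling the data out of it), here is a minimal sketch of what the extraction could look like. The class names 'price' and 'address' are hypothetical placeholders; the real selectors depend on the markup of the listing pages.

listingdata = []
for links in houselinklist:
    driver.get(links)
    soup = bs.BeautifulSoup(driver.page_source, 'html.parser')
    # 'price' and 'address' are hypothetical class names; adjust them to the real markup
    price_tag = soup.find('p', class_='price')
    address_tag = soup.find('p', class_='address')
    listingdata.append({
        'url': links,
        'price': price_tag.get_text(strip=True) if price_tag else None,
        'address': address_tag.get_text(strip=True) if address_tag else None,
    })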
