How to scroll down and click a button to keep scraping a page in Python

Question

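I want to scrape the whole page to get the account links, but there are two problems:

I need to click the "Load More" button many times to get the full list of accounts to scrape.
A popup appears occasionally, so how do I detect it and click its cancel button?

If possible, I would prefer to scrape the whole page with requests alone. Because I have to click a button, I thought of using Selenium.

Here is my code: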

import time
import requests
from bs4 import BeautifulSoup
import lxml
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://society6.com/franciscomffonseca/followers')

time.sleep(3)  # give the page and any popup time to load

try:
    driver.find_element(By.CLASS_NAME, 'bx-button').click()  # button to remove popup
except WebDriverException:
    print("no popups")

driver.find_element(By.CLASS_NAME, 'loadMore').click()  # click the Load More button once

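I am using a test page with 10K followers and want to scrape the links to those followers' accounts. I have already written the scraper, so I only need to get the fully loaded page: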

https://society6.com/franciscomffonseca/followers

The scraping code, in case it helps:

r2 = requests.get('https://society6.com/franciscomffonseca/followers')
print(r2.status_code)
r2.raise_for_status()

soup2 = BeautifulSoup(r2.content, "html.parser")
a2_tags = soup2.find_all(attrs={"class": "user"})

#attrs={"class": "user-list clearfix"}

follow_accounts = []

for a2 in a2_tags:
    follow_accounts.append('https://society6.com'+a2['href'])

print(follow_accounts)
print("number of accounts scraped: " + str(len(follow_accounts)))

HTML of the Load More button:

 <button class="loadMore" onclick="loadMoreFollowers();">Load More</button>
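
The two steps described in the question (dismiss the popup whenever it shows up, then click Load More until nothing is left to load) could be sketched with Selenium roughly as below. This is only a sketch under assumptions: the function name `load_all_followers` and its `max_clicks` and `pause` parameters are made up for illustration, the class names `bx-button` and `loadMore` are taken from the question, and it is assumed that the Load More button is removed or hidden once every follower has been loaded, which has not been verified against the site.

```python
import time

# Locator strategy string; selenium's By.CLASS_NAME constant has this same value.
CLASS_NAME = "class name"


def load_all_followers(driver, max_clicks=1000, pause=1.0):
    """Click 'Load More' until it is gone or hidden; return the number of clicks.

    `driver` is expected to be a Selenium WebDriver that has already opened
    the followers page. `max_clicks` is a safety cap against endless loops.
    """
    clicks = 0
    while clicks < max_clicks:
        # find_elements returns an empty list (no exception) when nothing matches,
        # so the popup check is cheap to repeat on every pass.
        for popup_button in driver.find_elements(CLASS_NAME, "bx-button"):
            if popup_button.is_displayed():
                popup_button.click()

        buttons = [b for b in driver.find_elements(CLASS_NAME, "loadMore")
                   if b.is_displayed()]
        if not buttons:
            break  # assumption: no visible button means everything is loaded

        # Scroll the button into view before clicking it.
        driver.execute_script("arguments[0].scrollIntoView();", buttons[0])
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # crude wait for the next batch to render
    return clicks
```

With a real browser this would be called as `load_all_followers(driver)` right after `driver.get(...)`, and `driver.page_source` could then be handed to BeautifulSoup in place of `r2.content`. A fixed `time.sleep` is the simplest wait; an explicit wait would be more robust but needs knowledge of how the page signals that a batch has finished loading.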

Answer 1

You can send requests directly to the Society6 API, as follows:

import requests

counter = 1  # page number to request

while True:
    source = requests.get('https://society6.com/api/users/franciscomffonseca/followers?page=%s' % counter).json()
    if source['data']['attributes']['followers']:
        for i in source['data']['attributes']['followers']:
            print(i['card']['link']['href'])
        counter += 1
    else:
        # an empty followers list is treated as the end of the pages
        break

This prints the relative hrefs:

/wickedhonna
/wiildrose
/williamconnolly
/whiteca1x

If you want absolute hrefs, just replace

print(i['card']['link']['href'])

with

print("https://society6.com" + i['card']['link']['href'])
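
To end up with a list like the `follow_accounts` list in the question instead of printed lines, the same paging loop can be wrapped in a function. The following is a minimal sketch, assuming the API response keeps the shape used above (`data` → `attributes` → `followers`, each entry with `card` → `link` → `href`). The names `fetch_page`, `collect_follower_links` and `max_pages` are illustrative and not part of any Society6 or requests API.

```python
BASE_URL = "https://society6.com"
API_URL = BASE_URL + "/api/users/%s/followers"


def fetch_page(username, page):
    """Return the list of follower entries for one page of the API."""
    # Imported here so the paging logic below can be exercised with a
    # stand-in fetcher even where requests is not installed.
    import requests

    response = requests.get(API_URL % username, params={"page": page}, timeout=10)
    response.raise_for_status()
    return response.json()["data"]["attributes"]["followers"]


def collect_follower_links(username, fetch=fetch_page, max_pages=10000):
    """Walk the pages until one comes back empty; return absolute profile URLs."""
    links = []
    for page in range(1, max_pages + 1):
        followers = fetch(username, page)
        if not followers:
            break  # empty page: assumed to mean there are no more followers
        links.extend(BASE_URL + f["card"]["link"]["href"] for f in followers)
    return links
```

Calling `collect_follower_links("franciscomffonseca")` would then return the absolute links in one list. The `max_pages` cap is only a guard against looping forever if the API never returns an empty page.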

Solution 1
4, accepted, 2018-09-17 09:00:47