How to scroll down and click a button for continuous web scraping of a page in Python
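I want to scrape an entire page to collect account links, but there are two problems:
I need to click the "Load more" button many times to get the full list of accounts to scrape.
Occasionally a popup appears, so how do I detect it and click its cancel button?
If possible, I would prefer to scrape the whole page with requests alone. I turned to Selenium only because I have to click buttons.
Here is my code: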
import time
import requests
from bs4 import BeautifulSoup
import lxml
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://society6.com/franciscomffonseca/followers')
time.sleep(3)
try:
    # button that removes the popup
    driver.find_element(By.CLASS_NAME, 'bx-button').click()
except NoSuchElementException:
    print("no popups")
# click the Load More button
driver.find_element(By.CLASS_NAME, 'loadMore').click()
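I am testing on a page with 10K followers and want to scrape the links to the follower accounts. The scraper itself is already written, so all I need is the fully loaded page: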
https://society6.com/franciscomffonseca/followers
The scraping code, in case it helps:
r2 = requests.get('https://society6.com/franciscomffonseca/followers')
print(r2.status_code)
r2.raise_for_status()
soup2 = BeautifulSoup(r2.content, "html.parser")
a2_tags = soup2.find_all(attrs={"class": "user"})
#attrs={"class": "user-list clearfix"}
follow_accounts = []
for a2 in a2_tags:
    follow_accounts.append('https://society6.com'+a2['href'])
print(follow_accounts)
print("number of accounts scraped: " + str(len(follow_accounts)))
HTML of the Load More button:
<button class="loadMore" onclick="loadMoreFollowers();">Load More</button>
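For the Selenium route described above, the click loop could be sketched roughly as follows. This is only an illustration under several assumptions: the helper names `first_visible` and `click_load_more_until_done` are made up here, the Load More button is assumed to disappear once every follower is loaded, and a fixed pause stands in for a proper wait. The locator string "class name" is the value of Selenium's `By.CLASS_NAME`; `find_elements` is used because it returns an empty list rather than raising when nothing matches.

```python
import time


def first_visible(driver, class_name):
    """Return the first displayed element with this class, or None."""
    # "class name" is the string value of selenium's By.CLASS_NAME.
    for element in driver.find_elements("class name", class_name):
        if element.is_displayed():
            return element
    return None


def click_load_more_until_done(driver, max_clicks=500, pause=1.0):
    """Click 'Load More' repeatedly and return how many clicks were made.

    `driver` is expected to be a Selenium WebDriver. The class names
    'bx-button' and 'loadMore' come from the question; `max_clicks` and
    `pause` are illustrative parameters, not tuned values.
    """
    clicks = 0
    while clicks < max_clicks:
        # Dismiss the popup whenever it happens to be on screen.
        popup_button = first_visible(driver, "bx-button")
        if popup_button is not None:
            popup_button.click()

        button = first_visible(driver, "loadMore")
        if button is None:
            break  # assumption: no button means everything is loaded
        button.click()
        clicks += 1
        time.sleep(pause)  # crude wait for the next batch to render
    return clicks


# Usage (needs selenium and a Chrome driver installed):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get('https://society6.com/franciscomffonseca/followers')
#   click_load_more_until_done(driver)
#   soup2 = BeautifulSoup(driver.page_source, "html.parser")
```

Once the loop finishes, `driver.page_source` holds the expanded HTML, so the existing BeautifulSoup code can parse it in place of `r2.content`.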
You can send requests directly to the Society6 API, like this:
import requests

counter = 1
while True:
    # request one page of followers at a time
    source = requests.get('https://society6.com/api/users/franciscomffonseca/followers?page=%s' % counter).json()
    if source['data']['attributes']['followers']:
        for i in source['data']['attributes']['followers']:
            print(i['card']['link']['href'])
        counter += 1
    else:
        # an empty followers list means there are no more pages
        break
This prints relative hrefs:
/wickedhonna
/wiildrose
/williamconnolly
/whiteca1x
If you want absolute hrefs, just replace
print(i['card']['link']['href'])
with
print("https://society6.com" + i['card']['link']['href'])
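If the links are needed in a list rather than printed, the same pagination idea might be wrapped in a function along these lines. This is a sketch, not a tested client for the site: it assumes the JSON layout shown in the answer, the names `fetch_page`, `collect_follower_links`, and `max_pages` are invented for the example, and the standard library's `urllib` is swapped in for `requests` so the block has no third-party dependencies. `urljoin` builds the absolute URL in place of string concatenation.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "https://society6.com"
API_URL = BASE_URL + "/api/users/{user}/followers?page={page}"


def fetch_page(user, page):
    """Fetch one page of follower data and return the parsed JSON."""
    with urlopen(API_URL.format(user=user, page=page), timeout=10) as response:
        return json.load(response)


def collect_follower_links(user, fetch=fetch_page, max_pages=1000):
    """Return absolute profile URLs for the followers of `user`.

    `fetch` is injectable so the paging logic can be exercised offline;
    `max_pages` is a safety cap chosen arbitrarily for this sketch.
    """
    links = []
    for page in range(1, max_pages + 1):
        followers = fetch(user, page)["data"]["attributes"]["followers"]
        if not followers:
            break  # an empty list marks the last page, as in the answer
        for follower in followers:
            links.append(urljoin(BASE_URL, follower["card"]["link"]["href"]))
    return links


# Usage (performs real network requests):
#   follow_accounts = collect_follower_links("franciscomffonseca")
#   print("number of accounts scraped: " + str(len(follow_accounts)))
```

Passing the fetch function as a parameter is a design choice made here for testability; calling `requests.get(...).json()` inside `fetch_page` would be an equally reasonable variant.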