How to scrape the next pages in Python using BeautifulSoup
Suppose I am scraping the URL
http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha
and it contains a number of pages with the data I want to scrape. How can I scrape the data from all of the following pages? I am using Python 3.5.1 and BeautifulSoup. Note: I can't use Scrapy or lxml, because they give me installation errors.
Determine the last page by extracting the page argument from the href of the "Go to the last page" element. Then loop over every page, maintaining a web-scraping session via requests.Session():
import re

import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    # extract the last page number from the "Go to the last page" link
    response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha")
    soup = BeautifulSoup(response.content, "html.parser")
    last_page = int(re.search(r"page=(\d+)", soup.select_one("li.pager-last").a["href"]).group(1))

    # loop over every page; + 1 because range() excludes its upper bound
    for page in range(last_page + 1):
        response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha&page=%d" % page)
        soup = BeautifulSoup(response.content, "html.parser")

        # print the title of every search result
        for result in soup.select("li.search-result"):
            title = result.find("div", class_="title").get_text(strip=True)
            print(title)
Prints:
A C S College of Engineering, Bangalore
A1 Global Institute of Engineering and Technology, Prakasam
AAA College of Engineering and Technology, Thiruthangal
...
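The page-number extraction and URL building above can also be done with the standard library's urllib.parse instead of a regular expression and string formatting. This is only a minimal sketch, not part of the original answer: the helper names last_page_from_href and page_url and the sample href are hypothetical, and it assumes the pager link carries the page number in a page query parameter.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = ("http://www.engineering.careers360.com/colleges/"
            "list-of-engineering-colleges-in-India")


def last_page_from_href(href):
    """Return the integer value of the `page` query parameter of a pager link."""
    query = parse_qs(urlparse(href).query)
    return int(query["page"][0])


def page_url(page, sort_filter="alpha"):
    """Build the listing URL for the given page number."""
    return BASE_URL + "?" + urlencode({"sort_filter": sort_filter, "page": page})


# Hypothetical href, shaped like the one read from li.pager-last in the answer
sample_href = ("/colleges/list-of-engineering-colleges-in-India"
               "?sort_filter=alpha&page=12")

print(last_page_from_href(sample_href))
print(page_url(3))
```

Parsing the query string this way keeps working if the site reorders its parameters, and urlencode formats the page number as an integer without a format specifier.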