How to scrape a multi-page website with Python and export the data into a .csv file?
I would like to scrape the following website using Python and need to export the scraped data into a CSV file:
http://www.swisswine.ch/en/producer?search=&&
The relevant search results on this website span 154 pages. I need to visit every page and scrape its data, but my script cannot move on to the next page; it only scrapes one page of data.
Here I set the condition i < 153, but the script only ran for the 154th page and gave me 10 records. I need the data from the 1st page to the 154th page.
How can I scrape all the data from every page in a single run of the script, and how can I export the data as a CSV file? My script is as follows:
import csv
import requests
from bs4 import BeautifulSoup

i = 0
while i < 153:
    url = ("http://www.swisswine.ch/en/producer?search=&&&page=" + str(i))
    r = requests.get(url)
    i=+1
    r.content
    soup = BeautifulSoup(r.content)
    print(soup.prettify())

g_data = soup.find_all("ul", {"class": "contact-information"})
for item in g_data:
    print(item.text)
You should put your HTML parsing code under the loop as well. And you are not incrementing the i variable correctly (thanks @MattDMo):
import csv
import requests
from bs4 import BeautifulSoup

i = 0
while i < 153:
    url = ("http://www.swisswine.ch/en/producer?search=&&&page=" + str(i))
    r = requests.get(url)
    i += 1

    soup = BeautifulSoup(r.content)
    print(soup.prettify())

    g_data = soup.find_all("ul", {"class": "contact-information"})
    for item in g_data:
        print(item.text)
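The question also asks how to export the data as a CSV file, which the loop above only prints. A minimal sketch with the standard csv module (the rows and column names here are made-up placeholders; adapt them to whatever fields you actually extract from each item):

```python
import csv

# Hypothetical rows you would collect in the scraping loop
# instead of calling print(item.text)
rows = [
    ["Domaine Example", "1234 Sion"],
    ["Cave Test", "5678 Sierre"],
]

# newline="" prevents blank lines between rows on Windows
with open("producers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "address"])  # header row
    writer.writerows(rows)
```

Collect the rows inside the while loop and write the file once, after the loop finishes, so all 154 pages end up in one CSV.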
I would also improve the following:

use requests.Session() to maintain a web-scraping session, which will also bring a performance boost:

if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
be explicit about the underlying parser for BeautifulSoup:

soup = BeautifulSoup(r.content, "html.parser")  # or "lxml", or "html5lib"