
how to scrape multipage website with python and export data into .csv file?

I would like to scrape the following website using python and need to export the scraped data into a CSV file:

http://www.swisswine.ch/en/producer?search=&&

This website consists of 154 pages of relevant search results. I need to call every page and scrape the data, but my script can't move on to the next pages continuously. It only scrapes one page of data.

Here I assign the value i < 153, so this script runs only for the 154th page and gives me 10 records. I need data from the 1st to the 154th page.

How can I scrape all the data from every page in one run of the script, and how can I export the data as a CSV file?

My script is as follows:

import csv
import requests
from bs4 import BeautifulSoup
i = 0
while i < 153:       
     url = ("http://www.swisswine.ch/en/producer?search=&&&page=" + str(i))
     r = requests.get(url)
     i=+1
     r.content

soup = BeautifulSoup(r.content)
print (soup.prettify())


g_data = soup.find_all("ul", {"class": "contact-information"})
for item in g_data:
      print(item.text)

You should put your HTML parsing code under the loop as well. And you are not incrementing the i variable correctly (thanks @MattDMo):

import csv
import requests
from bs4 import BeautifulSoup

i = 0
while i < 153:
    url = "http://www.swisswine.ch/en/producer?search=&&&page=" + str(i)
    r = requests.get(url)
    i += 1

    soup = BeautifulSoup(r.content)
    print(soup.prettify())

    g_data = soup.find_all("ul", {"class": "contact-information"})
    for item in g_data:
        print(item.text)

I would also improve the following:

  • use requests.Session() to maintain a web-scraping session, which will also bring a performance boost:

    if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase

  • be explicit about the underlying parser for BeautifulSoup:

     soup = BeautifulSoup(r.content, "html.parser") # or "lxml", or "html5lib" 
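Putting the suggestions together with the CSV export the question asks for, a minimal sketch might look like the following. The output filename producers.csv, the single header column name, and the assumption that pages are numbered 0 through 153 are mine, not confirmed by the site:

```python
import csv
import requests
from bs4 import BeautifulSoup

# Assumed URL pattern and page range (0..153) based on the question.
BASE_URL = "http://www.swisswine.ch/en/producer?search=&&&page="

def extract_contacts(html):
    """Parse one page of HTML and return the text of each contact-information list."""
    soup = BeautifulSoup(html, "html.parser")
    return [item.get_text(" ", strip=True)
            for item in soup.find_all("ul", {"class": "contact-information"})]

def scrape_to_csv(path="producers.csv", pages=154):
    """Fetch every page with one Session and write all contacts to a CSV file."""
    session = requests.Session()  # reuses the TCP connection between requests
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["contact_information"])  # assumed header name
        for page in range(pages):
            r = session.get(BASE_URL + str(page))
            for text in extract_contacts(r.content):
                writer.writerow([text])
```

Separating the parsing into its own function keeps the network loop small and lets you test the extraction against saved HTML without hitting the site.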
