
How to save the scraped results into a CSV file from multiple website pages?

I'm trying to scrape some ASINs (say, 600 of them) from the Amazon website, just the ASINs, using Selenium and BeautifulSoup. My main issue is how to save all the scraped data into a CSV file. I've tried something, but it only saves the last scraped page.

Here is the code:

from time import sleep
import requests
import time
import json
import re
import sys
import numpy as np
from selenium import webdriver
import urllib.request
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
import pandas as pd
from urllib.request import urlopen


i = 1
while(True):
    try:
        if i == 1:
            url = "https://www.amazon.es/s?k=doll&i=toys&rh=n%3A599385031&dc&page=1"
        else:
            url = "https://www.amazon.es/s?k=doll&i=toys&rh=n%3A599385031&dc&page={}".format(i)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')

        #print page url
        print(url)

        #rest of the scraping code
        driver = webdriver.Chrome()
        driver.get(url)

        HTML = driver.page_source
        HTML1=driver.page_source
        soup = BeautifulSoup(HTML1, "html.parser")
        styles = soup.find_all(name="div", attrs={"data-asin":True})
        res1 = [i.attrs["data-asin"] for i in soup.find_all("div") if i.has_attr("data-asin")]
        print(res1)
        data_record.append(res1)
        #driver.close()

        #don't overflow website
        sleep(1)

        #increase page number
        i += 1
        if i == 3:
            print("STOP!!!")
            break
    except:
        break



Removing the items that do not seem to be used at the moment, a possible solution could be:

import csv
import bs4
from selenium import webdriver
from time import sleep


def retrieve_asin_from(base_url, idx):
    """Return the data-asin values found on results page number idx."""
    url = base_url.format(idx)

    # Selenium fetches the rendered page, so a separate requests.get call is not needed.
    with webdriver.Chrome() as driver:
        driver.get(url)
        html = driver.page_source
    soup = bs4.BeautifulSoup(html, "html.parser")
    # Collect the data-asin attribute of every <div> that has one.
    res1 = [div.attrs["data-asin"]
            for div in soup.find_all("div") if div.has_attr("data-asin")]
    sleep(1)  # pause between pages to avoid overloading the site
    return res1


url = "https://www.amazon.es/s?k=doll&i=toys&rh=n%3A599385031&dc&page={}"
data_record = [retrieve_asin_from(url, i) for i in range(1, 4)]

combined_data_record = combine_records(data_record)  # combine_records is a helper still to be written

with open('asin_data.csv', 'w', newline='') as fd:
    csvfile = csv.writer(fd)
    csvfile.writerows(combined_data_record)
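
The answer leaves combine_records to be written. One possible sketch is below, assuming the goal is a single CSV column with one ASIN per row. Dropping empty data-asin values and duplicates is my assumption, not something the question or the answer specifies, and sample_pages and asin_data_sample.csv are stand-ins for the real scraped data and output file.

```python
import csv
from itertools import chain


def combine_records(records):
    """Flatten per-page ASIN lists into one row per ASIN (hypothetical helper)."""
    seen = set()
    rows = []
    for asin in chain.from_iterable(records):
        # Assumption: skip empty values and ASINs already seen on an earlier page.
        if asin and asin not in seen:
            seen.add(asin)
            rows.append([asin])  # csv.writer expects each row to be a sequence
    return rows


# Stand-in data in the shape produced by one retrieve_asin_from call per page.
sample_pages = [["B000000001", "", "B000000002"], ["B000000002", "B000000003"]]
rows = combine_records(sample_pages)

with open("asin_data_sample.csv", "w", newline="") as fd:
    writer = csv.writer(fd)
    writer.writerow(["asin"])  # header row
    writer.writerows(rows)
```

Because every page's results are collected into one list before the file is opened, the file is written once with all pages in it, which avoids the "only the last page is saved" symptom that comes from reopening the file in 'w' mode inside the loop.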
