
Scrapy Spider save to csv

I am trying to scrape a website and save the information, and I have two issues at the moment.

For one, when I am using Selenium to click buttons (in this case a "load more results" button), it does not keep clicking until the end, and I can't seem to figure out why.

And the other issue is that it is not saving to a CSV file in the parse_article function.

Here is my code:

import csv

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from selenium import webdriver
from selenium.webdriver.common.by import By


class ProductSpider(scrapy.Spider):
    name = "Southwestern"
    allowed_domains = ['www.reuters.com/']
    start_urls = [
        'https://www.reuters.com/search/news?blob=National+Health+Investors%2c+Inc.']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            # Alternative locators that were tried:
            # next = self.driver.find_element_by_xpath('//*[@id="content"]/section[2]/div/div[1]/div[4]/div/div[4]/div[1]')
            # button2 = driver.find_element_by_xpath("//*[contains(text(), 'Super')]")
            next = self.driver.find_element_by_class_name(
                "search-result-more-txt")
            try:
                next.click()

            # get the data and write it to scrapy items
            except:
                break

        SET_SELECTOR = '.search-result-content'
        for articles in self.driver.find_elements(By.CSS_SELECTOR, SET_SELECTOR):
            item = {}
            # get the date
            item["date"] = articles.find_element_by_css_selector('h5').text
            # title
            item["title"] = articles.find_element_by_css_selector('h3 a').text

            item["link"] = articles.find_element_by_css_selector(
                'a').get_attribute('href')

            print(item["link"])

            yield scrapy.Request(url=item["link"], callback=self.parse_article, meta={'item': item})
        self.driver.close()

    def parse_article(self, response):
        item = response.meta['item']

        texts = response.xpath(
            "//div[contains(@class, 'StandardArticleBody')]//text()").extract()
        if "National Health Investors" in texts:
            item = response.meta['item']
            row = [item["date"], item["title"], item["link"]]
            with open('Websites.csv', 'w') as outcsv:
                writer = csv.writer(outcsv)
                writer.writerow(row)

  1. Try to wait a bit after the click so that the data has time to load. I suppose that sometimes your script searches for the button before the new data and the new button have been displayed.

Try using an implicit or explicit wait (see the sketch after this list for how this fits into the load-more loop):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# An implicit wait tells WebDriver to poll the DOM for a certain amount of time when trying to find any element
# (or elements) not immediately available.
driver.implicitly_wait(implicit_wait)

# An explicit wait is code you define to wait for a certain condition to occur before proceeding further
# in the code.
wait = WebDriverWait(self.driver, <time in seconds>)
wait.until(EC.presence_of_element_located((By.XPATH, button_xpath)))
  2. 'w' opens the file for writing only, and an existing file with the same name will be erased. Try the 'a' (append) mode instead, though I would recommend using pipelines: link
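
Inside the spider's parse(), the explicit wait could replace the bare find_element call. Here is a minimal sketch; the 10-second timeout and the element_to_be_clickable condition are my assumptions, not something from the original post:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(self.driver, 10)  # assumed timeout of 10 seconds
while True:
    try:
        # Block until the load-more button is visible and enabled, then click it.
        more_button = wait.until(EC.element_to_be_clickable(
            (By.CLASS_NAME, "search-result-more-txt")))
        more_button.click()
    except TimeoutException:
        # No clickable button appeared within the timeout; assume all results loaded.
        break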

The first issue looks like the button hasn't appeared yet. Maybe this can aid you.

One more thing: try to close the driver when Scrapy is shutting down. Probably this can help you.
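
One common way to do that is the spider's closed() shortcut, which Scrapy calls when the spider finishes. A minimal sketch:

from selenium import webdriver
import scrapy

class ProductSpider(scrapy.Spider):
    name = "Southwestern"

    def __init__(self):
        self.driver = webdriver.Chrome()

    def closed(self, reason):
        # Runs when the spider shuts down, whatever the reason,
        # so the browser never stays open after the crawl ends.
        self.driver.quit()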

The second issue: the code opens and writes the file many times, which is not good, since each open with 'w' overwrites the existing contents. Even with the 'a' flag, e.g. open(FILE_NAME, 'a'), this is not good practice in Scrapy.

Try to create an Item, populate it, and then use the Pipelines mechanism to save the items to a CSV file. Something like here, or like the sketch below.
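
A minimal sketch of that approach, assuming a project module named myproject (both the module name and the ArticleItem field names are illustrative). parse_article would yield ArticleItem(date=..., title=..., link=...) instead of writing the file itself:

import csv

import scrapy

class ArticleItem(scrapy.Item):
    date = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()

class CsvWriterPipeline:
    def open_spider(self, spider):
        # Open the file once per crawl, so later rows never clobber earlier ones.
        self.file = open('Websites.csv', 'w', newline='')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['date', 'title', 'link'])

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Called once for every item the spider yields.
        self.writer.writerow([item['date'], item['title'], item['link']])
        return item

The pipeline then has to be enabled in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.CsvWriterPipeline': 300}.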
