Scrapy Spider save to csv
I am trying to scrape a website and save the information, and I have two issues at the moment.

For one, when I am using Selenium to click buttons (in this case a "load more results" button), it does not keep clicking until the end, and I can't figure out why.

The other issue is that the parse_article function is not saving anything to the CSV file.

Here is my code:
import csv

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By


class ProductSpider(scrapy.Spider):
    name = "Southwestern"
    allowed_domains = ['www.reuters.com']
    start_urls = [
        'https://www.reuters.com/search/news?blob=National+Health+Investors%2c+Inc.']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            next = self.driver.find_element_by_class_name(
                "search-result-more-txt")
            #next = self.driver.find_element_by_xpath('//*[@id="content"]/section[2]/div/div[1]/div[4]/div/div[4]/div[1]')
            # maybe do it with this
            #button2 = driver.find_element_by_xpath("//*[contains(text(), 'Super')]")
            try:
                next.click()
                # get the data and write it to scrapy items
            except:
                break

        SET_SELECTOR = '.search-result-content'
        for articles in self.driver.find_elements(By.CSS_SELECTOR, SET_SELECTOR):
            item = {}
            # get the date
            item["date"] = articles.find_element_by_css_selector('h5').text
            # title
            item["title"] = articles.find_element_by_css_selector('h3 a').text
            item["link"] = articles.find_element_by_css_selector(
                'a').get_attribute('href')
            print(item["link"])
            yield scrapy.Request(url=item["link"], callback=self.parse_article,
                                 meta={'item': item})
        self.driver.close()

    def parse_article(self, response):
        item = response.meta['item']
        texts = response.xpath(
            "//div[contains(@class, 'StandardArticleBody')]//text()").extract()
        if "National Health Investors" in texts:
            row = [item["date"], item["title"], item["link"]]
            with open('Websites.csv', 'w') as outcsv:
                writer = csv.writer(outcsv)
                writer.writerow(row)
Try using an implicit or explicit wait:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# An implicit wait tells WebDriver to poll the DOM for a certain amount of time
# when trying to find any element (or elements) not immediately available.
driver.implicitly_wait(implicit_wait)

# An explicit wait is code you define to wait for a certain condition to occur
# before proceeding further in the code.
wait = WebDriverWait(self.driver, <time in seconds>)
wait.until(EC.presence_of_element_located((By.XPATH, button_xpath)))
Your first issue looks like the button simply hasn't appeared yet when you try to click it, so waiting for it should help. One more thing: make sure you close the driver when Scrapy is shutting down.
Your second issue is that you open and write to the file many times, and since you open with the 'w' flag each time, you keep overwriting the existing contents. Even with the 'a' flag, e.g. open(FILE_NAME, 'a'), opening the file inside a callback is not good practice in Scrapy.
Instead, try to create an Item, populate it, and then use the Pipelines mechanism to save items to the CSV file.
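A minimal sketch of that pipeline approach (the file name and field order are taken from your spider; the class and module names are just examples). The file is opened once when the spider starts and closed once at the end, so nothing gets overwritten between items:

```python
import csv


class CsvWriterPipeline:
    """Writes each scraped item as one row of Websites.csv."""

    def open_spider(self, spider):
        # Open the file once, when the spider starts.
        self.file = open('Websites.csv', 'w', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # Append one row per item; the file handle stays open between items.
        self.writer.writerow([item["date"], item["title"], item["link"]])
        return item

    def close_spider(self, spider):
        self.file.close()
```

Then enable it in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.CsvWriterPipeline': 300} (adjust the module path to your project).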