
scrapy empty csv file

I'm trying to run my scrapy spider. It returns no errors, but it outputs an empty csv file.

I'm launching the spider through the command line: scrapy crawl AnimeReviews -o AnimeReviews.csv -t csv

These are the imports I used:

import scrapy
import json
from functools import reduce
from scrapy.selector import Selector
from AnimeReviews.items import AnimereviewsItem
last_page = 1789

This is my spider:

class AnimeReviewsSpider(scrapy.Spider):
    name = 'AnimeReviews_spider'
    allowed_urls =['myanimelist.net']
    start_urls = ['https://myanimelist.net/reviews.php?t=anime']

    def parse(self, response):
        page_urls = [response.url + "&p=" + str(pageNumber) for pageNumber in range(1, last_page+1)]
        for page_url in page_urls:
            yield scrapy.Request(page_url,
                callback = self.parse_reviews_page)

    def parse_reviews_page(self, response):
        item = AnimereviewsItem()
        reviews = response.xpath('//*[@class="borderDark pt4 pb8 pl4 pr4 mb8"]').extract()       #each page displays 50 reviews

        for review in reviews:
            anime_title = Selector(text = review).xpath('//div[1]/a[1]/strong/text()').extract()
            anime_url = Selector(text = review).xpath('//a[@class="hoverinfo_trigger"]/@href').extract()
            anime_url = map(lambda x: 'https://myanimelist.net'+ x ,anime_url)
            review_time = Selector(text = review).xpath('//*[@style="float: right;"]/text()').extract()[0]
            reviewer_name = Selector(text = review).xpath('//div[2]/table/tr/td[2]/a/text()').extract()
            rating = Selector(text = review).xpath('//div[2]/table/tr/td[3]/div[2]/text()').extract()
            for i in range(len(rating)):
                rating_temp = rating[i]
                rating[i] = rating_temp.split(" ")[1]
            review_text = Selector(text = review).xpath('//*[@class="spaceit textReadability word-break"]').extract()
            for i in range(len(review_text)):
                text = Selector(text = review_text[i]).xpath('//text()').extract()
            pic_url = Selector(text = review).xpath('//div[3]/div[1]/div[1]/a/img/@data-src').extract()
            item['anime_title'] = anime_title
            item['anime_url'] = anime_url
            item['review_time'] = review_time
            item['reviewer'] = reviewer_name
            item['rating'] = rating
            item['review_text'] = review_text
            item['pic_url'] = pic_url
            yield item

This is the log after crawling:

2018-06-22 13:37:14 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-22 13:37:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 698849,
 'downloader/request_count': 1791,
 'downloader/request_method_count/GET': 1791,
 'downloader/response_bytes': 148209070,
 'downloader/response_count': 1791,
 'downloader/response_status_count/200': 1791,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 6, 22, 11, 37, 14, 546133),
 'log_count/DEBUG': 1792,
 'log_count/INFO': 13,
 'request_depth_max': 1,
 'response_received_count': 1791,
 'scheduler/dequeued': 1790,
 'scheduler/dequeued/memory': 1790,
 'scheduler/enqueued': 1790,
 'scheduler/enqueued/memory': 1790,
 'start_time': datetime.datetime(2018, 6, 22, 11, 30, 38, 403920)}
2018-06-22 13:37:14 [scrapy.core.engine] INFO: Spider closed (finished)

If you need more information, let me know.

The biggest problem here is your xpath expressions.
They look like they were automatically generated, and they are too specific.

For example, not even your xpath for reviews matches anything.
Something as simple as //div[@class="borderDark"] matches all 50 reviews on a page, as does the CSS expression .borderDark.
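As a minimal sketch of that simpler selection (note there is no .extract() call, so the result stays a list of selectors rather than strings):

reviews = response.xpath('//div[@class="borderDark"]')
# equivalently, with a CSS selector
reviews = response.css('.borderDark')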

I would recommend getting familiar with xpath and/or css selectors, and writing your selectors by hand.

Also, you're converting selectors to text (using .extract), and then back to selectors (using Selector). There's no need to do that; you can simply work with the selectors returned by .xpath.
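Putting both points together, a reworked parse_reviews_page might look like the sketch below. It iterates over the review blocks as selectors and uses relative (.//) expressions scoped to each block; the specific relative paths, and the omission of the rating field, are illustrative assumptions rather than expressions verified against the live page.

def parse_reviews_page(self, response):
    # Each selector in this list is one review block (the .borderDark suggestion above).
    for review in response.css('.borderDark'):
        item = AnimereviewsItem()
        # Relative expressions (".//...") only search inside this review block.
        # The exact paths are illustrative and may need adjusting to the real markup.
        item['anime_title'] = review.xpath('.//a[contains(@class, "hoverinfo_trigger")]/strong/text()').extract_first()
        url = review.xpath('.//a[contains(@class, "hoverinfo_trigger")]/@href').extract_first()
        item['anime_url'] = response.urljoin(url) if url else None
        item['review_time'] = review.xpath('.//*[@style="float: right;"]/text()').extract_first()
        item['reviewer'] = review.xpath('.//td[2]/a/text()').extract_first()
        item['review_text'] = review.xpath('.//*[contains(@class, "textReadability")]//text()').extract()
        item['pic_url'] = review.xpath('.//img/@data-src').extract_first()
        yield item

Because every expression is relative to review, each item only picks up data from its own block, instead of the first match on the whole page being repeated for all 50 reviews.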
