Scrapy crawler not getting data after crawling

I'm still new to Scrapy. When I run my code, the debug log finishes without any errors, but when I look at how much data it has scraped, it's zero — that shouldn't happen, should it? My code is below. I'm trying to get reviews from TripAdvisor.

import HTMLParser
import unicodedata
import re
import time

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule



class scrapingtestSpider(CrawlSpider):
    name = "scrapingtest"

    allowed_domains = ["tripadvisor.com"]
    base_uri = "http://www.tripadvisor.com"
    start_urls = [
        base_uri + "/RestaurantSearch?geo=60763&q=New+York+City%2C+New+York&cat=&pid="
    ]




htmlparser = HTMLParser.HTMLParser()

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

def clean_parsed_string(string):
    if len(string) > 0:
        ascii_string = string
        if is_ascii(ascii_string) == False:
            ascii_string = unicodedata.normalize('NFKD', ascii_string).encode('ascii', 'ignore')
        return str(ascii_string)
    else:
        return None

def get_parsed_string(selector, xpath):
    return_string = ''
    extracted_list = selector.xpath(xpath).extract()
    if len(extracted_list) > 0:
        raw_string = extracted_list[0].strip()
        if raw_string is not None:
            return_string = htmlparser.unescape(raw_string)
    return return_string

def get_parsed_string_multiple(selector, xpath):
    return_string = ''
    return selector.xpath(xpath).extract()


def parse(self, response):
    tripadvisor_items = []

    sel = Selector(response)
    snode_restaurants = sel.xpath('//div[@id="EATERY_SEARCH_RESULTS"]/div[starts-with(@class, "listing")]')

    # Build item index.
    for snode_restaurant in snode_restaurants:
        # Cleaning string and taking only the first part before whitespace.
        snode_restaurant_item_avg_stars = clean_parsed_string(get_parsed_string(snode_restaurant, 'div[@class="wrap"]/div[@class="entry wrap"]/div[@class="description"]/div[@class="wrap"]/div[@class="rs rating"]/span[starts-with(@class, "rate")]/img[@class="sprite-ratings"]/@alt'))
        tripadvisor_item['avg_stars'] = re.match(r'(\S+)', snode_restaurant_item_avg_stars).group()

        # Popolate reviews and address for current item.
        yield Request(url=tripadvisor_item['url'], meta={'tripadvisor_item': tripadvisor_item}, callback=self.parse_search_page)





def parse_fetch_review(self, response):
        tripadvisor_item = response.meta['tripadvisor_item']
        sel = Selector(response)

        counter_page_review = response.meta['counter_page_review']

            # TripAdvisor reviews for item.
        snode_reviews = sel.xpath('//div[@id="REVIEWS"]/div/div[contains(@class, "review")]/div[@class="col2of2"]/div[@class="innerBubble"]')

        # Reviews for item.
        for snode_review in snode_reviews:
            tripadvisor_review_item = ScrapingtestreviewItem()

            tripadvisor_review_item['title'] = clean_parsed_string(get_parsed_string(snode_review, 'div[@class="quote"]/text()'))

            # Review item description is a list of strings.
            # Strings in list are generated parsing user intentional newline. DOM: <br>
            tripadvisor_review_item['description'] = get_parsed_string_multiple(snode_review, 'div[@class="entry"]/p/text()')
            # Cleaning string and taking only the first part before whitespace.
            snode_review_item_stars = clean_parsed_string(get_parsed_string(snode_review, 'div[@class="rating reviewItemInline"]/span[starts-with(@class, "rate")]/img/@alt'))
            tripadvisor_review_item['stars'] = re.match(r'(\S+)', snode_review_item_stars).group()

            snode_review_item_date = clean_parsed_string(get_parsed_string(snode_review, 'div[@class="rating reviewItemInline"]/span[@class="ratingDate"]/text()'))
            snode_review_item_date = re.sub(r'Reviewed ', '', snode_review_item_date, flags=re.IGNORECASE)
            snode_review_item_date = time.strptime(snode_review_item_date, '%B %d, %Y') if snode_review_item_date else None
            tripadvisor_review_item['date'] = time.strftime('%Y-%m-%d', snode_review_item_date) if snode_review_item_date else None

            tripadvisor_item['reviews'].append(tripadvisor_review_item)

Here is the debug log:

C:\Users\smash_000\Desktop\scrapingtest\scrapingtest>scrapy crawl scrapingtest -o items.json
C:\Users\smash_000\Desktop\scrapingtest\scrapingtest\spiders\scrapingtest_spider.py:6: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
  from scrapy.spider import BaseSpider
C:\Users\smash_000\Desktop\scrapingtest\scrapingtest\spiders\scrapingtest_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders import CrawlSpider, Rule
2015-07-14 11:07:04 [scrapy] INFO: Scrapy 1.0.1 started (bot: scrapingtest)
2015-07-14 11:07:04 [scrapy] INFO: Optional features available: ssl, http11
2015-07-14 11:07:04 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapingtest.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['scrapingtest.spiders'], 'FEED_URI': 'items.json', 'BOT_NAME': 'scrapingtest'}
2015-07-14 11:07:04 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-14 11:07:05 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-14 11:07:05 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-14 11:07:05 [scrapy] INFO: Enabled item pipelines:
2015-07-14 11:07:05 [scrapy] INFO: Spider opened
2015-07-14 11:07:05 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-14 11:07:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-14 11:07:06 [scrapy] DEBUG: Crawled (200) <GET http://www.tripadvisor.com/RestaurantSearch?geo=60763&q=New+York+City%2C+New+York&cat=&pid=> (referer: None)
2015-07-14 11:07:06 [scrapy] INFO: Closing spider (finished)
2015-07-14 11:07:06 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 281,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 46932,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 14, 5, 37, 6, 929000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 7, 14, 5, 37, 5, 474000)}
2015-07-14 11:07:06 [scrapy] INFO: Spider closed (finished)

Have you tried debugging your code with print statements?

I tried to run your parser. If I copy the provided code as-is, I get the same result, because the spider class scrapingtestSpider has no parse method, so it never gets called.

If I apply some formatting to your code (I indented everything below start_urls into the class), I get errors saying that the helper methods are not defined by their global names.

If I go further and leave only the parse method on the crawler, I run into other errors mentioning that tripadvisor_item is not defined.

Try formatting the code properly in an IDE, and add print messages to the parse method to see whether it gets called. When Scrapy crawls the first URL, it should enter the main parse method. As the code stands, I don't think that happens.
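The indentation point can be illustrated without Scrapy at all: a def at module level is a plain function, not a method of the class above it, so the framework's lookup for spider.parse finds nothing. A minimal sketch (plain Python, hypothetical class names):

```python
class BrokenSpider(object):
    name = "broken"

# Defined at module level: this is a standalone function,
# NOT a method of BrokenSpider, so an attribute lookup for
# BrokenSpider.parse finds nothing.
def parse(self, response):
    return "parsed"


class FixedSpider(object):
    name = "fixed"

    # Indented inside the class body, so an attribute lookup
    # on the instance (spider.parse) succeeds.
    def parse(self, response):
        return "parsed"


print(hasattr(BrokenSpider, "parse"))  # False
print(hasattr(FixedSpider, "parse"))   # True
```

This is exactly why the spider finishes with "scraped 0 items": Scrapy downloads the start URL, finds no parse callback on the spider, and closes.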

Incidentally, the callback passed to the Request is also badly named:

yield Request(url=tripadvisor_item['url'], meta={'tripadvisor_item': tripadvisor_item}, callback=self.parse_search_page)

should be changed to

yield Request(url=tripadvisor_item['url'], meta={'tripadvisor_item': tripadvisor_item}, callback=self.parse_fetch_review)

once you fix the indentation issues.

Also, at the end of the parse_fetch_review method, return or yield the tripadvisor_item you created in the parse method.
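The string handling the spider depends on can also be checked in isolation, independent of the crawl. A small sketch of the stars and date parsing (the sample strings here are made up; the real alt text and date format come from TripAdvisor's markup):

```python
import re
import time

# Hypothetical alt text scraped from the ratings image.
raw_stars = "4.5 of 5 stars"
# Take only the first token before whitespace, as the spider does.
stars = re.match(r'(\S+)', raw_stars).group()

# Hypothetical ratingDate text after stripping the "Reviewed " prefix.
raw_date = "July 14, 2015"
parsed = time.strptime(raw_date, '%B %d, %Y')
date = time.strftime('%Y-%m-%d', parsed)

print(stars)  # 4.5
print(date)   # 2015-07-14
```

Verifying these pieces separately with print statements makes it much easier to tell whether the problem is the XPath selectors or, as here, the methods simply never being called.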
