
python scrapy not crawling all urls in scraped list

I am trying to scrape information from the pages listed on this page: https://pardo.ch/pardo/program/archive/2017/catalog-films.html

The XPath selector:

film_page_urls_startpage = sel.xpath('//article[@class="strip-list_link_all strip-list strip--color row row--5"]/a/@href').extract()

correctly scrapes all 23 URLs. However, the spider doesn't even appear to try crawling all 23; it crawls only 11, and the same 11 each time. Since I'm using Selenium, I can see it jump right over the first page/URL without ever navigating to it. What gives?

This is my code:

from scrapy import Spider
from scrapy.http import Request
from selenium import webdriver
from scrapy.selector import Selector
from time import sleep
from selenium.common.exceptions import NoSuchElementException
from scrapy.loader import ItemLoader
from films_locarno.items import FilmsLocarnoItem

class FilmsLocarnoSpiderSpider(Spider):
    name = 'films_locarno_spider'
    allowed_domains = ['https://pardo.ch/']
    start_urls = ['https://pardo.ch/pardo/program/archive/2017/catalog-films.html']

    def start_requests(self):
        self.driver = webdriver.Firefox()
        self.driver.get('https://pardo.ch/pardo/program/archive/2017/catalog-films.html')
        sel = Selector(text=self.driver.page_source)

        #grab list of start pages for all 4/5 editions of festival available
        #list of film page urls on start page (letter A)
        film_page_urls_startpage = sel.xpath('//article[@class="strip-list_link_all strip-list strip--color row row--5"]/a/@href').extract()
        film_page_urls_startpage_full = []
        for url in film_page_urls_startpage:
            film_page_fullurl = "https://pardo.ch" + url
            film_page_urls_startpage_full.append(film_page_fullurl)

        #navigate to startpage film_pages
        for url3 in film_page_urls_startpage_full:
            self.driver.get(url3)
            sel = Selector(text=self.driver.page_source)
            self.logger.info('Sleeping for 1 second')
            sleep(1)
            yield Request(url3, callback=self.parse_filmpage)
            self.logger.info('Sleeping for 2 seconds')
            sleep(2)

My output log reads (you can ignore the ERROR; it's only a page navigation error, since fixed):

2017-12-26 09:29:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: films_locarno)
2017-12-26 09:29:33 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['films_locarno.spiders'], 'BOT_NAME': 'films_locarno', 'NEWSPIDER_MODULE': 'films_locarno.spiders', 'FEED_URI': 'films_locarno6.csv', 'FEED_FORMAT': 'csv'}
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline']
2017-12-26 09:29:33 [scrapy.core.engine] INFO: Spider opened
2017-12-26 09:29:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-26 09:29:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-12-26 09:29:34 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session {"capabilities": {"firstMatch": [], "alwaysMatch": {"browserName": "firefox", "acceptInsecureCerts": true}}, "desiredCapabilities": {"browserName": "firefox", "acceptInsecureCerts": true}}
2017-12-26 09:29:41 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:41 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/catalog-films.html"}
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70"}
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:56 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:29:57 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:29:59 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70"}
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:03 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:04 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:06 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=968681&eid=70"}
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:09 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:10 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:12 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=959475&eid=70"}
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:14 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:15 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:17 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960897&eid=70"}
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:19 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:20 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:22 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960706&eid=70"}
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:25 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:26 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:28 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=929220&eid=70"}
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:32 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:33 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:35 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960742&eid=70"}
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:38 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-26 09:30:39 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:41 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960703&eid=70"}
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:44 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:45 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:47 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=963699&eid=70"}
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:50 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70> (referer: None)
2017-12-26 09:30:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70> (referer: None)
2017-12-26 09:30:51 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:54 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=964462&eid=70"}
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:58 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=968681&eid=70> (referer: None)
2017-12-26 09:30:59 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:02 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:05 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:31:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch<a href=\"?finit=B\" class=\"dd__list__link\">B</a>"}
2017-12-26 09:31:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:31:07 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/Users/MNK1/Desktop/films_locarno/films_locarno/spiders/films_locarno_spider.py", line 48, in start_requests
    self.driver.get(films_list_page)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 268, in get
    self.execute(Command.GET, {'url': url})
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Malformed URL: https://pardo.ch<a href="?finit=B" class="dd__list__link">B</a> is not a valid URL.

2017-12-26 09:31:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=959475&eid=70> (referer: None)
2017-12-26 09:31:07 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:10 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296> referred in <None>
2017-12-26 09:31:10 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296> referred in <None>
2017-12-26 09:31:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=960897&eid=70> (referer: None)
2017-12-26 09:31:10 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:13 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F430%2FOC973705_P3001_240430.jpg&w=539&h=296> referred in <None>
2017-12-26 09:31:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=960706&eid=70> (referer: None)
2017-12-26 09:31:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70>
{'color': ['Color'],
 'country': ['Pakistan, USA'],
 'director': [''],
 'festival_edition': ['70th'],
 'festival_year': ['2017'],
 'film_year': ['2015'],
 'format_': ['DCP'],
 'image_urls': ['https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296'],
 'images': [{'checksum': '89dd9751e436eed7ae35f980c2e10bc3',
             'path': 'full/53cb39b642dcd6cea1e7898c9dc4777b844ea4fd.jpg',
             'url': 'https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296'}],
 'language': ['Urdu'],
 'length': ["40'"],
 'program': ['Open Doors: Screenings'],
 'title': ['A Girl in the River: The Price of Forgiveness']}
2017-12-26 09:31:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70>
{'color': ['Color'],
 'country': ['Switzerland'],
 'director': [''],
 'festival_edition': ['70th'],
 'festival_year': ['2017'],
 'film_year': ['2017'],
 'format_': ['DCP'],
 'image_urls': ['https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296'],
 'images': [{'checksum': 'cce5e9ffd3bad2b359c489ac4c51c25e',
             'path': 'full/84e0d100fc90acf2c0cfe8c38454a305e23b7408.jpg',
             'url': 'https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296'}],

[[edited for length]]


2017-12-26 09:31:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3038,
 'downloader/request_count': 11,
 'downloader/request_method_count/GET': 11,
 'downloader/response_bytes': 115519,
 'downloader/response_count': 11,
 'downloader/response_status_count/200': 11,
 'file_count': 11,
 'file_status_count/uptodate': 11,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 12, 26, 17, 31, 35, 820684),
 'item_scraped_count': 11,
 'log_count/DEBUG': 86,
 'log_count/ERROR': 1,
 'log_count/INFO': 43,
 'memusage/max': 79556608,
 'memusage/startup': 66007040,
 'response_received_count': 11,
 'scheduler/dequeued': 11,
 'scheduler/dequeued/memory': 11,
 'scheduler/enqueued': 11,
 'scheduler/enqueued/memory': 11,
 'start_time': datetime.datetime(2017, 12, 26, 17, 29, 33, 860768)}
2017-12-26 09:31:35 [scrapy.core.engine] INFO: Spider closed (finished)

I checked this:

len(film_page_urls_startpage)

and I get only 11, not 23.

If I use xpath('//article/a/@href') then I get 23 URLs.

There is no need to add @class, because there is no other article element on the page.


EDIT:

If I do

for item in sel.xpath('//article/@class').extract():
    print('class:', item)

then I get

class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even

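So some items have an extra "even" in the class string, and this was your problem. The predicate @class="..." compares the whole attribute string, so it matches only the 11 articles whose class has no trailing "even" and skips the other 12.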
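The mismatch can be reproduced without Scrapy or the live site. The following is only an illustrative sketch: it uses Python's standard xml.etree.ElementTree as a stand-in for Scrapy's Selector, and the page content, the build_sample_page helper, and the href values are made up to mirror the class listing above (23 articles, every other one carrying "even").

```python
import xml.etree.ElementTree as ET

# Class string copied from the selector in the question.
BASE_CLASS = "strip-list_link_all strip-list strip--color row row--5"


def build_sample_page(n_articles=23):
    """Build a made-up page in which every other <article> also carries
    the 'even' class, mirroring the class listing above."""
    parts = ["<html><body>"]
    for i in range(n_articles):
        cls = BASE_CLASS + (" even" if i % 2 == 0 else "")
        parts.append(
            '<article class="{}"><a href="/film.html?fid={}"></a></article>'.format(cls, i)
        )
    parts.append("</body></html>")
    return ET.fromstring("".join(parts))


root = build_sample_page()

# Exact comparison of the whole attribute: articles with the extra
# 'even' token do not match.
exact_links = root.findall(".//article[@class='{}']/a".format(BASE_CLASS))

# No class predicate at all, as suggested above: every article matches.
all_links = root.findall(".//article/a")

# Token-based check: match on one class name regardless of the others.
token_links = [
    article.find("a")
    for article in root.iter("article")
    if "strip-list" in article.get("class", "").split()
]

print(len(exact_links), len(all_links), len(token_links))  # 11 23 23
```

In the spider itself, the corresponding change would be the //article/a/@href expression given above. If a class filter is still wanted, XPath's contains(@class, "strip-list") is one option, with the caveat that it is a substring test rather than a whole-token test.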
