
Splitting variables in scrapy spider

Forgive me, I am a total programming noob.

I am trying to extract a record ID from a URL with the following code, and I'm running into trouble. If I run it through the shell it seems to work fine (no errors), but when I run it through Scrapy the framework generates errors.

Example:
if the URL is http://domain.com/path/to/record_id=1599
then record_link = /path/to/record_id=1599
therefore record_id should be 1599
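The splitting step itself can be checked in a plain Python session, with no Scrapy involved (the link below is just the example path from above):

```python
# Hypothetical record link, matching the example above.
record_link = "/path/to/record_id=1599"

# Splitting on '=' yields ['/path/to/record_id', '1599'];
# index 1 is the record ID.
record_id = record_link.strip().split("=")[1]
print(record_id)  # → 1599
```

This is exactly why the shell test appears to work: on a single string, the chain of calls is fine.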

   for site in sites:

      record_link = site.select('div[@class="description"]/h4/a/@href').extract()
      record_id = record_link.strip().split('=')[1]

      item['link'] = record_link
      item['id'] = record_id
      items.append(item)

Any help is greatly appreciated.

EDIT:

Scrapy errors with something like this:

   root@web01:/home/user/spiderdir/spiderdir/spiders# sudo scrapy crawl spider
   2012-02-23 09:47:16+1100 [scrapy] INFO: Scrapy 0.13.0.2839 started (bot: spider)
   2012-02-23 09:47:16+1100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
   2012-02-23 09:47:16+1100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
   2012-02-23 09:47:16+1100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
   2012-02-23 09:47:16+1100 [scrapy] DEBUG: Enabled item pipelines:
   2012-02-23 09:47:16+1100 [spider] INFO: Spider opened
   2012-02-23 09:47:16+1100 [spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
   2012-02-23 09:47:16+1100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6031
   2012-02-23 09:47:16+1100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6088
   2012-02-23 09:47:19+1100 [spider] DEBUG: Crawled (200) <GET http://www.domain.com/path/to/> (referer: None)
   2012-02-23 09:47:21+1100 [spider] DEBUG: Crawled (200) <GET http://www.domain.com/path/to/record_id=2> (referer: http://www.domain.com/path/to/)
   2012-02-23 09:47:21+1100 [spider] ERROR: Spider error processing <GET http://www.domain.com/path/to/record_id=2>
   Traceback (most recent call last):
      File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/lib/python2.6/dist-packages/twisted/internet/task.py", line 577, in _tick
        taskObj._oneWorkUnit()
      File "/usr/lib/python2.6/dist-packages/twisted/internet/task.py", line 458, in _oneWorkUnit
        result = self._iterator.next()
      File "/usr/lib/pymodules/python2.6/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/usr/lib/pymodules/python2.6/scrapy/utils/defer.py", line 96, in iter_errback
        yield it.next()
      File "/usr/lib/pymodules/python2.6/scrapy/contrib/spidermiddleware/offsite.py", line 24, in process_spider_output
        for x in result:
      File "/usr/lib/pymodules/python2.6/scrapy/contrib/spidermiddleware/referer.py", line 14, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/usr/lib/pymodules/python2.6/scrapy/contrib/spidermiddleware/urllength.py", line 32, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/usr/lib/pymodules/python2.6/scrapy/contrib/spidermiddleware/depth.py", line 56, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/usr/lib/pymodules/python2.6/scrapy/contrib/spiders/crawl.py", line 66, in _parse_response
        cb_res = callback(response, **cb_kwargs) or ()
      File "/home/nick/googledir/googledir/spiders/google_directory.py", line 36, in parse_main
        record_id = record_link.split("=")[1]
    exceptions.AttributeError: 'list' object has no attribute 'split'


Kind of a long shot since you didn't post your errors, but I'm guessing you will have to change this line:

record_id = record_link.strip().split('=')[1]

to

record_id = record_link[0].strip().split('=')[1]

since HtmlXPathSelector always returns a list of selected items.
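The `AttributeError` in the traceback can be reproduced without Scrapy at all: `extract()` hands back a plain Python list of matched strings, and a list has no string methods. A minimal sketch, using a made-up link:

```python
# What extract() returns: a list of matched strings, not a string.
record_link = ["/path/to/record_id=1599"]

# Calling string methods on the list itself raises AttributeError,
# just like the spider's traceback:
try:
    record_link.split("=")[1]
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'split'

# Indexing into the list first gives a string, and the split works:
record_id = record_link[0].strip().split("=")[1]
print(record_id)  # 1599
```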

I think what I'm after is something like this:

   for site in sites:
      record_link = site.select('div[@class="description"]/h4/a/@href').extract()
      record_id = [i.split('=')[1] for i in record_link]

      item['link'] = record_link
      item['id'] = record_id
      items.append(item)
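Since both fields stay as lists, the comprehension handles zero, one, or many matched links without special cases. For instance, with two hypothetical links:

```python
# Two hypothetical links, as extract() might return them.
record_link = ["/path/to/record_id=1599", "/path/to/record_id=1600"]

# One ID per link, in the same order.
record_id = [i.split("=")[1] for i in record_link]
print(record_id)  # ['1599', '1600']
```

An empty result from `extract()` simply yields an empty `record_id` list instead of raising an IndexError.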
