
Scrapy Spider error processing

I'm new to programming in Python and to working with Scrapy. I am crawling a web page and then saving the collected data to MongoDB, but the crawl fails with an error. I have read similar questions on this site and even followed a tutorial from beginning to end, to no avail. Any help will be appreciated.

This is the error I'm getting in the terminal: Spider error processing

Here is my code:

from scrapy.item import Item, Field

#class 1
class StackItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pagetitle = Field()
    newsmain = Field()
    pass

from scrapy import Spider
from scrapy.selector import Selector
from stack.items import StackItem

#class 2
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["docs.python.org"]
    start_urls = ["https://docs.python.org/2/howto/curses.html",]

    def parse(self, response):
        information = Selector(response.body).xpath('//div[@class="section"]')

        for data in information:
            item = StackItem()
            item['pagetitle'] = data.information('//*[@id="curses-programming-with-python"]').extract()
            item['newsmain'] = data.information('//*[@id="what-is-curses"]').extract()

        yield item

scrapy.selector.Selector.__init__() expects a Response object as its first argument, but your code passes response.body, which is only the raw page body.

If you want to build a selector for an HTTP response body, use the text= argument:

$ scrapy shell https://docs.python.org/2/howto/curses.html
2016-11-21 11:05:34 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-21 11:05:35 [scrapy] INFO: Spider opened
2016-11-21 11:05:35 [scrapy] DEBUG: Crawled (200) <GET https://docs.python.org/2/howto/curses.html> (referer: None)
(...)
>>> 
>>> #
>>> # passing response.body (bytes) instead of a Response object fails
>>> #
>>> scrapy.Selector(response.body)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/selector/unified.py", line 67, in __init__
    text = response.text
AttributeError: 'str' object has no attribute 'text'
>>>
>>> #
>>> # use text= argument to pass response body
>>> #
>>> scrapy.Selector(text=response.body)
<Selector xpath=None data=u'<html xmlns="http://www.w3.org/1999/xhtm'>
>>>
>>> scrapy.Selector(text=response.body).xpath('//div[@class="section"]')
[<Selector xpath='//div[@class="section"]' data=u'<div class="section" id="curses-programm'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="what-is-curses"'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="the-python-curs'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="starting-and-en'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="windows-and-pad'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="displaying-text'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="attributes-and-'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="user-input">\n<h'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="for-more-inform'>]
>>> 

An easier way is to pass the response object directly:

>>> scrapy.Selector(response).xpath('//div[@class="section"]')
[<Selector xpath='//div[@class="section"]' data=u'<div class="section" id="curses-programm'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="what-is-curses"'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="the-python-curs'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="starting-and-en'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="windows-and-pad'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="displaying-text'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="attributes-and-'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="user-input">\n<h'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="for-more-inform'>]

An even easier way is to call the .xpath() method on the response itself. It is a convenience method that creates a Selector for you, and it is available when the response is an HtmlResponse or XmlResponse (which is usually the case for web scraping):

>>> response.xpath('//div[@class="section"]')
[<Selector xpath='//div[@class="section"]' data=u'<div class="section" id="curses-programm'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="what-is-curses"'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="the-python-curs'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="starting-and-en'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="windows-and-pad'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="displaying-text'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="attributes-and-'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="user-input">\n<h'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="for-more-inform'>]

