I'm new to programming in Python and working with Scrapy. I am crawling a web page and then saving the collected items to MongoDB. I am getting an error during the crawl. I have read similar questions on this site and even followed a tutorial from beginning to end, to no avail; any help would be appreciated.
This is the error I'm getting in the terminal: Spider error processing
Here is my code:
from scrapy.item import Item, Field

# class 1
class StackItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pagetitle = Field()
    newsmain = Field()
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem

# class 2
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["docs.python.org"]
    start_urls = ["https://docs.python.org/2/howto/curses.html"]

    def parse(self, response):
        information = Selector(response.body).xpath('//div[@class="section"]')
        for data in information:
            item = StackItem()
            item['pagetitle'] = data.information('//*[@id="curses-programming-with-python"]').extract()
            item['newsmain'] = data.information('//*[@id="what-is-curses"]').extract()
            yield item
scrapy.selector.Selector.__init__() expects a Response object as its first argument. If you want to build a selector from an HTTP response body, use the text= argument:
$ scrapy shell https://docs.python.org/2/howto/curses.html
2016-11-21 11:05:34 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-21 11:05:35 [scrapy] INFO: Spider opened
2016-11-21 11:05:35 [scrapy] DEBUG: Crawled (200) <GET https://docs.python.org/2/howto/curses.html> (referer: None)
(...)
>>>
>>> #
>>> # passing response.body (bytes) instead of a Response object fails
>>> #
>>> scrapy.Selector(response.body)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/selector/unified.py", line 67, in __init__
text = response.text
AttributeError: 'str' object has no attribute 'text'
>>>
>>> #
>>> # use text= argument to pass response body
>>> #
>>> scrapy.Selector(text=response.body)
<Selector xpath=None data=u'<html xmlns="http://www.w3.org/1999/xhtm'>
>>>
>>> scrapy.Selector(text=response.body).xpath('//div[@class="section"]')
[<Selector xpath='//div[@class="section"]' data=u'<div class="section" id="curses-programm'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="what-is-curses"'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="the-python-curs'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="starting-and-en'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="windows-and-pad'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="displaying-text'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="attributes-and-'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="user-input">\n<h'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="for-more-inform'>]
>>>
An easier way is to pass the response object directly:
>>> scrapy.Selector(response).xpath('//div[@class="section"]')
[<Selector xpath='//div[@class="section"]' data=u'<div class="section" id="curses-programm'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="what-is-curses"'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="the-python-curs'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="starting-and-en'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="windows-and-pad'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="displaying-text'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="attributes-and-'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="user-input">\n<h'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="for-more-inform'>]
And an even easier way is to use the .xpath() method on the response instance (a convenience method that creates a Selector for you), provided your response is an HtmlResponse or XmlResponse (which is usually the case for web scraping):
>>> response.xpath('//div[@class="section"]')
[<Selector xpath='//div[@class="section"]' data=u'<div class="section" id="curses-programm'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="what-is-curses"'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="the-python-curs'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="starting-and-en'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="windows-and-pad'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="displaying-text'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="attributes-and-'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="user-input">\n<h'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="for-more-inform'>]