
Web data scraping using Scrapy

I am using Scrapy to scrape justdial.com, but the code doesn't seem to work. Please help me fix it. I run it from the terminal with the command "scrapy crawl justdial -o items.csv -t csv".

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from justdial_sample.items import JustdialSampleItem

class MySpider(CrawlSpider):
    name = "justdial"
    allowed_domains = ["www.justdial.com"]
    start_urls = ["https://www.justdial.com/"]

    rules = (Rule (SgmlLinkExtractor(allow=("index\d00\.html",
    ),restrict_xpaths=('//p[@id="nextpage"]',))
    , callback="parse_items", follow=True),
    )  

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for titles in titles:
            item = JdItem()
            item ["title"] = titles.select("a/text()").extract()
            item ["link"] = titles.select("@/href").extract()
            item.append(item)
        return items

This is the code that I used.

If you can show the output log, it will be easier to help you.

Usually you should send the same headers a browser does, e.g. User-Agent. You can inspect the browser's headers in Firebug, and you can print the headers your spider actually sent from inside a callback:

print(response.request.headers)

UPDATE: You should add this line to settings.py:

USER_AGENT = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/53.0'
