
Scraping Multiple Websites with a Single Spider using Scrapy

I am using Scrapy to scrape data from this website. The following is the code for the spider.

class StackItem(scrapy.Item):
    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = scrapy.Field()
        self._values[key] = value

class betaSpider(CrawlSpider):
    name = "betaSpider"

    def __init__(self, *args, **kwargs):
        super(betaSpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    rules = (
        Rule(LinkExtractor(unique=True, allow=(r'.*\?id1=.*',),
                           restrict_xpaths=('//a[@class="prevNext next"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        # HtmlXPathSelector is deprecated; use response.xpath directly
        posts = response.xpath("//article[@class='classified']")

        for post in posts:
            item = StackItem()
            item["job_role"] = post.xpath("div[@class='uu mb2px']/a/strong/text()").extract()
            item["company"] = post.xpath("p[1]/text()").extract()
            item["location"] = post.xpath("p[@class='mb5px b red']/text()").extract()
            item["desc"] = post.xpath("details[@class='aj mb10px']/text()").extract()
            item["read_more"] = post.xpath("div[@class='uu mb2px']/a/@href").extract()
            # yield each item once; the original nested loop re-yielded
            # every earlier item on each iteration, producing duplicates
            yield item

This is the code for the item pipeline:

class myExporter(object):

    def __init__(self):
        # Python 3: open in text mode with newline='' ('wb' would raise
        # a TypeError, since csv.writer writes str rather than bytes)
        self.myCSV = csv.writer(open('out.csv', 'w', newline=''))
        self.myCSV.writerow(['Job Role', 'Company', 'Location', 'Description', 'Read More'])

    def process_item(self, item, spider):
        self.myCSV.writerow([item['job_role'], item['company'], item['location'],
                             item['desc'], item['read_more']])
        return item
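One pitfall in the exporter: under Python 3, `csv.writer` writes `str`, so the file must be opened in text mode with `newline=''`; opening it with `'wb'` only works under Python 2. A minimal, self-contained sketch of that pattern (the file path and row values here are hypothetical, chosen only for the demo):

```python
import csv
import os
import tempfile

# Demo of the CSV pattern the pipeline relies on: text mode plus
# newline='' avoids blank lines on Windows and TypeErrors on Python 3.
path = os.path.join(tempfile.gettempdir(), 'out_demo.csv')

with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Job Role', 'Company', 'Location', 'Description', 'Read More'])
    writer.writerow(['Engineer', 'Acme', 'Pune', 'Builds spiders', 'http://example.com'])

with open(path, newline='') as f:
    rows = list(csv.reader(f))
```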

This is working fine. Now, I have to scrape the following websites (for example) using the same spider.

  1. http://www.freejobalert.com/government-jobs/
  2. https://www.sarkariexaam.com/

I have to scrape all of the tags from the websites mentioned above and store them in a CSV file using item pipelines.

Actually, the list of websites to be scraped is endless. In this project, the user will enter a URL and the scraped results will be returned to that user. So, I want a generic spider that can scrape any website.

For a single website, it is working fine. But how can this be accomplished for multiple sites with different structures? Is Scrapy enough to solve this?

It would be better to use a different spider for each site. Instead of the typical way of running a Scrapy crawl, you can run Scrapy from a script using its API. Keep in mind that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor.

I think you have to create a generic spider to scrape data from different websites. That can be done by adding websites one by one to the spider and generalizing the code.

The code will become very large if the websites are entirely different. Given your requirements, you can generalize the code and build a spider that returns the above details for any website you give it.
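One common way to generalize without writing a separate spider per site is to key a dictionary of XPath selectors by domain and look the right set up at parse time. In the sketch below, the freejobalert selectors are copied from the question, while the sarkariexaam entries are hypothetical placeholders:

```python
from urllib.parse import urlparse

# Per-site selector configuration. Each entry maps a domain to the XPath
# expressions the spider should use for that site.
SITE_SELECTORS = {
    'www.freejobalert.com': {
        'post': "//article[@class='classified']",
        'job_role': "div[@class='uu mb2px']/a/strong/text()",
    },
    'www.sarkariexaam.com': {
        'post': "//div[@class='job-listing']",   # placeholder
        'job_role': ".//h2/a/text()",            # placeholder
    },
}


def selectors_for(url):
    """Return the selector set for a URL's domain, or None if unknown."""
    return SITE_SELECTORS.get(urlparse(url).netloc)
```

Inside `parse_items`, the spider would then call `sel = selectors_for(response.url)` and use `response.xpath(sel['post'])` and so on, falling back or raising when a domain has no configuration.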
