
Stuck Scraping Multiple Domains sequentially - Python Scrapy

I am fairly new to Python as well as web scraping. My first project is scraping the transportation section of random Craigslist cities (5 cities total, e.g. https://dallas.craigslist.org), but I am stuck having to run the script manually for each city after updating that city's domain in the constants (start_urls = and absolute_next_url =) in the script. Is there any way I can adjust the script to run sequentially through the cities I have defined (i.e. miami, new york, houston, chicago, etc.) and auto-populate those constants (start_urls = and absolute_next_url =) for each respective city?

Also, is there a way to adjust the script to output each city into its own .csv (i.e. miami.csv, houston.csv, chicago.csv, etc.)?

Thank you in advance

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['https://dallas.craigslist.org/d/transportation/search/trp']

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')

        for job in jobs:
            listing_title = job.xpath('a/text()').extract_first()
            city = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            job_posting_date = job.xpath('time/@datetime').extract_first()
            job_posting_url = job.xpath('a/@href').extract_first()
            data_id = job.xpath('a/@data-id').extract_first()


            yield Request(job_posting_url, callback=self.parse_page, meta={'job_posting_url': job_posting_url, 'listing_title': listing_title, 'city':city, 'job_posting_date':job_posting_date, 'data_id':data_id})

        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        absolute_next_url = "https://dallas.craigslist.org" + relative_next_url

        yield Request(absolute_next_url, callback=self.parse)

    def parse_page(self, response):
        job_posting_url = response.meta.get('job_posting_url')
        listing_title = response.meta.get('listing_title')
        city = response.meta.get('city')
        job_posting_date = response.meta.get('job_posting_date')
        data_id = response.meta.get('data_id')

        description = "".join(line for line in response.xpath('//*[@id="postingbody"]/text()').extract()).strip()

        compensation = response.xpath('//p[@class="attrgroup"]/span[1]/b/text()').extract_first()
        employment_type = response.xpath('//p[@class="attrgroup"]/span[2]/b/text()').extract_first()
        latitude = response.xpath('//div/@data-latitude').extract_first()
        longitude = response.xpath('//div/@data-longitude').extract_first()
        posting_id = response.xpath('//p[@class="postinginfo"]/text()').extract()


        yield {'job_posting_url': job_posting_url,
               'data_id': data_id,
               'listing_title': listing_title,
               'city': city,
               'description': description,
               'compensation': compensation,
               'employment_type': employment_type,
               'latitude': latitude,
               'longitude': longitude,
               'job_posting_date': job_posting_date,
               'posting_id': posting_id}

There might be a cleaner way, but check out https://docs.scrapy.org/en/latest/topics/practices.html?highlight=multiple%20spiders — you can basically combine multiple instances of your spider in one process, so you can have a separate 'class' (or spider argument) for each city. There are probably some ways to consolidate the code so it's not all repeated.
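As a rough sketch of that idea, assuming the spider is refactored to take a `city` argument (the argument name, the city list, and the `LOG_LEVEL` setting are my own choices, not part of your script), something like this runs all cities from one script. Note that `CrawlerProcess` runs the queued crawls in the same process rather than strictly one after another; the linked practices page also shows how to chain them sequentially with `CrawlerRunner` if the order matters.

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]

    def __init__(self, city="dallas", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.city = city
        # start_urls is built from the city argument instead of being hardcoded.
        self.start_urls = [
            f"https://{city}.craigslist.org/d/transportation/search/trp"
        ]

    # parse() and parse_page() stay the same as in the question, except the
    # next-page link should be built with response.urljoin(relative_next_url)
    # so the domain is never hardcoded either.


if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    # Queue one crawl per city (subdomain names here are assumptions).
    for city in ["dallas", "miami", "newyork", "houston", "chicago"]:
        process.crawl(JobsSpider, city=city)
    process.start()  # blocks until all crawls are finished
```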

As for writing to CSV, are you doing that via the command line right now? I'd add the code to the spider itself: https://realpython.com/python-csv/
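One way to do that inside the project is an item pipeline that uses Python's csv module; this is only a sketch, and the class name, file-naming scheme, and the assumption that the spider carries a `city` attribute (as in the sketch above) are mine, not part of your script:

```python
import csv


class PerCityCsvPipeline:
    """Hypothetical pipeline: one CSV file per spider run, named after the city."""

    def open_spider(self, spider):
        # Fall back to the spider name if no `city` attribute is set.
        city = getattr(spider, "city", spider.name)
        self.file = open(f"{city}.csv", "w", newline="", encoding="utf-8")
        self.writer = None

    def process_item(self, item, spider):
        if self.writer is None:
            # Use the first item's keys as the CSV header row.
            self.writer = csv.DictWriter(self.file, fieldnames=list(item.keys()))
            self.writer.writeheader()
        self.writer.writerow(item)
        return item

    def close_spider(self, spider):
        self.file.close()
```

You would enable it in settings.py with something like `ITEM_PIPELINES = {"yourproject.pipelines.PerCityCsvPipeline": 300}` (the module path is a placeholder). If I recall the feed export docs correctly, newer Scrapy versions can also interpolate spider attributes into the feed URI, e.g. `FEEDS = {"%(city)s.csv": {"format": "csv"}}`, which avoids a custom pipeline entirely.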
