
Using strings scraped from a page to generate a list of start_urls for further scraping with Scrapy

please help,

I have collected a whole lot of strings from a real estate website's search results page that correspond to property IDs. The site uses property IDs to name the pages containing information about individual properties that I want to collect.

How can I get my list of urls created by my first spider into the start_urls of another spider?

Thanks - I'm new.

There's no need to have two spiders. A spider can yield a scrapy.http.Request object with a custom callback, to allow additional pages to be scraped based on the values parsed from an initial set of pages.

Let's look at an example:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class SearchSpider(BaseSpider):
    ...
    start_urls = ['http://example.com/list_of_links.html']
    ...

    # Assume this is your "first" spider's parse method.
    # It parses your initial search results page and generates a
    # list of URLs somehow.
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # For example purposes we just take every link.
        # extract() returns a list of strings, so each href is
        # already a full URL string.
        for href in hxs.select('//a/@href').extract():
            yield Request(href, callback=self.parse_search_url)

    def parse_search_url(self, response):
        # Here is where you would put what you were thinking of as your
        # "second" spider's parse method. It operates on the results of the
        # URLs scraped in the first parse method.
        pass

As you can see in this example, the SearchSpider.parse method parses the "search results page" (or whatever) and yields a Request for each URL it finds. So instead of writing those URLs to a file and trying to use them as the start_urls for a second spider, simply yield them with the callback set to another method in the same spider (here: parse_search_url).
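For reference, BaseSpider and HtmlXPathSelector come from an older Scrapy API. Here is a minimal sketch of the same pattern in more recent Scrapy (assuming roughly version 1.4 or later, where scrapy.Spider and response.xpath / response.follow replace the classes above); the spider name and URL are placeholders:

import scrapy


class SearchSpider(scrapy.Spider):
    name = "search"  # placeholder name
    start_urls = ["http://example.com/list_of_links.html"]

    def parse(self, response):
        # response.follow resolves relative hrefs against response.url
        # and yields a Request with the given callback.
        for href in response.xpath("//a/@href").getall():
            yield response.follow(href, callback=self.parse_search_url)

    def parse_search_url(self, response):
        # Process each individual property page here.
        pass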

Hope this helps.

As a fellow noob, I understand that it can be difficult to get your head around yield in Scrapy. If you can't get the method @audiodude details above working (which is the better way to scrape, for a number of reasons), a 'workaround' I have used is to produce my URLs (in LibreOffice or Excel) by employing the CONCATENATE function to add the correct punctuation to each line, then simply copy and paste them into my spider, e.g.

start_urls = [
  "http://example.com/link1",
  "http://example.com/link2",
  "http://example.com/link3",
  "http://example.com/link4",
  "http://example.com/link5",
  "http://example.com/link6",
  "http://example.com/link7",
  "http://example.com/link8",
  "http://example.com/link9"
  ]

Note that you need a comma after every line (except the last one) and each link must be enclosed in straight quotation marks. It's a pain working with quotation marks in CONCATENATE, so to produce the desired result, in a cell adjacent to your URL type =CONCATENATE(CHAR(34),A2,CHAR(34),",") , assuming that your URL is in cell A2.
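If you'd rather avoid the spreadsheet quoting entirely, a minimal sketch of an alternative is to have the spider read the URLs itself. This assumes you save your scraped URLs, one per line, to a plain-text file (here hypothetically named urls.txt); the spider name is also a placeholder:

import scrapy


class PropertySpider(scrapy.Spider):
    name = "properties"  # hypothetical spider name

    def start_requests(self):
        # urls.txt is a file you create yourself: one URL per line.
        # Blank lines are skipped.
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Parse each individual property page here.
        pass

This way the URL list needs no commas or quotation marks at all, and you can regenerate urls.txt without touching the spider code.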

Good luck.
