Generating a start_urls list from strings scraped from a page, for further crawling with Scrapy

Question

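Please help,

I have collected a large number of strings from a real-estate site's search results pages; each string corresponds to a property ID. The site uses the property ID to name the page that holds the information about each individual property I want to collect.

How do I add the list of URLs created by my first spider to another spider's start_urls?

Thanks. I'm new to this.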

Answer 1

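There is no need for two spiders. A spider can yield a scrapy.http.Request object with a custom callback, which lets it scrape additional pages based on values parsed from an initial set of pages.

Let's look at an example: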

from scrapy import Spider
from scrapy.http import Request


class SearchSpider(Spider):
    ...
    # Scrapy needs a full URL here, including the scheme
    start_urls = ['http://example.com/list_of_links.html']
    ...

    # Assume this is your "first" spider's parse method.
    # It parses your initial search results page and generates a
    # list of URLs somehow.
    def parse(self, response):
        # For example purposes we just take every link
        for href in response.xpath('//a/@href').getall():
            # href may be relative, so resolve it against the page URL
            yield Request(response.urljoin(href), callback=self.parse_search_url)

    def parse_search_url(self, response):
        # Here is where you would put what you were thinking of as your
        # "second" spider's parse method. It operates on the results of the
        # URLs scraped in the first parse method.
        pass

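As you can see in this example, the SearchSpider.parse method parses the "search results page" (or whatever it is) and yields a request for each URL it finds. So instead of writing those URLs to a file and trying to use them as start_urls for a second spider, you set the callback to another method in the same spider (here: parse_search_url).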
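Since the question starts from property IDs rather than ready-made links, here is a minimal sketch of turning those IDs into page URLs before yielding requests for them. The base URL and the "property/<id>.html" path pattern are assumptions for illustration only; the real site's URL layout is not given in the question, and build_property_urls is a hypothetical helper, not part of Scrapy.

```python
from urllib.parse import urljoin

# Hypothetical base URL; replace with the real site's address.
BASE_URL = "http://example.com/"


def build_property_urls(property_ids, base_url=BASE_URL):
    """Turn scraped property-ID strings into absolute detail-page URLs.

    Blank entries and repeated IDs are skipped; order is preserved.
    """
    urls = []
    seen = set()
    for raw_id in property_ids:
        property_id = raw_id.strip()
        if not property_id or property_id in seen:
            continue
        seen.add(property_id)
        # Assumed path pattern: property/<id>.html
        urls.append(urljoin(base_url, "property/" + property_id + ".html"))
    return urls
```

Inside parse, each URL returned by this helper could then be passed to Request with callback=self.parse_search_url, exactly as the hrefs are in the example above.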

Hope this helps.

Answer 2

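As a researcher, I understand that yield can be hard to get your head around in Scrapy. If you can't get the method @audiodude details above to work (it is the better approach, for a number of reasons), a 'workaround' I have used is to generate my URLs in LibreOffice or Excel, using the Concatenate function to add the correct punctuation to each line. Then simply copy and paste them into your spider, e.g.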

start_urls = [
  "http://example.com/link1",
  "http://example.com/link2",
  "http://example.com/link3",
  "http://example.com/link4",
  "http://example.com/link5",
  "http://example.com/link6",
  "http://example.com/link7",
  "http://example.com/link8",
  "http://example.com/link9"
  ]

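Note that you need a comma after every line (except the last), and each link must be enclosed in straight quotes. Quotation marks are a pain to work with in Concatenate, so to produce the desired result in the cell adjacent to your URL, type =Concatenate(CHAR(34),A2,CHAR(34),",") assuming your URL is in cell A2.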
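If a spreadsheet is not to hand, the same quoting and punctuation can be produced in Python. This is only a rough sketch of the workaround, assuming the URLs are available as lines of text (for example, read from a file with one URL per line); load_start_urls and format_start_urls are hypothetical helper names.

```python
def load_start_urls(lines):
    """Return the non-empty lines, stripped of whitespace, as a list of URLs."""
    return [line.strip() for line in lines if line.strip()]


def format_start_urls(urls):
    """Render the URLs as a start_urls literal that can be pasted into a spider."""
    # Every entry gets straight double quotes and a trailing comma.
    body = "\n".join('  "{}",'.format(url) for url in urls)
    return "start_urls = [\n" + body + "\n  ]"
```

Python accepts a comma after the last list item, so the sketch adds one to every line rather than special-casing the final entry. The list returned by load_start_urls could also be assigned to start_urls directly, with no copying and pasting.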

Good luck.
