Pass variable to test.py in spider folder using scrapy
I am using Scrapy. Below is the code of test.py in the spiders folder.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem


class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://seattle.craigslist.org/npo/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        items = []
        # Use a distinct loop variable so the list is not shadowed
        for title in titles:
            item = CraigslistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items
Essentially, I want to loop over my list of URLs and pass each URL into the MySpider class as start_urls. Can anyone advise me on how to do this?
Instead of defining start_urls statically, you need to override the start_requests() method:
from scrapy.http import Request
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]

    def start_requests(self):
        list_of_urls = [...]  # reading urls from a text file, for example
        for url in list_of_urls:
            yield Request(url)

    def parse(self, response):
        ...
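The answer leaves the URL list elided and only suggests reading it from a text file. As a minimal sketch of that step, assuming one URL per line, the helper below loads such a file using only the standard library. The name load_urls and the rule of skipping blank lines and lines starting with "#" are assumptions made for illustration and are not part of the original answer.

```python
from pathlib import Path


def load_urls(path):
    """Return the URLs listed in a text file, one per line.

    Hypothetical helper: blank lines and lines starting with '#'
    are skipped, and surrounding whitespace is stripped.
    """
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```

Inside start_requests(), the loop could then iterate over load_urls("urls.txt") instead of a hard-coded list, where urls.txt is a hypothetical file name. Reading the file inside start_requests() rather than at import time means it is only opened when the crawl actually starts.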