Pass variable to test.py in spider folder using scrapy

I'm using Scrapy. The following is the code for test.py in the spider folder.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://seattle.craigslist.org/npo/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        items = []
        for title in titles:
            item = CraigslistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items
Essentially, I want to iterate over my list of URLs and pass each URL into the MySpider class as start_urls. Could anyone give me a suggestion on how to do this?
Instead of having a "statically defined" start_urls, you need to override the start_requests() method:
from scrapy.http import Request

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]

    def start_requests(self):
        list_of_urls = [...]  # reading urls from a text file, for example
        for url in list_of_urls:
            yield Request(url)

    def parse(self, response):
        ...
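For completeness, here is a minimal, framework-free sketch of the same pattern: read one URL per line from a file-like object and yield a request object for each. The stand-in Request class and the comment-skipping convention are assumptions for illustration only; in a real spider you would yield scrapy.http.Request objects from start_requests() exactly as in the answer above.

```python
import io

class Request(object):
    """Stand-in for scrapy.http.Request, just to make the sketch runnable."""
    def __init__(self, url):
        self.url = url

def load_urls(fileobj):
    """Yield one URL per line, skipping blank lines and '#' comments."""
    for line in fileobj:
        line = line.strip()
        if line and not line.startswith("#"):
            yield line

def start_requests(fileobj):
    """Mimics the overridden start_requests(): one Request per URL."""
    for url in load_urls(fileobj):
        yield Request(url)

# In-memory stand-in for a urls.txt file (a real spider would use open("urls.txt")):
sample = io.StringIO(
    u"http://seattle.craigslist.org/npo/\n"
    u"\n"
    u"# a comment line\n"
    u"http://example.com/\n"
)
requests = list(start_requests(sample))
```

Because start_requests() is a generator, Scrapy will consume the requests lazily, so this scales to arbitrarily long URL lists without loading everything into memory at once.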