Saving data to separate CSV files in Scrapy
I built a scraper for Yellow Pages. The script reads a categories.txt file and then generates links from those categories:
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
categories = settings.get('CATEGORIES')
links = []
for category in categories:
    link = 'https://www.yellowpages.com/search?search_terms=' + category + '&geo_location_terms=NY&page=1'
    links.append(link)
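A category that contains spaces or characters such as `&` would produce a malformed query string when concatenated directly. The following is a minimal sketch of building the same URLs with the standard library's `urlencode`; the helper name `build_search_url` and the sample categories are assumptions made for illustration, not part of the original script.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.yellowpages.com/search'


def build_search_url(category, location='NY', page=1):
    """Return a search URL with every query parameter percent-encoded."""
    query = urlencode({
        'search_terms': category,
        'geo_location_terms': location,
        'page': page,
    })
    return f'{BASE_URL}?{query}'


# Hypothetical sample categories; the real ones come from the CATEGORIES setting.
links = [build_search_url(c) for c in ['pizza', 'auto repair']]
```

For a single-word category the result is identical to the concatenated URL; only categories that need escaping differ.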
This list of links is then passed to start_urls:
class YpSpider(CrawlSpider):
    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    name = 'yp'
    allowed_domains = ['yellowpages.com']
    start_urls = links
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="business-name"]', allow=''), callback='parse_item',
             follow=True),
        Rule(LinkExtractor(restrict_xpaths='//a[@class="next ajax-page"]', allow=''),
             follow=True),
    )
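It saves the data from all the links in a single CSV file named parent.csv. This parent.csv file will have a column named keyword, which makes it possible to separate the data by category and write one CSV file per category. This is done in the spider_closed function: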
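    def spider_closed(self, spider):
        # newline='' is what the csv module expects for files it reads or writes
        with open('parent.csv', 'r', newline='') as file:
            reader = csv.reader(file)
            for row in reader:
                # row[0] is the keyword column; append the row to that category's file
                with open('{}.csv'.format(row[0]), 'a', newline='') as f:
                    writer = csv.writer(f)
                    writer.writerow(row)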
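The splitting step can also be written as a standalone function that opens each category file once instead of once per row. This is only a sketch under a few assumptions: the keyword is the first column (as in the code above), parent.csv starts with a header row, and the name `split_by_first_column` and its parameters are made up for illustration. Unlike the append mode used above, it overwrites the per-category files on each run.

```python
import csv
import os
from contextlib import ExitStack


def split_by_first_column(parent_path, out_dir='.', has_header=True):
    """Write each row of parent_path to <out_dir>/<first column value>.csv."""
    writers = {}
    # ExitStack closes every per-category file when the block exits.
    with ExitStack() as stack, open(parent_path, newline='') as src:
        reader = csv.reader(src)
        header = next(reader, None) if has_header else None
        for row in reader:
            if not row:
                continue  # skip blank lines
            key = row[0]
            if key not in writers:
                path = os.path.join(out_dir, f'{key}.csv')
                out = stack.enter_context(open(path, 'w', newline=''))
                writers[key] = csv.writer(out)
                if header:
                    writers[key].writerow(header)  # repeat the header in each file
            writers[key].writerow(row)
    return sorted(writers)
```

Calling `split_by_first_column('parent.csv')` from `spider_closed` would replace the nested `open` calls; the function returns the sorted list of categories it wrote.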
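The problem I am facing is getting the category name that corresponds to each link inside my parse method, so that it can be saved in parent.csv and later used to separate the categories: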
    def parse_item(self, response):
        item = YellowItem()
        item['keyword'] = ...  # here I need the corresponding category for every link
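I think you should change the way you generate the links. For example, you can override the start_requests method and pass the category to each request through its cb_kwargs or meta attribute. I would also suggest changing the implementation so that the settings are read from the crawler that runs the spider, by overriding from_crawler. Here is how I would do it: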
from scrapy import Request, Spider


class YpSpider(Spider):
    # A plain Spider is enough here: links are followed by hand, so no CrawlSpider rules are used.
    name = 'yp'
    allowed_domains = ['yellowpages.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Let Scrapy build and wire up the spider, then read the categories from the crawler's settings.
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.categories = crawler.settings.get('CATEGORIES')
        return spider

    def start_requests(self):
        for category in self.categories:
            yield Request(
                f'https://www.yellowpages.com/search?search_terms={category}&geo_location_terms=NY&page=1',
                self.parse,
                cb_kwargs={'category': category}
            )

    def parse(self, response, category):
        # do sth
        ...
        for url in response.xpath('//a[@class="business-name"]/@href').extract():
            # response.follow also accepts relative hrefs and resolves them against the page URL
            yield response.follow(
                url,
                self.parse_item,
                cb_kwargs={'category': category}
            )
        next_page = response.xpath('//a[@class="next ajax-page"]/@href').extract_first()
        if next_page:  # the last results page has no "next" link
            yield response.follow(
                next_page,
                self.parse,
                cb_kwargs={'category': category}
            )

    def parse_item(self, response, category):
        item = YellowItem()
        item['keyword'] = category
        # do sth else
        ...
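Once every item carries its category in the keyword field, the intermediate parent.csv is optional: an item pipeline could write each item straight to its category's file. The class below is a hypothetical sketch rather than part of the answer above. It assumes each item has a `keyword` field and that all items of a category share the fields of the first one; the class name and `out_dir` parameter are invented for illustration. It relies only on the standard pipeline hooks `open_spider`, `process_item` and `close_spider`.

```python
import csv
import os


class PerCategoryCsvPipeline:
    """Hypothetical item pipeline: one CSV file per value of item['keyword']."""

    def __init__(self, out_dir='.'):
        self.out_dir = out_dir

    def open_spider(self, spider):
        self.files = {}
        self.writers = {}

    def process_item(self, item, spider):
        row = dict(item)  # works for plain dicts and for scrapy Items
        key = row['keyword']
        if key not in self.writers:
            path = os.path.join(self.out_dir, f'{key}.csv')
            f = open(path, 'w', newline='')
            # Columns are fixed by the first item seen for this category;
            # fields that appear only on later items are silently dropped.
            writer = csv.DictWriter(f, fieldnames=list(row), extrasaction='ignore')
            writer.writeheader()
            self.files[key] = f
            self.writers[key] = writer
        self.writers[key].writerow(row)
        return item

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()
```

The pipeline would be enabled through the `ITEM_PIPELINES` setting, for example `{'myproject.pipelines.PerCategoryCsvPipeline': 300}`, where the module path is a placeholder for wherever the class lives in the project.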