I made a scraper for Yellow Pages. There is a categories.txt file that is read by the script, and links are then generated from those categories:
settings = get_project_settings()
categories = settings.get('CATEGORIES')
links = []
for category in categories:
    link = 'https://www.yellowpages.com/search?search_terms=' + category + '&geo_location_terms=NY&page=1'
    links.append(link)
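For reference, the categories.txt read mentioned above could look something like this minimal sketch; the one-category-per-line layout and the `load_categories` helper name are assumptions, not shown in the original script:

```python
# Hypothetical loader for categories.txt (assumed layout: one category per line).
def load_categories(path='categories.txt'):
    with open(path) as f:
        # strip surrounding whitespace and skip blank lines
        return [line.strip() for line in f if line.strip()]

# In settings.py this could then feed the setting: CATEGORIES = load_categories()
```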
This links list is then passed to start_urls:
class YpSpider(CrawlSpider):
    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    name = 'yp'
    allowed_domains = ['yellowpages.com']
    start_urls = links

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="business-name"]', allow=''), callback='parse_item',
             follow=True),
        Rule(LinkExtractor(restrict_xpaths='//a[@class="next ajax-page"]', allow=''),
             follow=True),
    )
It saves all the data from all the links in a CSV file named parent.csv. This parent.csv file has a column named keyword, which helps in separating the data from different categories and making a separate CSV file for each of them. This is implemented in the spider_closed function:
    def spider_closed(self, spider):
        with open('parent.csv', 'r') as file:
            reader = csv.reader(file)
            for row in reader:
                with open('{}.csv'.format(row[0]), 'a') as f:
                    writer = csv.writer(f)
                    writer.writerow(row)
The problem I am facing is getting the category name corresponding to every link in my parse method, so that it can be saved in parent.csv and used to separate the different categories afterwards:
    def parse_item(self, response):
        item = YellowItem()
        item['keyword'] = # here I need the corresponding category for every link
I think you should change the way you generate the links. You can, for example, override the start_requests method and pass the category to the request through either its cb_kwargs or meta attribute. I would also suggest that you get the settings from the crawler running the spider by overriding from_crawler. Here's how I would do it:
class YpSpider(CrawlSpider):
    def __init__(self, crawler, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.crawler = crawler
        self.categories = crawler.settings.get('CATEGORIES')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start_requests(self):
        for category in self.categories:
            yield Request(
                f'https://www.yellowpages.com/search?search_terms={category}&geo_location_terms=NY&page=1',
                self.parse,
                cb_kwargs={'category': category}
            )

    def parse(self, response, category):
        # do sth
        ...
        for url in response.xpath('//a[@class="business-name"]/@href').extract():
            # response.follow resolves relative hrefs against the page URL
            yield response.follow(
                url,
                self.parse_item,
                cb_kwargs={'category': category}
            )
        next_page = response.xpath('//a[@class="next ajax-page"]/@href').extract_first()
        if next_page:  # the last page has no next link, so guard against None
            yield response.follow(
                next_page,
                self.parse,
                cb_kwargs={'category': category}
            )

    def parse_item(self, response, category):
        item = YellowItem()
        item['keyword'] = category
        # do sth else
        ...
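To make the mechanism concrete: cb_kwargs is a dict that Scrapy unpacks into the callback as extra keyword arguments when the response comes back. The stdlib-only illustration below mimics that calling convention; the callback body and the example URL are made up for the demo:

```python
def parse_item(response, category):
    # Mirrors the spider callback above: `category` arrives as an
    # extra keyword argument alongside the response.
    return {'keyword': category, 'url': response}

# Scrapy effectively does callback(response, **cb_kwargs):
cb_kwargs = {'category': 'plumbers'}
item = parse_item('https://example.com/biz', **cb_kwargs)
```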