Can't propel the generated links for next page to crawl recursively
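The crawler I've created fetches names and URLs from a web page. Now I can't figure out how to make my crawler use the links generated by next_page to fetch data from the following pages. I'm very new to writing crawlers with classes, which is why I can't think any further. I've already taken the initiative and tweaked my code a little, but it neither brings any result nor throws any error. I hope someone will take a look at it.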
import requests
from lxml import html

class wiseowl:
    def __init__(self,start_url):
        self.start_url=start_url
        self.storage=[]

    def crawl(self):
        self.get_link(self.start_url)

    def get_link(self,link):
        url="http://www.wiseowl.co.uk"
        response=requests.get(link)
        tree=html.fromstring(response.text)
        name=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/text()")
        urls=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/@href")
        docs=(name,urls)
        self.storage.append(docs)
        next_page=tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//a[@class='woPagingItem']/@href")
        for npage in next_page:
            if npage is not None:
                self.get_link(url+npage)

    def __str__(self):
        return "{}".format(self.storage)

crawler=wiseowl("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for item in crawler.storage:
    print(item)
I modified some parts of your class, give it a try:
import requests
from lxml import html

class wiseowl:
    def __init__(self,start_url):
        self.start_url=start_url
        self.links = [ self.start_url ]    # a list of links to crawl #
        self.storage=[]

    def crawl(self):
        for link in self.links :    # call get_link for every link in self.links #
            self.get_link(link)

    def get_link(self,link):
        print('Crawling: ' + link)
        url="http://www.wiseowl.co.uk"
        response=requests.get(link)
        tree=html.fromstring(response.text)
        name=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/text()")
        urls=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/@href")
        docs=(name,urls)
        #docs=(name, [url+u for u in urls])    # use this line if you want to join the urls #
        self.storage.append(docs)
        next_page=tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//*[@class='woPagingItem' or @class='woPagingNext']/@href")    # get links from 'woPagingItem' or 'woPagingNext' #
        for npage in next_page:
            if npage and url+npage not in self.links :    # don't get the same link twice #
                self.links += [ url+npage ]

    def __str__(self):
        return "{}".format(self.storage)

crawler=wiseowl("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for item in crawler.storage:
    item = zip(item[0], item[1])
    for i in item :
        print('{:60} {}'.format(i[0], i[1]))    # you can change 60 to the value you want #
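As a side note on the url+npage concatenation above: it works as long as every href is a root-relative path. A slightly more forgiving option might be urljoin from the standard library, which also handles page-relative and absolute hrefs. This is only a sketch, and the href values below are made-up examples, not links taken from the actual site:

```python
from urllib.parse import urljoin

# Base page the hrefs were scraped from (the start URL used in the question).
base = "http://www.wiseowl.co.uk/videos/"

# Hypothetical href values, for illustration only.
root_relative = urljoin(base, "/videos/default-2.htm")
page_relative = urljoin(base, "default-2.htm")
absolute = urljoin(base, "http://example.com/other.htm")

print(root_relative)
print(page_relative)
print(absolute)
```

The first two calls resolve to the same address under http://www.wiseowl.co.uk/videos/, while an href that is already absolute is returned unchanged.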
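You should consider using some kind of data structure to hold the links you have already visited (to avoid infinite loops), plus a container for the links you have not visited yet. Crawling is essentially a breadth-first search of the internet, so you should Google breadth-first search to get a better understanding of the underlying algorithm.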
Your crawler method should look something like this:
def crawler(self):
    while len(self.queue):
        curr_link = self.queue.pop(0)
        # process curr_link here -> scrape and add more links to queue
        # mark curr_link as visited
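To make the outline above a bit more concrete, here is a minimal sketch of the queue-plus-visited-set idea. The names crawl_bfs, get_links, max_pages and fake_site are all made up for illustration. get_links stands in for whatever function downloads a page and returns the links found on it (for example the requests/lxml code from the question), and here it is replaced by an in-memory dictionary so the example runs without network access:

```python
from collections import deque

def crawl_bfs(start_url, get_links, max_pages=100):
    """Visit pages breadth-first and return the URLs in the order visited.

    get_links is an assumed callable: it takes a URL and returns an
    iterable of URLs found on that page.
    """
    queue = deque([start_url])   # links waiting to be processed
    visited = set()              # links already processed
    order = []
    while queue and len(order) < max_pages:
        curr_link = queue.popleft()
        if curr_link in visited:
            continue             # may have been queued more than once
        visited.add(curr_link)   # mark curr_link as visited
        order.append(curr_link)
        for link in get_links(curr_link):
            if link not in visited:
                queue.append(link)
    return order

# Hypothetical site structure: each page maps to the links it contains.
fake_site = {
    "/videos/": ["/videos/page-2", "/videos/page-3"],
    "/videos/page-2": ["/videos/", "/videos/page-3"],
    "/videos/page-3": ["/videos/page-2"],
}

pages = crawl_bfs("/videos/", lambda url: fake_site.get(url, []))
print(pages)  # ['/videos/', '/videos/page-2', '/videos/page-3']
```

The deque is used instead of list.pop(0) only because popping from the left of a list gets slow as the queue grows; the visited set is what keeps pages that link back to each other from being crawled forever.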