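Web Crawler in Python for Yelp
I have been trying to write a crawler for Yelp. I want to get the links to the vendors listed on the search results page. I know they are given in href="..." attributes, but the array I get back is always empty. Please help, and thanks in advance :)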
import urllib
import mechanize
from bs4 import BeautifulSoup
import re
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders= [('User-agent', 'chrome')]
BASE_URL = "http://www.yelp.com/"
regex = "u(?!.*u).*,"
patern =re.compile(regex)
search = "house cleaner"
location ="London, Uk"
term = search.replace(" ","+")
place = location.replace(",","%2C").replace(" ","+")
query = BASE_URL+"search?find_desc="+term+"&find_loc="+place+"&ns=1#start=0"
html = br.open(query).read()
soup = BeautifulSoup(html)
results = soup.findAll('ul',attrs={'class':'ylist ylist-bordered search-results'})
results_parse = str(results)
soup1 = BeautifulSoup(results_parse)
names =soup1.findAll("li")
for li in names:
    soup2 = BeautifulSoup(str(li))
    links = soup2.findAll("a")
    links_parse = links[0]
    vendor_links = [a["href"] for a in links]
    out = re.findall(patern, str(vendor_links))
    print(out)
Here is a solution to the literal problem of making your code do what you want (but see below for why I don't think this is a good approach):
import requests
import lxml.html
BASE_URL = "http://www.yelp.com"
search = "house cleaner"
location ="London, Uk"
term = search.replace(" ","+")
place = location.replace(",","%2C").replace(" ","+")
query = BASE_URL + "/search?find_desc="+term+"&find_loc="+place+"&ns=1#start=0"
html = requests.get(query).content
tree = lxml.html.fromstring(html)
results = tree.xpath("//span[@class='indexed-biz-name']/a[@class='biz-name']/@href")
for result in results:
    print(BASE_URL + result)
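The two steps above (building the search URL, then pulling out the business links) can also be sketched with only the Python 3 standard library, which makes them easy to try without a network connection. This is a rough illustration, not the answer's own method: `build_search_url`, `BizLinkParser`, `extract_biz_links`, `SAMPLE_HTML` and the `/biz/example-cleaners-london` path are all made-up names, and the sample markup only assumes that result links carry the `biz-name` class used in the XPath above. Unlike that XPath, the sketch does not check for the enclosing `indexed-biz-name` span.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

BASE_URL = "http://www.yelp.com"


def build_search_url(term, location):
    """Build the search URL, letting urlencode do the escaping
    instead of chained str.replace calls."""
    params = {"find_desc": term, "find_loc": location, "ns": 1}
    return BASE_URL + "/search?" + urlencode(params)


class BizLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose class list contains 'biz-name'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "biz-name" in classes and attr.get("href"):
            # Hrefs in the results are relative, so resolve them against the site root.
            self.links.append(urljoin(BASE_URL, attr["href"]))


def extract_biz_links(html):
    parser = BizLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical markup standing in for a downloaded results page.
SAMPLE_HTML = """
<ul class="ylist ylist-bordered search-results">
  <li><span class="indexed-biz-name">1.
    <a class="biz-name" href="/biz/example-cleaners-london">Example Cleaners</a>
  </span></li>
  <li><a class="other" href="/adredir?x=1">Ad</a></li>
</ul>
"""

if __name__ == "__main__":
    print(build_search_url("house cleaner", "London, Uk"))
    for link in extract_biz_links(SAMPLE_HTML):
        print(link)
```

Note that the parser walks the document once; there is no need to convert intermediate results back to strings and parse them again, as the code in the question does.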
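In case you are going to do more scraping, here are some notes on why I made various changes to your code:
More generally, though, if I wanted to extract information from a site, the first thing I would do is check whether it has an API.
Yelp does, and I would suggest you use it. Why?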
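To give a feel for the API route, here is a minimal Python 3 sketch of a business search request. It rests on several assumptions that are not in the original post: that the current Yelp Fusion endpoint is `https://api.yelp.com/v3/businesses/search`, that it accepts `term` and `location` parameters, that it authenticates with an `Authorization: Bearer <key>` header, and that the JSON response holds a `businesses` list whose items have a `url` field. Please check these against Yelp's current documentation. The function names `build_api_request`, `business_urls` and `search_businesses` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; verify against Yelp's current API documentation.
API_URL = "https://api.yelp.com/v3/businesses/search"


def build_api_request(api_key, term, location):
    """Prepare (but do not send) a search request."""
    query = urlencode({"term": term, "location": location})
    return Request(
        API_URL + "?" + query,
        headers={"Authorization": "Bearer " + api_key},
    )


def business_urls(payload):
    """Pull the business page URLs out of a decoded JSON response."""
    return [b["url"] for b in payload.get("businesses", []) if "url" in b]


def search_businesses(api_key, term, location):
    """Send the request and return the business URLs (needs a valid key)."""
    with urlopen(build_api_request(api_key, term, location)) as response:
        return business_urls(json.load(response))
```

With a real key, `search_businesses(key, "house cleaner", "London, UK")` would return the links directly as structured data, with no HTML parsing involved.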