Web Crawler in Python for Yelp
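I have been trying to write a crawler for Yelp. I want to get the links to the vendors listed on the page. I know each link is given in an href="..." attribute, but the array I get back is always empty. Please help, and thanks in advance :)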
import urllib
import mechanize
from bs4 import BeautifulSoup
import re
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders= [('User-agent', 'chrome')]
BASE_URL = "http://www.yelp.com/"
regex = "u(?!.*u).*,"
patern =re.compile(regex)
search = "house cleaner"
location ="London, Uk"
term = search.replace(" ","+")
place = location.replace(",","%2C").replace(" ","+")
query = BASE_URL+"search?find_desc="+term+"&find_loc="+place+"&ns=1#start=0"
html = br.open(query).read()
soup = BeautifulSoup(html)
results = soup.findAll('ul',attrs={'class':'ylist ylist-bordered search-results'})
results_parse = str(results)
soup1 = BeautifulSoup(results_parse)
names =soup1.findAll("li")
for li in names:
    soup2=BeautifulSoup(str(li))
    links=soup2.findAll("a")
    links_parse = links[0]
    vendor_links=[a["href"] for a in links]
    out= re.findall(patern,str(vendor_links))
    print out
Here is a solution to the literal problem of making your code do what you want (but see below for why I don't think this is a good approach):
import requests
import lxml.html
BASE_URL = "http://www.yelp.com"
search = "house cleaner"
location ="London, Uk"
term = search.replace(" ","+")
place = location.replace(",","%2C").replace(" ","+")
query = BASE_URL + "/search?find_desc="+term+"&find_loc="+place+"&ns=1#start=0"
html = requests.get(query).content
tree = lxml.html.fromstring(html)
results = tree.xpath("//span[@class='indexed-biz-name']/a[@class='biz-name']/@href")
for result in results:
    print BASE_URL + result
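For readers on Python 3, the query string in the snippet above could also be built with the standard library instead of chained `replace` calls. This is only a sketch: `build_search_url` and `absolute_link` are hypothetical helper names that do not appear in the original answer, and the `#start=0` fragment is left out because URL fragments are not sent to the server.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "http://www.yelp.com"


def build_search_url(search, location):
    """Return the search URL with both values percent-encoded."""
    # urlencode uses quote_plus, so spaces become "+" and "," becomes "%2C".
    params = urlencode({"find_desc": search, "find_loc": location, "ns": 1})
    return BASE_URL + "/search?" + params


def absolute_link(href):
    """Turn a site-relative href such as "/biz/example" into a full URL."""
    return urljoin(BASE_URL, href)


if __name__ == "__main__":
    print(build_search_url("house cleaner", "London, Uk"))
    print(absolute_link("/biz/example"))
```

Letting `urlencode` handle the escaping avoids having to remember which characters need replacing by hand.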
Some notes on why I made various changes to your code, in case you plan to do more scraping:
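One change that can be illustrated in isolation is dropping the regular expression. The sketch below is Python 3 and rests on assumptions: the string `py2_repr` is a made-up stand-in for what `str(vendor_links)` would produce in Python 2 for a list of unicode hrefs, and the `/biz/...` paths are invented. Under those assumptions the pattern from the question finds nothing, because no comma follows the last `u` in the string. This is one possible reason for an empty result, not necessarily the only one.

```python
import re

# Pattern copied from the question.
pattern = re.compile(r"u(?!.*u).*,")

# Hypothetical hrefs; in Python 2, str() of such a list looks like py2_repr.
vendor_links = ["/biz/example-one", "/biz/example-two"]
py2_repr = "[u'/biz/example-one', u'/biz/example-two']"

# The lookahead only accepts a "u" with no later "u", and no comma follows it.
regex_matches = pattern.findall(py2_repr)

# Using the list directly needs no string round trip at all.
direct_links = [href for href in vendor_links if href.startswith("/biz/")]

if __name__ == "__main__":
    print(regex_matches)
    print(direct_links)
```

Working on the list of hrefs directly, as the XPath version does, sidesteps the conversion to a string and back.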
More generally, though, if I wanted to extract information from a site, the first thing I would do is check whether it has an API.
Yes, I suggest you use it. Why?