Python scrapy spider
I want to scrape data from http://www.quoka.de/immobilien/bueros-gewerbeflaechen using the following filter:
<a class="t-bld" rel="nofollow" href="javascript:qsn.set('classtype','of',1);">nur Angebote</a>
How can I set this filter with scrapy?
You can parse that particular site with BeautifulSoup and urllib2. Here is a Python implementation that scrapes the data matched by the filter you wrote:
from BeautifulSoup import BeautifulSoup
import urllib2

def main1(website):
    data_list = []
    web = urllib2.urlopen(website).read()
    soup = BeautifulSoup(web)
    # every <a rel="nofollow"> element, including the "nur Angebote" filter link
    description = soup.findAll('a', attrs={'rel': 'nofollow'})
    for de in description:
        data_list.append(de.text)
    return data_list

print main1("http://www.quoka.de/immobilien/bueros-gewerbeflaechen")
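Note that BeautifulSoup 3 and urllib2 only exist on Python 2. If you are on Python 3 and don't want to install bs4, a rough equivalent of the same rel="nofollow" filter can be sketched with the standard library's html.parser (the sample HTML below is made up for illustration):

```python
from html.parser import HTMLParser

class NofollowLinkParser(HTMLParser):
    """Collect the text content of <a rel="nofollow"> elements."""
    def __init__(self):
        super().__init__()
        self._in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        # open a new entry when we hit an <a> whose rel attribute is "nofollow"
        if tag == 'a' and dict(attrs).get('rel') == 'nofollow':
            self._in_link = True
            self.links.append('')

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_link = False

    def handle_data(self, data):
        # accumulate text only while inside a matching link
        if self._in_link:
            self.links[-1] += data

# sample markup mimicking the filter link from the question
parser = NofollowLinkParser()
parser.feed('<a class="t-bld" rel="nofollow" href="#">nur Angebote</a>')
print(parser.links)  # ['nur Angebote']
```

For the real page you would feed it the bytes from urllib.request.urlopen, decoded with the page's charset.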
If you want to parse other data instead, for example the description of each listing:
def main(website):
    data_list = []
    web = urllib2.urlopen(website).read()
    soup = BeautifulSoup(web)
    # every <div class="description"> element, i.e. the text of each listing
    description = soup.findAll('div', attrs={'class': 'description'})
    for de in description:
        data_list.append(de.text)
    return data_list

print main("http://www.quoka.de/immobilien/bueros-gewerbeflaechen")  # the description of each section
One way is to submit the search form with the filter parameter and parse the resulting response. See the following code example:
import scrapy

class TestSpider(scrapy.Spider):
    name = 'quoka'
    start_urls = ['http://www.quoka.de/immobilien/bueros-gewerbeflaechen']

    def parse(self, response):
        # submit the page's search form with the 'classtype' filter set to 'of',
        # which is what the "nur Angebote" link sets via JavaScript
        request = scrapy.FormRequest.from_response(
            response,
            formname='frmSearch',
            formdata={'classtype': 'of'},
            callback=self.parse_filtered
        )
        # print request.body
        yield request

    def parse_filtered(self, response):
        searchResults = response.xpath('//div[@id="ResultListData"]/ul/li')
        for result in searchResults:
            title = result.xpath('.//div[@class="q-col n2"]/a/@title').extract()
            print title
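What FormRequest.from_response does here is collect the input fields of the frmSearch form from the HTML, override them with the values given in formdata, and URL-encode the result into a POST body. The encoding step can be sketched with the standard library alone (the 'classid' hidden field below is hypothetical, added only for illustration; 'classtype' is the real filter parameter from the question):

```python
from urllib.parse import urlencode

# Fields as from_response would collect them: hidden inputs from the
# <form name="frmSearch"> markup ('classid' here is hypothetical),
# with our formdata override 'classtype': 'of' applied on top.
form_fields = {'classid': '27_2710', 'classtype': 'of'}
body = urlencode(form_fields)
print(body)  # classid=27_2710&classtype=of
```

Inspecting request.body in the spider (the commented-out print) shows exactly this kind of encoded string, which is useful for checking that the filter parameter was actually sent.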