Python scrapy spider
I want to scrape data from http://www.quoka.de/immobilien/bueros-gewerbeflaechen using the following filter:
<a class="t-bld" rel="nofollow" href="javascript:qsn.set('classtype','of',1);">nur Angebote</a>
How can I set this filter with scrapy?
You can parse that particular site with BeautifulSoup and urllib2. Here is a Python implementation that parses/scrapes the data according to the filter you wrote:
from BeautifulSoup import BeautifulSoup
import urllib2

def main1(website):
    data_list = []
    web = urllib2.urlopen(website).read()
    soup = BeautifulSoup(web)
    # collect the text of every <a rel="nofollow"> link on the page
    description = soup.findAll('a', attrs={'rel': 'nofollow'})
    for de in description:
        data_list.append(de.text)
    return data_list

print main1("http://www.quoka.de/immobilien/bueros-gewerbeflaechen")
If you want to parse other data, for example the description of each listing:
def main(website):
    data_list = []
    web = urllib2.urlopen(website).read()
    soup = BeautifulSoup(web)
    # collect the text of every <div class="description"> element
    description = soup.findAll('div', attrs={'class': 'description'})
    for de in description:
        data_list.append(de.text)
    return data_list

print main("http://www.quoka.de/immobilien/bueros-gewerbeflaechen")  # the data of each section
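The two helpers above rely on the third-party BeautifulSoup package and Python 2's urllib2. As a dependency-free illustration of the same rel="nofollow" filtering idea, here is a Python 3 sketch using only the standard library's html.parser, run against an inline snippet modelled on the filter link from the question (not fetched from the live site):

```python
from html.parser import HTMLParser

class NofollowLinkParser(HTMLParser):
    """Collect the text of every <a rel="nofollow"> link."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'a' and dict(attrs).get('rel') == 'nofollow':
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.links.append(data)

# Demo on an inline snippet modelled on the link from the question:
html = '<a class="t-bld" rel="nofollow" href="#">nur Angebote</a>'
parser = NofollowLinkParser()
parser.feed(html)
print(parser.links)  # ['nur Angebote']
```

This avoids installing anything, at the cost of writing the traversal by hand; for real scraping, BeautifulSoup's findAll (find_all in bs4) is far more convenient.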
One way is to submit the request with the filter parameter and then parse the response. See the following code example:
import scrapy

class TestSpider(scrapy.Spider):

    name = 'quoka'
    start_urls = ['http://www.quoka.de/immobilien/bueros-gewerbeflaechen']

    def parse(self, response):
        # submit the search form with the 'nur Angebote' filter applied
        request = scrapy.FormRequest.from_response(
            response,
            formname='frmSearch',
            formdata={'classtype': 'of'},
            callback=self.parse_filtered
        )
        # print request.body
        yield request

    def parse_filtered(self, response):
        searchResults = response.xpath('//div[@id="ResultListData"]/ul/li')
        for result in searchResults:
            title = result.xpath('.//div[@class="q-col n2"]/a/@title').extract()
            print title
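For clarity on what FormRequest.from_response does here: it reads the fields already present in the page's form and lets the entries in formdata override them before submitting. A minimal standard-library sketch of that merge (the default field values below are hypothetical placeholders, not taken from the real quoka.de form):

```python
from urllib.parse import urlencode

# Hypothetical defaults standing in for the fields scrapy would read
# out of the page's <form name="frmSearch">:
page_form_fields = {'classtype': 'all', 'comment': ''}

# formdata overrides win over the values found in the form, just as
# with FormRequest.from_response:
overrides = {'classtype': 'of'}
merged = {**page_form_fields, **overrides}

# this is the kind of body the POST request carries
body = urlencode(merged)
print(body)  # classtype=of&comment=
```

So setting formdata={'classtype': 'of'} reproduces exactly what clicking the "nur Angebote" link triggers via qsn.set('classtype','of',1) in the page's JavaScript.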