
Python scrapy spider

I want to scrape data from the website http://www.quoka.de/immobilien/bueros-gewerbeflaechen with this filter:

<a class="t-bld" rel="nofollow" href="javascript:qsn.set('classtype','of',1);">nur Angebote</a>

How do I set this filter using scrapy?

You can parse a specific website by using BeautifulSoup and urllib2. Here is a Python implementation that scrapes the data matching the filter you wrote.

from BeautifulSoup import BeautifulSoup
import urllib2

def main1(website):
    data_list = []
    # Download the page and parse it (BeautifulSoup 3 / Python 2).
    web = urllib2.urlopen(website).read()
    soup = BeautifulSoup(web)
    # Collect the text of every link with rel="nofollow", which includes
    # the "nur Angebote" filter link from the question.
    description = soup.findAll('a', attrs={'rel': 'nofollow'})
    for de in description:
        data_list.append(de.text)
    return data_list

print main1("http://www.quoka.de/immobilien/bueros-gewerbeflaechen")

If you want to parse other data, such as the description text of each listing:


def main(website):
    data_list = []
    # Download and parse the page as before.
    web = urllib2.urlopen(website).read()
    soup = BeautifulSoup(web)
    # Keep the text of every <div class="description"> on the page.
    description = soup.findAll('div', attrs={'class': 'description'})
    for de in description:
        data_list.append(de.text)
    return data_list

print main("http://www.quoka.de/immobilien/bueros-gewerbeflaechen")  # the description text of each listing
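
Note that the BeautifulSoup import and urllib2 above only work on Python 2. On Python 3, a rough sketch of the same idea, assuming the requests and beautifulsoup4 packages are installed (the function name scrape_descriptions is just an illustration, not from the original answer), would be:

import requests
from bs4 import BeautifulSoup

def scrape_descriptions(website):
    # Fetch the page and parse it with the built-in html.parser backend.
    html = requests.get(website).text
    soup = BeautifulSoup(html, 'html.parser')
    # Same selector as above: every <div class="description"> on the page.
    return [div.get_text(strip=True)
            for div in soup.find_all('div', attrs={'class': 'description'})]

print(scrape_descriptions("http://www.quoka.de/immobilien/bueros-gewerbeflaechen"))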

One way is to submit a request with that parameter and parse the response. See the following code sample:

import scrapy

class TestSpider(scrapy.Spider):

    name = 'quoka'
    start_urls = ['http://www.quoka.de/immobilien/bueros-gewerbeflaechen']

    def parse(self, response):
        # Submit the search form with classtype=of ("nur Angebote"), which is
        # what the javascript:qsn.set('classtype','of',1) link does in the browser.
        request = scrapy.FormRequest.from_response(
            response,
            formname='frmSearch',
            formdata={'classtype': 'of'},
            callback=self.parse_filtered
        )
        # print request.body
        yield request

    def parse_filtered(self, response):
        # Each result is an <li> inside the ResultListData container.
        searchResults = response.xpath('//div[@id="ResultListData"]/ul/li')
        for result in searchResults:
            title = result.xpath('.//div[@class="q-col n2"]/a/@title').extract()
            print title
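
To collect the titles instead of printing them, parse_filtered could yield items so Scrapy's feed exports write them to a file. This is only a sketch of one option, not part of the original answer; the file name quoka_spider.py is assumed:

    def parse_filtered(self, response):
        # Yield one dict per result; running
        #   scrapy runspider quoka_spider.py -o results.json
        # then writes the scraped items to a JSON file.
        for result in response.xpath('//div[@id="ResultListData"]/ul/li'):
            yield {
                'title': result.xpath('.//div[@class="q-col n2"]/a/@title').extract_first(),
            }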
