Python data scraping with Scrapy

Question

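I want to scrape data from a website that has text fields, buttons and so on. My requirement is to fill in the text fields and submit the form to get the results, and then scrape the data points from the results page.

I want to know whether Scrapy has this feature, or whether anyone can recommend a Python library to accomplish this task.

(Edited)
I want to scrape the data from the following website: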
http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType

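My requirement is to select values from the combo boxes, hit the search button, and scrape the data points from the results page.

P.S. I'm using the Selenium Firefox driver to scrape data from some other website, but that solution is not good because the Selenium Firefox driver depends on Firefox's EXE, i.e. Firefox must be installed before running the scraper.

The Selenium Firefox driver consumes around 100 MB of memory per instance, and my requirement is to run a lot of instances at a time to make the scraping process quick, so there is a memory limitation as well.

Firefox sometimes crashes while the scraper is running, and I don't know why. Also, I need windowless scraping, which is not possible with the Selenium Firefox driver.

My ultimate goal is to run the scrapers on Heroku, where I have a Linux environment, so the Selenium Firefox driver won't work on Heroku. Thanks.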

Answer 1

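Basically, you have plenty of tools to choose from:

scrapy
beautifulsoup
lxml
mechanize
requests (and grequests)
selenium
ghost.py

These tools have different purposes, but they can be mixed together depending on the task.

Scrapy is a powerful and very smart tool for crawling websites and extracting data. But when it comes to manipulating the page (clicking buttons, filling in forms), it becomes more complicated:

sometimes it is easy to simulate filling in and submitting a form by making the underlying form request directly in scrapy
sometimes you have to use other tools to help scrapy, such as mechanize or selenium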
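
To illustrate what "making the underlying form request directly" can look like, here is a minimal sketch that uses only the Python standard library rather than scrapy itself. It collects the named inputs of the first form on a page (hidden tokens included), overrides the fields you want to fill in, and builds the POST request without sending it. The names FormFieldCollector and build_form_post, the example.com URL, and the sample HTML are all hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request


class FormFieldCollector(HTMLParser):
    """Collect the action URL and named <input> values of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Inputs without a name (e.g. a plain submit button) are not sent.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_form_post(page_html, base_url, overrides):
    """Build (but do not send) the POST request a browser would submit."""
    collector = FormFieldCollector()
    collector.feed(page_html)
    data = dict(collector.fields)
    data.update(overrides)  # the values a user would have typed or selected
    url = urljoin(base_url, collector.action or "")
    return Request(url, data=urlencode(data).encode("utf-8"), method="POST")


# Made-up page used only to exercise the sketch.
SAMPLE_PAGE = """
<form action="/search/result" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="query" value="">
  <input type="submit" value="Search">
</form>
"""

request = build_form_post(SAMPLE_PAGE, "http://example.com/search", {"query": "deed"})
print(request.full_url)
print(request.data.decode("utf-8"))
```

This approach only works when the form can be submitted without JavaScript; if the page builds its request in script, a browser-driving tool such as selenium is likely still needed.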

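If you make your question more specific, it will help to understand which tools you should use or choose from.

Take a look at an example of an interesting scrapy and selenium mix. Here, the selenium task is to click the button and provide data for the scrapy items: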

import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By


class ElyseAvenueItem(scrapy.Item):
    name = scrapy.Field()


class ElyseAvenueSpider(scrapy.Spider):
    name = "elyse"
    allowed_domains = ["ehealthinsurance.com"]
    start_urls = [
        'http://www.ehealthinsurance.com/individual-family-health-insurance?action=changeCensus&census.zipCode=48341&census.primary.gender=MALE&census.requestEffectiveDate=06/01/2013&census.primary.month=12&census.primary.day=01&census.primary.year=1971']

    def __init__(self, *args, **kwargs):
        # Let scrapy initialise the spider before attaching the browser.
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the same URL in a real browser so the button can be clicked.
        self.driver.get(response.url)
        # find_elements returns an empty list instead of raising when nothing matches.
        buttons = self.driver.find_elements(By.XPATH, "//input[contains(@class,'btn go-btn')]")
        if buttons:
            buttons[0].click()

        # Crude fixed wait for the results to load.
        time.sleep(10)

        plans = self.driver.find_elements(By.CLASS_NAME, "plan-info")
        for plan in plans:
            item = ElyseAvenueItem()
            item['name'] = plan.find_element(By.CLASS_NAME, 'primary').text
            yield item

        self.driver.close()

UPDATE:

Here is an example of how to use scrapy in your case:

import scrapy
from scrapy.http import FormRequest


class AcrisItem(scrapy.Item):
    borough = scrapy.Field()
    block = scrapy.Field()
    doc_type_name = scrapy.Field()


class AcrisSpider(scrapy.Spider):
    name = "acris"
    allowed_domains = ["a836-acris.nyc.gov"]
    start_urls = ['http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType']

    def parse(self, response):
        # Every <option> of the document type combo box.
        document_classes = response.xpath('//select[@name="combox_doc_doctype"]/option')

        # Hidden verification token that has to be posted back with the form.
        form_token = response.xpath('//input[@name="__RequestVerificationToken"]/@value').get()
        for document_class in document_classes:
            doc_type = document_class.xpath('.//@value').get()
            doc_type_name = document_class.xpath('.//text()').get()
            if not doc_type or not doc_type_name:
                # Skip options that carry no value or no label.
                continue
            formdata = {'__RequestVerificationToken': form_token,
                        'hid_selectdate': '7',
                        'hid_doctype': doc_type,
                        'hid_doctype_name': doc_type_name,
                        'hid_max_rows': '10',
                        'hid_ISIntranet': 'N',
                        'hid_SearchType': 'DOCTYPE',
                        'hid_page': '1',
                        'hid_borough': '0',
                        'hid_borough_name': 'ALL BOROUGHS',
                        'hid_ReqID': '',
                        'hid_sort': '',
                        'hid_datefromm': '',
                        'hid_datefromd': '',
                        'hid_datefromy': '',
                        'hid_datetom': '',
                        'hid_datetod': '',
                        'hid_datetoy': '', }
            # Submit the search form once per document type.
            yield FormRequest(url="http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentTypeResult",
                              method="POST",
                              formdata=formdata,
                              callback=self.parse_page,
                              meta={'doc_type_name': doc_type_name})

    def parse_page(self, response):
        rows = response.xpath('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr')
        for row in rows:
            borough = row.xpath('.//td[2]/div/font/text()').get()
            block = row.xpath('.//td[3]/div/font/text()').get()

            # Only emit rows that have both cells.
            if borough and block:
                item = AcrisItem()
                item['borough'] = borough
                item['block'] = block
                item['doc_type_name'] = response.meta['doc_type_name']

                yield item

Save it in spider.py and run it via scrapy runspider spider.py -o output.json; in output.json you will see:

{"doc_type_name": "CONDEMNATION PROCEEDINGS ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFICATE OF REDUCTION ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFIED COPY OF WILL ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERT NONATTCHMENT FED TAX LIEN ", "borough": "Borough", "block": "Block"}
...

Hope that helps.

Answer 2

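If you simply want to submit a form and extract data from the resulting page, I'd go for:

requests to send the POST request
BeautifulSoup to extract the chosen data from the result page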
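
A rough sketch of that two-step flow is shown below. To keep it runnable without extra packages it swaps in the standard library: urllib in place of requests and html.parser in place of BeautifulSoup, so it is not the exact combination suggested above. The names post_form, TableRowParser and extract_rows are hypothetical, and the sample result HTML is made up; post_form is defined but not called here, since it would need a live site.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def post_form(url, fields):
    """Send the form fields as a POST request and return the response body as text."""
    body = urlencode(fields).encode("utf-8")
    with urlopen(Request(url, data=body, method="POST")) as response:
        return response.read().decode("utf-8", errors="replace")


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Rows without <td> cells (such as a header row of <th>) are dropped.
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(result_html):
    parser = TableRowParser()
    parser.feed(result_html)
    return parser.rows


# Made-up result page used only to exercise the parser.
SAMPLE_RESULT = """
<table>
  <tr><th>Borough</th><th>Block</th></tr>
  <tr><td>MANHATTAN</td><td> 123 </td></tr>
  <tr><td>BROOKLYN</td><td>456</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_RESULT)
for borough, block in rows:
    print(borough, block)
```

The parser is deliberately simple and does not handle nested tables; for real pages a dedicated HTML library is likely to be more robust.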

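Scrapy's added value really lies in its ability to follow links and crawl a website. If you know precisely what you are looking for, I don't think it is the right tool for the job.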

Answer 3

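I would personally use mechanize, as I don't have any experience with scrapy. However, a library named scrapy, purpose-built for screen scraping, should be up to the task. I would give both of them a go and see which one does the job best or most easily.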

3 answers

Answer 1: 17, accepted, 2013-05-28 08:06:26

Answer 2: 3, 2013-05-28 07:13:01

Answer 3: 2, 2013-05-28 07:05:21
