
Scrapy web crawling going bad

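I am new to Scrapy and am trying to get the hang of it by scraping the yellowpages.com website.

My goal is to write Python code that fills in the search fields (business and location) on the yellowpages.com home page and then scrapes the resulting URL.

My code looks like this: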

import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from spider.items import Website

class YellowPages(Spider):
    name = "yellow"
    allowed_domains = ["yellowpages.com"]
    start_urls = [
        "http://www.yellowpages.com/"
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='search-form']",
            formdata={
                "query":"business",
                "location" : "78735" },
            callback=self.after_results
        )

    def after_results(self, response):
        self.logger.info("info msg")

I want to search for "business" in the location "78735". However, these are not the values passed to the website. My log looks like this:

2016-01-28 23:55:36 [scrapy] DEBUG: Crawled (200) <GET http://www.yellowpages.com/> (referer: None)

2016-01-28 23:55:36 [scrapy] DEBUG: Crawled (200) <GET http://www.yellowpages.com/search?search_terms=&geo_location_terms=Los+Angeles%2C+CA&query=business&location=78735> (referer: http://www.yellowpages.com/)

In the second URL, the term Los+Angeles is somehow inserted. When I fill in the search fields manually and submit, the URL should look like this:

http://www.yellowpages.com/search?search_terms=business&geo_location_terms=78735

Can someone tell me what is going wrong and how to fix it?

Thanks a lot.

FYI, here is part of the HTML source of the yellowpages.com home page:

<div class="search-bar"><form id="search-form" action="/search" method="GET"><div><label><span>What do you want to find?</span><input id="query" type="text" value="" placeholder="What do you want to find?" autocomplete="off" data-onempty="recent-searches" name="search_terms" tabindex="1"/></label><ul id="recent-searches" class="search-dropdown recent-searches"><li class="search-hint">Search by<b> business name,</b> or<b> keyword</b></li></ul><ul id="autosuggest-term" data-analytics='{"moi":105}' class="search-dropdown autosuggest-term"></ul></div><em>near</em><div><label><span>Where?</span> <input id="location"type="text" value="78735" placeholder="Where?" autocomplete="off" data-onempty="menu-location" name="geo_location_terms" tabindex="2"/></label>

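Set the search_terms and geo_location_terms form parameters instead. The formdata keys have to match each input's name attribute; query and location are the inputs' id values, so they were appended as extra parameters while the real fields kept their prefilled values: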

def parse(self, response):
    return scrapy.FormRequest.from_response(
        response,
        formxpath="//form[@id='search-form']",
        formdata={
            "search_terms": "business",
            "geo_location_terms" : "78735"},
        callback=self.after_results
    )
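
To see why the logged URL contained both sets of parameters, here is a rough standard-library sketch of the merge that from_response performs: it starts from the values already present in the form and overrides them with formdata, matching by field name. This is a simplified model, not Scrapy's actual implementation. FORM_DEFAULTS and build_search_url are hypothetical names, and the prefilled values are assumed from the logged request.

```python
from urllib.parse import urlencode

# Assumed prefilled values of the form inputs, keyed by their name attribute.
# These are inferred from the logged request, not read from Scrapy itself.
FORM_DEFAULTS = {"search_terms": "", "geo_location_terms": "Los Angeles, CA"}


def build_search_url(formdata, base="http://www.yellowpages.com/search"):
    """Merge formdata over the prefilled fields and encode a GET URL."""
    fields = dict(FORM_DEFAULTS)  # start from what the page prefilled
    fields.update(formdata)       # matching names override, unknown keys are appended
    return base + "?" + urlencode(fields)


# Keys taken from the id attributes: the real fields keep their defaults.
wrong = build_search_url({"query": "business", "location": "78735"})
# Keys taken from the name attributes: the defaults are overridden.
right = build_search_url({"search_terms": "business", "geo_location_terms": "78735"})

print(wrong)
print(right)
```

The first URL reproduces the request from the question's log, and the second matches the URL produced by submitting the form manually.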

Tested with the following spider:

import scrapy
from scrapy.spiders import Spider


class YellowPages(Spider):
    name = "yellow"
    allowed_domains = ["yellowpages.com"]
    start_urls = [
        "http://www.yellowpages.com/"
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='search-form']",
            formdata={
                "search_terms":"business",
                "geo_location_terms" : "78735"},
            callback=self.after_results
        )

    def after_results(self, response):
        for result in response.css("div.result a[itemprop=name]::text").extract():
            print(result)

It prints a list of businesses for Austin, Texas:

Prism Solutions
Time Agent
Stuart Consulting
Jones REX L
Medical Informatics & Tech Inc
J E Andrews INC
...
Hicks Consulting
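
As a side note, one way to find the right formdata keys before writing a spider is to list the id and name of every input in the form. The following is a minimal sketch using only the standard library; SAMPLE_FORM is a trimmed, hypothetical version of the markup from the question, and InputCollector and list_form_inputs are illustrative names.

```python
from html.parser import HTMLParser

# Trimmed, hypothetical version of the search form from the question.
SAMPLE_FORM = """
<form id="search-form" action="/search" method="GET">
  <input id="query" type="text" value="" name="search_terms"/>
  <input id="location" type="text" value="78735" name="geo_location_terms"/>
</form>
"""


class InputCollector(HTMLParser):
    """Collect (id, name) pairs for every <input> element."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input .../> also end up here,
        # because the default handle_startendtag calls handle_starttag.
        if tag == "input":
            attributes = dict(attrs)
            self.inputs.append((attributes.get("id"), attributes.get("name")))


def list_form_inputs(html):
    """Return the (id, name) pairs of all inputs found in the given HTML."""
    collector = InputCollector()
    collector.feed(html)
    return collector.inputs


for input_id, input_name in list_form_inputs(SAMPLE_FORM):
    print(f"id={input_id!r}  name={input_name!r}")
```

The second value in each pair is what belongs in formdata.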
