Scrapy web crawling going bad
I'm new to Scrapy and am trying to understand it by scraping the yellowpages.com website.
My objective is to write Python code that fills in the search fields (business and location) on the yellowpages.com homepage and then scrapes the resulting URLs.
My code looks like this:
import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from spider.items import Website


class YellowPages(Spider):
    name = "yellow"
    allowed_domains = ["yellowpages.com"]
    start_urls = [
        "http://www.yellowpages.com/"
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='search-form']",
            formdata={
                "query": "business",
                "location": "78735"},
            callback=self.after_results
        )

    def after_results(self, response):
        self.logger.info("info msg")
I want to search for "business" at location "78735". However, these are not the values that are passed to the website. My log looks like this:
2016-01-28 23:55:36 [scrapy] DEBUG: Crawled (200) <GET http://www.yellowpages.com/> (referer: None)
2016-01-28 23:55:36 [scrapy] DEBUG: Crawled (200) <GET http://www.yellowpages.com/search?search_terms=&geo_location_terms=Los+Angeles%2C+CA&query=business&location=78735> (referer: http://www.yellowpages.com/)
In the second URL, the term Los+Angeles is inserted somehow, and search_terms is left empty. When I fill in the search fields manually and submit, the URL looks like this:
http://www.yellowpages.com/search?search_terms=business&geo_location_terms=78735
Can someone tell me what's going wrong and how to fix it?
Thanks a lot.
Just for reference, here is part of the HTML source of the yellowpages.com home page:
<div class="search-bar"><form id="search-form" action="/search" method="GET"><div><label><span>What do you want to find?</span><input id="query" type="text" value="" placeholder="What do you want to find?" autocomplete="off" data-onempty="recent-searches" name="search_terms" tabindex="1"/></label><ul id="recent-searches" class="search-dropdown recent-searches"><li class="search-hint">Search by<b> business name,</b> or<b> keyword</b></li></ul><ul id="autosuggest-term" data-analytics='{"moi":105}' class="search-dropdown autosuggest-term"></ul></div><em>near</em><div><label><span>Where?</span> <input id="location"type="text" value="78735" placeholder="Where?" autocomplete="off" data-onempty="menu-location" name="geo_location_terms" tabindex="2"/></label>
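Set the search_terms and geo_location_terms form parameters. The keys in formdata have to match the name attributes of the form inputs, not their id attributes (query and location). from_response keeps the pre-filled value of every form field you do not override and appends any unknown keys as extra parameters, which is why the logged URL contains both the defaults and your values: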
def parse(self, response):
    return scrapy.FormRequest.from_response(
        response,
        formxpath="//form[@id='search-form']",
        # Keys are the inputs' name attributes, not their ids.
        formdata={
            "search_terms": "business",
            "geo_location_terms": "78735"},
        callback=self.after_results
    )
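To see why the id-based keys failed, here is a rough standalone sketch of the merge that from_response performs. It is not Scrapy's actual code: InputCollector and build_query are hypothetical helpers, the form markup is trimmed down from the question, and the pre-filled "Los Angeles, CA" value is an assumption inferred from the logged URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Trimmed-down form from the question. The pre-filled location value is an
# assumption based on the URL that appeared in the crawl log.
FORM_HTML = """
<form id="search-form" action="/search" method="GET">
  <input id="query" type="text" value="" name="search_terms"/>
  <input id="location" type="text" value="Los Angeles, CA" name="geo_location_terms"/>
</form>
"""


class InputCollector(HTMLParser):
    """Collect (name, value) pairs from <input> tags in document order."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if "name" in attrs:  # fields are keyed by name, never by id
                self.fields.append((attrs["name"], attrs.get("value") or ""))


def build_query(form_html, formdata):
    """Keep pre-filled fields that are not overridden, then append formdata."""
    collector = InputCollector()
    collector.feed(form_html)
    merged = [(n, v) for n, v in collector.fields if n not in formdata]
    merged.extend(formdata.items())
    return urlencode(merged)


# Keys taken from the id attributes: defaults survive, extras are appended.
print(build_query(FORM_HTML, {"query": "business", "location": "78735"}))
# Keys taken from the name attributes: the defaults are replaced.
print(build_query(FORM_HTML, {"search_terms": "business",
                              "geo_location_terms": "78735"}))
```

The first call reproduces the query string from the question's log, the second the one you get when submitting the form by hand.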
Tested with the following spider:
import scrapy
from scrapy.spiders import Spider


class YellowPages(Spider):
    name = "yellow"
    allowed_domains = ["yellowpages.com"]
    start_urls = [
        "http://www.yellowpages.com/"
    ]

    def parse(self, response):
        # Submit the search form, overriding the inputs by their name attributes.
        return scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='search-form']",
            formdata={
                "search_terms": "business",
                "geo_location_terms": "78735"},
            callback=self.after_results
        )

    def after_results(self, response):
        # Print the name of every business on the results page.
        for result in response.css("div.result a[itemprop=name]::text").extract():
            print(result)
Prints the list of businesses in "Austin, TX":
Prism Solutions
Time Agent
Stuart Consulting
Jones REX L
Medical Informatics & Tech Inc
J E Andrews INC
...
Hicks Consulting
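Since the form submits with GET, another option is to skip the form and build the search URL yourself. A minimal sketch, assuming the /search endpoint and parameter names from the question stay as they are; search_url is a hypothetical helper, not part of Scrapy:

```python
from urllib.parse import urlencode

BASE_URL = "http://www.yellowpages.com/search"


def search_url(terms, location):
    """Build the results URL using the form inputs' name attributes."""
    params = {"search_terms": terms, "geo_location_terms": location}
    return BASE_URL + "?" + urlencode(params)


print(search_url("business", "78735"))
```

The resulting string could go straight into start_urls or a scrapy.Request, at the cost of breaking silently if the site renames its parameters.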