
Scrapy Start_URL not correct

So I'm new to Scrapy and I'm running into an issue where (I believe) the start URL isn't correct.

The spider should then click the links to load each camp's description page.

However, when I use that start URL it doesn't load, meaning that Scrapy opens and loads the telnet console but never connects. When I use http://www.w3.org/1999/xhtml (which I got from the top line of Chrome's inspect panel) it crawls, but that is clearly the completely wrong site.

The URL where it SHOULD start is: http://www.kidscamps.com/camps/california-overnight-camps-page0.html

Any ideas? Thanks in advance! Sorry about all the commented-out lines.

So I guess my biggest question is how do I find the CORRECT URL to start with, since all my other scripts work correctly.

Also, it doesn't work without rules assigned.
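(A quick way to sanity-check a start URL, offered here as a suggestion on top of the original post, is to fetch it in scrapy shell and try the link-extractor XPath by hand before running the full spider:)

scrapy shell "http://www.kidscamps.com/residential/overnight_camp.html"

# Inside the shell, confirm the page actually loaded and that the XPath matches some links:
response.status                                                   # expect 200
response.xpath('//*[@id="results-wrapper"]/div[1]/p[1]/a/@href').extract()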

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from kidscamp_com.items import KidscampComItem
import html2text


class MySpider(CrawlSpider):
    name = "kids"
    #allowed_domains = "http://www.bayareaparent.com/Camp-Guide/index.php/cp/1/si/0/"
    start_urls = ['http://www.kidscamps.com/residential/overnight_camp.html']

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//*[@id="results-wrapper"]/div[1]/p[1]/a',)),
             callback="parse1", follow=True),
    )


    def parse1(self, response):
        hxs = Selector(response)
        rows = hxs.xpath('//*[@id="body-wrapper"]')
        items = []
        for body in rows:
            item = KidscampComItem()
            # item["camp_name"] = body.xpath('').extract()
            # item["location"] = body.xpath('').extract()
            item["phone"] = body.xpath('//a[@class="phone"]//text()').extract()
            item["website"] = body.xpath('//*[@id="results-wrapper"]/div[1]/div/div[2]/ul[2]/li[2]/a').extract()
            # item["email"] = body.xpath('').extract()
            item["description"] = body.xpath('//*[@id="info-page"]/div[2]/div//text()').extract()
            item["camp_size"] = body.xpath('//*[@id="info-page"]/div[2]/div/ul[1]/li[1]/dd').extract()
            item["founded"] = body.xpath('//*[@id="info-page"]/div[2]/div/ul[1]/li[2]/dd').extract()
            item["gender"] = body.xpath('//*[@id="info-page"]/div[2]/div/ul[1]/li[3]/dd').extract()
            item["maximum_age"] = body.xpath('//*[@id="info-page"]/div[2]/div/ul[2]/li[1]/dd').extract()
            item["minimum_age"] = body.xpath('//*[@id="info-page"]/div[2]/div/ul[2]/li[2]/dd').extract()
            item["nearest_city"] = body.xpath('//*[@id="info-page"]/div[2]/div/ul[2]/li[3]/dd').extract()
            items.append(item)
        # return outside the loop so every matched item is collected, not just the first
        return items

I checked robots.txt, which should allow crawling over most of their site. However, after reading into the source a little more I noticed this line:

Does that mean that even though it's not in /robots.txt it's still not considered allowed? I even tried without obeying robots.txt (to see if anything changed) and nothing different happened. But if someone does know that answer, that would be cool.
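(For reference, whether Scrapy downloads and honours /robots.txt is controlled by the ROBOTSTXT_OBEY setting; a minimal sketch of turning it off, assuming a standard project layout with a settings.py, looks like this:)

# settings.py
ROBOTSTXT_OBEY = False          # Scrapy will not request or honour /robots.txt

# or per spider, without touching the project settings:
class MySpider(CrawlSpider):
    name = "kids"
    custom_settings = {"ROBOTSTXT_OBEY": False}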

UPDATE

Found out that when I changed start_urls to start_url it works. The weird thing about this is that I have used start_urls for my other spiders and it works regardless of the (s). I wonder why it changes anything here.

Both the standard Scrapy spider class scrapy.spiders.Spider and the class scrapy.spiders.CrawlSpider use the attribute start_urls.

From the official documentation:

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
    ...

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/1.html']
    ...

The attribute start_url is not used anywhere.
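(The reason the trailing s matters is that the base Spider class's default start_requests() only reads self.start_urls. Roughly, as a simplified sketch rather than the exact Scrapy source, it behaves like this:)

import scrapy

class Spider:
    def start_requests(self):
        # Only start_urls (with the s) is consulted; an attribute named
        # start_url is simply ignored by the framework.
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True)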

It seems like the website you're crawling doesn't work well with Scrapy's default user agent.

Make sure the site is OK with you crawling it; if it is, agree on a user agent with them so that they know it's you. Setting the user agent in Scrapy is a matter of setting the user_agent spider attribute, for example:

class MySpider(Spider):
    user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
