
TypeError: cannot concatenate 'str' and 'NoneType' objects when placing the custom url in scrapy.Request()

The URL I extract cannot be used directly to fetch data from the next page, so I created a variable base_url = 'http://www.marinetraffic.com' and prepended it: port_homepage_url = base_url + port_homepage_url. This works fine when I yield the result like this: yield {'a': port_homepage_url, 'b': item['port_name']}. I get the result I wanted:

http://www.marinetraffic.com/en/ais/index/ships/range/port_id:20585/port_name:FUJAIRAH%20ANCH,FUJAIRAH ANCH

However, if I pass it to a Scrapy request, yield scrapy.Request(port_homepage_url, callback=self.parse, meta={'item': item}), I get this error:

port_homepage_url = base_url +  port_homepage_url
TypeError: cannot concatenate 'str' and 'NoneType' objects

Here is the code:

class GetVessel(scrapy.Spider):
    name = "getvessel"
    allowed_domains = ["marinetraffic.com"]
    start_urls = [
        'http://www.marinetraffic.com/en/ais/index/ports/all/flag:AE',
    ]


    def parse(self, response):
        item = VesseltrackerItem()
        base_url = 'http://www.marinetraffic.com'
        for ports in response.xpath('//table/tr[position()>1]'):
            item['port_name'] = ports.xpath('td[2]/a/text()').extract_first()
            port_homepage_url = ports.xpath('td[7]/a/@href').extract_first()
            port_homepage_url = base_url +  port_homepage_url
            yield scrapy.Request(port_homepage_url, callback=self.parse, meta={'item': item})

The problem does not occur on the initial start URL page; it occurs later, when subsequent requests are processed. Because those requests also use callback=self.parse, the same row-parsing logic runs on the pages they fetch. On some of those pages there are no links in the seventh td element, so ports.xpath('td[7]/a/@href').extract_first() returns None, and the line port_homepage_url = base_url + port_homepage_url fails.
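One way to avoid the crash is to check for None before building the URL. Below is a minimal sketch, not the asker's code: the helper name build_port_url and the BASE_URL constant are made up for illustration, and it is written for Python 3 (on Python 2, urljoin lives in the urlparse module).

```python
from urllib.parse import urljoin

# Assumed base, taken from the base_url value in the question.
BASE_URL = 'http://www.marinetraffic.com'


def build_port_url(href, base_url=BASE_URL):
    """Return an absolute URL, or None when the table row had no link.

    `href` is whatever extract_first() returned: a string or None.
    """
    if href is None:
        # No <a> in the cell, so there is nothing to request.
        return None
    # urljoin handles both relative paths and already-absolute URLs.
    return urljoin(base_url, href)
```

In the loop, a row whose result is None can simply be skipped with continue instead of being passed to scrapy.Request. Inside a spider callback, response.urljoin(href) performs the same join against the URL of the current response.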

How to approach the problem depends on what you were planning to do on the "port" pages. From what I understand, you did not mean to handle the "port" page requests with self.parse; you need a separate callback with different logic inside.


 