I get a URL that cannot be used directly to fetch data from the next page, so I created a base_url = 'http://www.marinetraffic.com'
variable and prepended it before passing the URL to the Scrapy request: port_homepage_url = base_url + port_homepage_url
. It works fine when I yield the result like this: yield {'a': port_homepage_url, 'b': item['port_name']}
I get the result I wanted.
However, if I place it in a Scrapy request, yield scrapy.Request(port_homepage_url, callback=self.parse, meta={'item': item})
I get this error:
port_homepage_url = base_url + port_homepage_url
TypeError: cannot concatenate 'str' and 'NoneType' objects
Here is the code:
class GetVessel(scrapy.Spider):
    name = "getvessel"
    allowed_domains = ["marinetraffic.com"]
    start_urls = [
        'http://www.marinetraffic.com/en/ais/index/ports/all/flag:AE',
    ]

    def parse(self, response):
        item = VesseltrackerItem()
        base_url = 'http://www.marinetraffic.com'
        for ports in response.xpath('//table/tr[position()>1]'):
            item['port_name'] = ports.xpath('td[2]/a/text()').extract_first()
            port_homepage_url = ports.xpath('td[7]/a/@href').extract_first()
            port_homepage_url = base_url + port_homepage_url
            yield scrapy.Request(port_homepage_url, callback=self.parse, meta={'item': item})
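The problem does not occur on the initial start URL page; it occurs later, when the subsequent requests are processed. Because those requests use callback=self.parse, every "port" page is run through the same table-parsing logic as the start page. Take, for example, this page. There are no links in the 7th td
element and, hence, ports.xpath('td[7]/a/@href').extract_first()
returns None,
which causes the port_homepage_url = base_url + port_homepage_url
line to fail. This is also why yielding a plain dict works: no follow-up requests are made, so only the start page is ever parsed.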
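To illustrate the failure mode, here is a minimal sketch of a guard against the missing link. The helper name build_port_url and the example path are hypothetical and not part of the original code. It also uses urljoin from the standard library instead of string concatenation, which is a swapped-in technique, and the import shown is for Python 3 (the traceback in the question suggests Python 2, where the function lives in the urlparse module).

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

BASE_URL = 'http://www.marinetraffic.com'


def build_port_url(href, base_url=BASE_URL):
    """Return an absolute URL, or None when the table row had no link.

    `href` is whatever extract_first() returned, so it may be None.
    """
    if href is None:
        # Nothing to join; let the caller decide to skip this row.
        return None
    return urljoin(base_url, href)


# Hypothetical usage inside the loop in parse():
#     url = build_port_url(ports.xpath('td[7]/a/@href').extract_first())
#     if url is None:
#         continue
```

A guard like this only stops the TypeError; it does not change the fact that the port pages are still being parsed by the wrong callback.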
How to approach the problem depends on what you were planning to do on the "port" pages. From what I understand, you did not mean to handle the "port" page requests with self.parse;
you need a separate callback with different logic inside.