
Scrapy - Does urlparse.urljoin behave in the same way as str.join?

I am trying to use urlparse.urljoin within a Scrapy spider to compile a list of URLs to scrape. Currently, my spider returns nothing but does not throw any errors, so I am trying to check that I am compiling the URLs correctly.

My attempt was to test this in IDLE using str.join, as below:

>>> href = ['lphs.asp?id=598&city=london',
>>> for x in href:
    base = "http:/www.url-base.com/destination/"
    final_url = str.join(base, x)

One line of what that returns:


I think my example makes it obvious that str.join does not behave in the same way as urljoin. If it did, that would explain why my spider is not following these links. Still, it would be good to have confirmation of that.

If this is not the right way to test, how can I test this process?

Update: attempt using urlparse.urljoin below:

    >>> from urllib.parse import urlparse
    >>> for x in href:
        base = "http:/www.url-base.com/destination/"
        final_url = urlparse.urljoin(base, x)

Which is throwing AttributeError: 'function' object has no attribute 'urljoin'

Update - the spider function in question

def parse_links(self, response): 
    room_links = response.xpath('//form/table/tr/td/table//a[div]/@href').extract() # insert xpath which contains the href for the rooms 
    for link in room_links:
        base_url = "http://www.example.com/followthrough"
        final_url = urlparse.urljoin(base_url, link)
        # This is not joining the final_url correctly
        yield Request(final_url, callback=parse_links)


I just tested again in IDLE:

>>> from urllib.parse import urljoin
>>> from urllib import parse
>>> room_links = ['lphs.asp?id=562&city=london',
>>> for link in room_links:
    base_url = "http:/www.url-base.com/destination/"
    final_url = urlparse.urljoin(base_url, link)

Which threw this:

Traceback (most recent call last):
  File "<pyshell#34>", line 3, in <module>
    final_url = urlparse.urljoin(base_url, link)
AttributeError: 'function' object has no attribute 'urljoin'

You get that output because of this line:

for x in href:
    base = "http:/www.url-base.com/destination/"
    final_url = str.join(base, href)   # <-- 'x' instead of 'href' probably intended here

urljoin from urllib behaves differently; see the documentation. It resolves a relative URL against a base URL, which is not simple string concatenation.
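
To make the difference concrete, here is a minimal sketch for Python 3. The base URL is an assumption (a well-formed stand-in with `http://`); the relative link is the one from the question.

```python
from urllib.parse import urljoin

base = "http://www.example.com/destination/"  # assumed, well-formed base URL
link = "lphs.asp?id=598&city=london"          # relative href from the question

# str.join(sep, iterable) puts sep between the items of the iterable.
# A string is an iterable of characters, so sep lands between every character.
print(str.join("-", "abc"))   # a-b-c
print(str.join(base, "ab"))   # ahttp://www.example.com/destination/b

# urljoin resolves the relative reference against the base URL instead.
print(urljoin(base, link))
# http://www.example.com/destination/lphs.asp?id=598&city=london
```

So str.join(base, x) with a string x scatters the base between the characters of x, which is why the result looks nothing like a URL.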

EDIT: Based on your comment, I suppose you are using Python 3. With that import statement you import the urlparse function, not a module, and a function has no urljoin attribute. That's why you get that error. Either import the function and use it directly:

from urllib.parse import urljoin
final_url = urljoin(base, x)

or import the parse module and call the function through it:

from urllib import parse
final_url = parse.urljoin(base, x)
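
As for how to test this outside the spider, a small script along these lines may help. It is only a sketch: `build_url` is a hypothetical helper, and the base URLs are the placeholder from the question's spider, with and without a trailing slash.

```python
from urllib import parse
from urllib.parse import urljoin, urlparse

# In Python 3, urlparse is a function, so it has no urljoin attribute.
print(hasattr(urlparse, "urljoin"))  # False

# Both import styles refer to the same function.
print(parse.urljoin is urljoin)      # True

link = "lphs.asp?id=562&city=london"  # relative href from the question


def build_url(base_url, href):
    """Hypothetical helper: resolve href against base_url."""
    return urljoin(base_url, href)


# Without a trailing slash, the last path segment of the base is replaced.
no_slash = build_url("http://www.example.com/followthrough", link)
print(no_slash)
# http://www.example.com/lphs.asp?id=562&city=london

# With a trailing slash, the link is resolved inside that directory.
with_slash = build_url("http://www.example.com/followthrough/", link)
print(with_slash)
# http://www.example.com/followthrough/lphs.asp?id=562&city=london
```

Printing the joined URLs like this before yielding requests shows whether the base URL needs a trailing slash for the links you extract.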

