I am trying to use urlparse.urljoin within a Scrapy spider to compile a list of URLs to scrape. Currently, my spider returns nothing but does not throw any errors, so I am trying to check that I am compiling the URLs correctly.
My attempt was to test this in IDLE using str.join, as below:
>>> href = ['lphs.asp?id=598&city=london',
'lphs.asp?id=480&city=london',
'lphs.asp?id=1808&city=london',
'lphs.asp?id=1662&city=london',
'lphs.asp?id=502&city=london',]
>>> for x in href:
        base = "http:/www.url-base.com/destination/"
        final_url = str.join(base, x)
        print(final_url)
One line of what that returns:
lhttp:/www.url-base.com/destination/phttp:/www.url-base.com/destination/hhttp:/www.url-base.com/destination/shttp:/www.url-base.com/destination/.http:/www.url-base.com/destination/ahttp:/www.url-base.com/destination/shttp:/www.url-base.com/destination/phttp:/www.url-base.com/destination/?http:/www.url-base.com/destination/ihttp:/www.url-base.com/destination/dhttp:/www.url-base.com/destination/=http:/www.url-base.com/destination/5http:/www.url-base.com/destination/9http:/www.url-base.com/destination/8http:/www.url-base.com/destination/&http:/www.url-base.com/destination/chttp:/www.url-base.com/destination/ihttp:/www.url-base.com/destination/thttp:/www.url-base.com/destination/yhttp:/www.url-base.com/destination/=http:/www.url-base.com/destination/lhttp:/www.url-base.com/destination/ohttp:/www.url-base.com/destination/nhttp:/www.url-base.com/destination/dhttp:/www.url-base.com/destination/ohttp:/www.url-base.com/destination/n
I think my example makes it obvious that str.join does not behave the same way as urljoin (if it does, then that is why my spider is not following these links!), but it would be good to have confirmation of that.
If this is not the right way to test, how can I test this process?
Update: attempt using urlparse.urljoin, below:
>>> from urllib.parse import urlparse
>>> for x in href:
        base = "http:/www.url-base.com/destination/"
        final_url = urlparse.urljoin(base, x)
        print(final_url)
Which is throwing AttributeError: 'function' object has no attribute 'urljoin'
Update - the spider function in question
def parse_links(self, response):
    # insert xpath which contains the href for the rooms
    room_links = response.xpath('//form/table/tr/td/table//a[div]/@href').extract()
    for link in room_links:
        base_url = "http://www.example.com/followthrough"
        final_url = urlparse.urljoin(base_url, link)
        print(final_url)
        # This is not joining the final_url right
        yield Request(final_url, callback=parse_links)
Update
I just tested again in idle:
>>> from urllib.parse import urljoin
>>> from urllib import parse
>>> room_links = ['lphs.asp?id=562&city=london',
'lphs.asp?id=1706&city=london',
'lphs.asp?id=1826&city=london',
'lphs.asp?id=541&city=london',
'lphs.asp?id=1672&city=london',
'lphs.asp?id=509&city=london',
'lphs.asp?id=428&city=london',
'lphs.asp?id=614&city=london',
'lphs.asp?id=336&city=london',
'lphs.asp?id=412&city=london',
'lphs.asp?id=611&city=london',]
>>> for link in room_links:
        base_url = "http:/www.url-base.com/destination/"
        final_url = urlparse.urljoin(base_url, link)
        print(final_url)
Which threw this:
Traceback (most recent call last):
File "<pyshell#34>", line 3, in <module>
final_url = urlparse.urljoin(base_url, link)
AttributeError: 'function' object has no attribute 'urljoin'
You see that output because of this line:
for x in href:
    base = "http:/www.url-base.com/destination/"
    final_url = str.join(base, x)  # <-- same as base.join(x): base is inserted between every character of x
    print(final_url)
urljoin from urllib.parse behaves differently: it resolves the second argument as a relative reference against the base URL. See the documentation. It's not simple string concatenation.
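To make the difference concrete, here is a minimal sketch that runs both calls side by side. The base URL and the short relative link are made-up values chosen for illustration, not the ones from the question.

```python
from urllib.parse import urljoin

# Assumed values for illustration only.
base = "http://www.example.com/destination/"
link = "ab?id=1"

# str.join(base, link) is the same call as base.join(link):
# base is used as the separator between the characters of link.
joined = str.join(base, link)

# urljoin resolves link as a relative reference against base.
resolved = urljoin(base, link)

print(joined)
print(resolved)
```

The first print shows the base repeated between every character of the link, which is the pattern in the question's output; the second shows a single, properly resolved URL.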
EDIT: Based on your comment, I suppose you are using Python 3. With that import statement, you import the urlparse function, not a module, which is why you get that error. Either import the function and use it directly:
from urllib.parse import urljoin
...
final_url = urljoin(base, x)
or import the parse module and call the function through it:
from urllib import parse
...
final_url = parse.urljoin(base, x)
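One way to test the joining step outside the spider is to put it in a small function and run it on a few of the extracted links. The sketch below does that; build_urls is a hypothetical helper name, and the base URLs are assumptions modelled on the one in the spider. It also compares a base with and without a trailing slash, since urljoin treats those differently. Separately, the base URLs in the IDLE tests start with http:/ (a single slash), which is worth double-checking.

```python
from urllib.parse import urljoin

def build_urls(base_url, links):
    """Resolve each relative link against base_url (hypothetical helper)."""
    return [urljoin(base_url, link) for link in links]

# Two of the relative links from the question.
links = ["lphs.asp?id=598&city=london", "lphs.asp?id=480&city=london"]

# Assumed base URLs: identical except for the trailing slash.
with_slash = build_urls("http://www.example.com/followthrough/", links)
without_slash = build_urls("http://www.example.com/followthrough", links)

for url in with_slash + without_slash:
    print(url)
```

With the trailing slash, the link is appended under followthrough/. Without it, urljoin treats followthrough as the last path segment and replaces it, so the results point at http://www.example.com/lphs.asp?... instead. If the spider's base has no trailing slash, that may be part of why the links are not followed. Inside a Scrapy callback, response.urljoin(link) should perform the same kind of join against the URL of the response itself, which avoids hard-coding the base.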