Scrapy - Does urlparse.urljoin behave in the same way as str.join?

I am trying to use urlparse.urljoin within a Scrapy spider to compile a list of URLs to scrape. Currently, my spider is returning nothing, but not throwing any errors. So I am trying to check that I am compiling the URLs correctly.

My attempt was to test this in IDLE using str.join, as below:

>>> href = ['lphs.asp?id=598&city=london',
 'lphs.asp?id=480&city=london',
 'lphs.asp?id=1808&city=london',
 'lphs.asp?id=1662&city=london',
 'lphs.asp?id=502&city=london',]
>>> for x in href:
    base = "http:/www.url-base.com/destination/"
    final_url = str.join(base, x)
    print(final_url)

One line of what that returns:

lhttp:/www.url-base.com/destination/phttp:/www.url-base.com/destination/hhttp:/www.url-base.com/destination/shttp:/www.url-base.com/destination/.http:/www.url-base.com/destination/ahttp:/www.url-base.com/destination/shttp:/www.url-base.com/destination/phttp:/www.url-base.com/destination/?http:/www.url-base.com/destination/ihttp:/www.url-base.com/destination/dhttp:/www.url-base.com/destination/=http:/www.url-base.com/destination/5http:/www.url-base.com/destination/9http:/www.url-base.com/destination/8http:/www.url-base.com/destination/&http:/www.url-base.com/destination/chttp:/www.url-base.com/destination/ihttp:/www.url-base.com/destination/thttp:/www.url-base.com/destination/yhttp:/www.url-base.com/destination/=http:/www.url-base.com/destination/lhttp:/www.url-base.com/destination/ohttp:/www.url-base.com/destination/nhttp:/www.url-base.com/destination/dhttp:/www.url-base.com/destination/ohttp:/www.url-base.com/destination/n

I think my example makes it obvious that str.join does not behave in the same way. If it does, then this is why my spider is not following these links! However, it would be good to have confirmation on that.

If this is not the right way to test, how can I test this process?

Update: attempt using urlparse.urljoin below:

    >>> from urllib.parse import urlparse
    >>> for x in href:
        base = "http:/www.url-base.com/destination/"
        final_url = urlparse.urljoin(base, x)
        print(final_url)

Which is throwing AttributeError: 'function' object has no attribute 'urljoin'

Update - the spider function in question

def parse_links(self, response): 
    room_links = response.xpath('//form/table/tr/td/table//a[div]/@href').extract() # insert xpath which contains the href for the rooms 
    for link in room_links:
        base_url = "http://www.example.com/followthrough"
        final_url = urlparse.urljoin(base_url, link)
        print(final_url)
        # This is not joining final_url correctly
        yield Request(final_url, callback=parse_links)

Update

I just tested again in IDLE:

>>> from urllib.parse import urljoin
>>> from urllib import parse
>>> room_links = ['lphs.asp?id=562&city=london',
 'lphs.asp?id=1706&city=london',
 'lphs.asp?id=1826&city=london',
 'lphs.asp?id=541&city=london',
 'lphs.asp?id=1672&city=london',
 'lphs.asp?id=509&city=london',
 'lphs.asp?id=428&city=london',
 'lphs.asp?id=614&city=london',
 'lphs.asp?id=336&city=london',
 'lphs.asp?id=412&city=london',
 'lphs.asp?id=611&city=london',]
>>> for link in room_links:
    base_url = "http:/www.url-base.com/destination/"
    final_url = urlparse.urljoin(base_url, link)
    print(final_url)

Which threw this:

Traceback (most recent call last):
  File "<pyshell#34>", line 3, in <module>
    final_url = urlparse.urljoin(base_url, link)
AttributeError: 'function' object has no attribute 'urljoin'

You see that output because of this:

for x in href:
    base = "http:/www.url-base.com/destination/"
    final_url = str.join(base, href)   # <-- 'x' instead of 'href' probably intended here
    print(final_url)
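
For reference, `str.join(sep, iterable)` is the same call as `sep.join(iterable)`, and a single string is an iterable of its characters, which would explain why the base shows up between every character of the link. A minimal sketch with a shortened separator and link (both are made-up placeholders, not the values from the question):

```python
# str.join(sep, iterable) is equivalent to sep.join(iterable).
sep = "BASE/"          # placeholder standing in for the base URL
link = "a?b"           # placeholder standing in for one href

# A string is iterated character by character, so the separator
# ends up between every pair of characters.
interleaved = str.join(sep, link)
print(interleaved)     # aBASE/?BASE/b

# Passing a list instead puts the separator between the list items.
joined_list = str.join(sep, ["x", "y"])
print(joined_list)     # xBASE/y
```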

urljoin from the urllib library behaves differently; see the documentation. It's not simple string concatenation.
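
As a rough illustration of how `urljoin` resolves a relative link against a base, the sketch below uses a placeholder host (`www.example.com`) and path. Note that whether the base ends with a slash changes the result:

```python
from urllib.parse import urljoin

link = "lphs.asp?id=598&city=london"

# Base ends with a slash: the link is resolved inside that directory.
with_slash = urljoin("http://www.example.com/destination/", link)
print(with_slash)
# http://www.example.com/destination/lphs.asp?id=598&city=london

# Base has no trailing slash: the last path segment is replaced.
without_slash = urljoin("http://www.example.com/destination", link)
print(without_slash)
# http://www.example.com/lphs.asp?id=598&city=london
```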

EDIT: Based on your comment, I suppose you are using Python 3. With that import statement, you import the urlparse function, which has no urljoin attribute. That's why you get that error. Either import the function and use it directly:

from urllib.parse import urljoin
...
final_url = urljoin(base, x)

or import the parse module and use the function like this:

from urllib import parse
...
final_url = parse.urljoin(base, x)
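
Putting the pieces together, a sketch of the IDLE test from the question might look like this under Python 3. The base URL is a placeholder written with `http://` (two slashes) and a trailing slash, the link list is shortened, and the `hasattr` check is only there to illustrate why `urlparse.urljoin` raised `AttributeError`:

```python
from urllib.parse import urljoin, urlparse

# In Python 3, urlparse is a function in urllib.parse, not a module,
# so it has no urljoin attribute.
print(hasattr(urlparse, "urljoin"))  # False

room_links = [
    "lphs.asp?id=562&city=london",
    "lphs.asp?id=1706&city=london",
]
base_url = "http://www.example.com/destination/"  # placeholder base

# Resolve each relative link against the base URL.
final_urls = [urljoin(base_url, link) for link in room_links]
for final_url in final_urls:
    print(final_url)
```

Inside a Scrapy callback, `response.urljoin(link)` may be a simpler option, since it resolves the link against the response's own URL instead of a hard-coded base.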
