How to get rid of exceptions.TypeError error?
I am writing a scraper using Scrapy. One of the things I want it to do is to compare the root domain of the current webpage with the root domain of the links within it. If these domains are different, then it has to proceed with extracting data. This is my current code:
class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']

    def parse(self, response):
        items = []
        for link in response.xpath("//a"):
            # Extract the root domain for the main website from the canonical URL
            hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
            hostname1 = urlparse(hostname1).hostname
            # Extract the root domain for the link
            hostname2 = link.xpath('@href').extract()
            hostname2 = urlparse(hostname2).hostname
            # Compare if the root domain of the website and the root domain of the link are different.
            # If so, extract the items & build the dictionary
            if hostname1 != hostname2:
                item = SocialMediaItem()
                item['SourceTitle'] = link.xpath('/html/head/title').extract()
                item['TargetTitle'] = link.xpath('text()').extract()
                item['link'] = link.xpath('@href').extract()
                items.append(item)
        return items
However, when I run it I get this error:
Traceback (most recent call last):
File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
self._startRunCallbacks(result)
File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "E:\Usuarios\Daniel\GitHub\SocialMedia-Web-Scraper\socialmedia\socialmedia\spiders\SocialMedia.py", line 16, in parse
hostname1 = urlparse(hostname1).hostname
File "C:\Anaconda\lib\urlparse.py", line 143, in urlparse
tuple = urlsplit(url, scheme, allow_fragments)
File "C:\Anaconda\lib\urlparse.py", line 176, in urlsplit
cached = _parse_cache.get(key, None)
exceptions.TypeError: unhashable type: 'list'
Can anyone help me get rid of this error? I gather it has something to do with list keys, but I don't know how to solve it. Thank you so much!

Dani
There are a few things wrong here:

There is no need to calculate hostname1 inside the loop, since it always selects the same rel element, even when used on a sub-selector (the XPath expression is absolute rather than relative, which is the way you need it to be here).

The XPath expression for hostname1 is malformed and returns None, hence the error when trying to take only the first element, as proposed by Kevin. You have two single quotes in the expression instead of one escaped single quote or a double quote.

You are getting the rel element itself, when you should be getting its @href attribute. The XPath expression should be altered to reflect this.
After resolving these issues, the code could look something like this (not tested):
def parse(self, response):
    items = []
    hostname1 = response.xpath("/html/head/link[@rel='canonical']/@href").extract()[0]
    hostname1 = urlparse(hostname1).hostname
    for link in response.xpath("//a"):
        hostname2 = (link.xpath('@href').extract() or [''])[0]
        hostname2 = urlparse(hostname2).hostname
        # Compare and extract
        if hostname1 != hostname2:
            ...
    return items
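One caveat worth knowing with this comparison: urlparse only yields a hostname for absolute URLs. Relative hrefs, which are common in anchor tags, parse to a hostname of None, so they will always compare as "different" from the page's own domain. A small illustration (plain urllib.parse, no Scrapy):

```python
from urllib.parse import urlparse, urljoin

# Absolute URL: hostname is populated
print(urlparse('http://en.wikipedia.org/wiki/Social_media').hostname)
# -> en.wikipedia.org

# Relative href: hostname is None, so it never matches the page's own domain
print(urlparse('/wiki/Social_media').hostname)
# -> None

# Resolving against the page URL first gives a comparable hostname
resolved = urljoin('http://en.wikipedia.org/wiki/Social_media', '/wiki/Social_media')
print(urlparse(resolved).hostname)
# -> en.wikipedia.org
```

If you only want truly external links, resolving each href against response.url (e.g. with urljoin) before parsing it avoids counting every relative link as external.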
hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
hostname1 = urlparse(hostname1).hostname

extract returns a list of strings, but urlparse accepts only one string. Perhaps you should discard all but the first hostname found.

hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()[0]
hostname1 = urlparse(hostname1).hostname

And likewise for the other hostname:

hostname2 = link.xpath('@href').extract()[0]
hostname2 = urlparse(hostname2).hostname
If you're not certain whether the document even has a hostname, it may be useful to look before you leap:
hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
if not hostname1: continue
hostname1 = urlparse(hostname1[0]).hostname
hostname2 = link.xpath('@href').extract()
if not hostname2: continue
hostname2 = urlparse(hostname2[0]).hostname
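The look-before-you-leap check above can also be packaged into a small helper so it isn't repeated for every field. This is just a sketch; first_or_none is a hypothetical name, not part of Scrapy (newer Scrapy versions provide SelectorList.extract_first() for the same purpose):

```python
def first_or_none(values):
    """Return the first element of a list, or None if the list is empty."""
    return values[0] if values else None

# Empty extraction result: no crash, just None
print(first_or_none([]))
# -> None

# Non-empty result: the first match is kept, the rest discarded
print(first_or_none(['http://example.com/', 'http://example.org/']))
# -> http://example.com/
```

The caller can then write `href = first_or_none(link.xpath('@href').extract())` and skip the link when the result is None.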