How to get rid of exceptions.TypeError error?
I am writing a scraper using Scrapy. One of the things I want it to do is to compare the root domain of the current webpage with the root domain of the links within it. If these domains are different, then it has to proceed with extracting data. This is my current code:
class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']

    def parse(self, response):
        items = []
        for link in response.xpath("//a"):
            # Extract the root domain for the main website from the canonical URL
            hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
            hostname1 = urlparse(hostname1).hostname
            # Extract the root domain for the link
            hostname2 = link.xpath('@href').extract()
            hostname2 = urlparse(hostname2).hostname
            # Compare if the root domain of the website and the root domain of the link are different.
            # If so, extract the items & build the dictionary
            if hostname1 != hostname2:
                item = SocialMediaItem()
                item['SourceTitle'] = link.xpath('/html/head/title').extract()
                item['TargetTitle'] = link.xpath('text()').extract()
                item['link'] = link.xpath('@href').extract()
                items.append(item)
        return items
However, when I run it I get this error:
Traceback (most recent call last):
File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
self._startRunCallbacks(result)
File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "E:\Usuarios\Daniel\GitHub\SocialMedia-Web-Scraper\socialmedia\socialmedia\spiders\SocialMedia.py", line 16, in parse
hostname1 = urlparse(hostname1).hostname
File "C:\Anaconda\lib\urlparse.py", line 143, in urlparse
tuple = urlsplit(url, scheme, allow_fragments)
File "C:\Anaconda\lib\urlparse.py", line 176, in urlsplit
cached = _parse_cache.get(key, None)
exceptions.TypeError: unhashable type: 'list'
Can anyone help me get rid of this error? I gather it has something to do with list keys, but I don't know how to solve it. Thank you so much!

Dani
There are a few things wrong here:

There is no need to calculate hostname1 inside the loop, since it always selects the same rel element, even when used on a sub-selector (the XPath expression is absolute rather than relative, which is the way you need it to be here).

The XPath expression for hostname1 is malformed and returns None, hence the error when trying to take only the first element, as proposed by Kevin. You have two single quotes in the expression instead of one escaped single quote or a double quote.

You are getting the rel element itself, when you should be getting its @href attribute. The XPath expression should be altered to reflect this.
After resolving these issues, the code could look something like this (not tested):
def parse(self, response):
    items = []
    hostname1 = response.xpath("/html/head/link[@rel='canonical']/@href").extract()[0]
    hostname1 = urlparse(hostname1).hostname
    for link in response.xpath("//a"):
        hostname2 = (link.xpath('@href').extract() or [''])[0]
        hostname2 = urlparse(hostname2).hostname
        # Compare and extract
        if hostname1 != hostname2:
            ...
    return items
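One caveat worth knowing with this comparison: urlparse only yields a hostname for absolute URLs. Relative hrefs, which are common in anchor tags, parse to a hostname of None, so they will always compare as "different" from the page's own domain. A small illustration (plain urllib.parse, no Scrapy):

```python
from urllib.parse import urlparse, urljoin

# Absolute URL: hostname is populated
print(urlparse('http://en.wikipedia.org/wiki/Social_media').hostname)
# -> en.wikipedia.org

# Relative href: hostname is None, so it never matches the page's own domain
print(urlparse('/wiki/Social_media').hostname)
# -> None

# Resolving against the page URL first gives a comparable hostname
resolved = urljoin('http://en.wikipedia.org/wiki/Social_media', '/wiki/Social_media')
print(urlparse(resolved).hostname)
# -> en.wikipedia.org
```

If you only want truly external links, resolving each href against response.url (e.g. with urljoin) before parsing it avoids counting every relative link as external.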
hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
hostname1 = urlparse(hostname1).hostname

extract returns a list of strings, but urlparse accepts only one string. Perhaps you should discard all but the first hostname found.

hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()[0]
hostname1 = urlparse(hostname1).hostname

And likewise for the other hostname:

hostname2 = link.xpath('@href').extract()[0]
hostname2 = urlparse(hostname2).hostname
If you're not certain whether the document even has a hostname, it may be useful to look before you leap:
hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
if not hostname1: continue
hostname1 = urlparse(hostname1[0]).hostname
hostname2 = link.xpath('@href').extract()
if not hostname2: continue
hostname2 = urlparse(hostname2[0]).hostname
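The look-before-you-leap check above can also be packaged into a small helper so it isn't repeated for every field. This is just a sketch; first_or_none is a hypothetical name, not part of Scrapy (newer Scrapy versions provide SelectorList.extract_first() for the same purpose):

```python
def first_or_none(values):
    """Return the first element of a list, or None if the list is empty."""
    return values[0] if values else None

# Empty extraction result: no crash, just None
print(first_or_none([]))
# -> None

# Non-empty result: the first match is kept, the rest discarded
print(first_or_none(['http://example.com/', 'http://example.org/']))
# -> http://example.com/
```

The caller can then write `href = first_or_none(link.xpath('@href').extract())` and skip the link when the result is None.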