Why isn't scrapy finding anything on my local site?

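I have a local site running at http://service.localhost:8021, and I'm trying to scrape the image links (the src attribute) from it. When I crawl it, the spider does seem to reach the site (I get a 200 response), but no links are returned.

My script is: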

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from bs4 import BeautifulSoup
import urllib
class crawlImages(CrawlSpider):
    name = 'crawlImages'
    allowed_domains = ["service.localhost"]
    start_urls = ['http://service.localhost:8021']
    def parse(self, response):

    titles = response.css('img::attr(alt)').extract()
    links = response.css('img::attr(src)').extract()
    print('##########')
    for item in zip(titles, links):
        all_items = {
            'title' : BeautifulSoup(item[0]).text,
            'link' :  item[1]
        }
        print(item[1])
        
        yield all_items

I run it like this:

scrapy runspider crawlImages.py -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" -s ROBOTSTXT_OBEY=False

The output I get: (removed, since posting it here is not allowed.)

Any hints?

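You can print titles and links for debugging. Also, your function is indented incorrectly: the body of parse has to be indented one level deeper than the def line.

Change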

def parse(self, response):

titles = response.css('img::attr(alt)').extract()
links = response.css('img::attr(src)').extract()
print('##########')
for item in zip(titles, links):
    all_items = {
        'title' : BeautifulSoup(item[0]).text,
        'link' :  item[1]
    }
    print(item[1])
    
    yield all_items

to

def parse(self, response):
    titles = response.css('img::attr(alt)').extract()
    links = response.css('img::attr(src)').extract()
    print('##########')
    for item in zip(titles, links):
        all_items = {
            'title' : BeautifulSoup(item[0]).text,
            'link' :  item[1]
        }
        print(item[1])
        
        yield all_items

Inspect response.body with:

print(response.body)
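
A full page dump can be hard to read, so the sketch below shows one way to summarize it instead: it lists the src value of every img tag found in the raw body, using only the standard library. The names ImgCollector, image_sources and sample_body are hypothetical; in the spider you would pass response.body. If the result is an empty list even though the browser shows images, the tags are probably added by JavaScript after the page loads, which Scrapy does not execute.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag seen in the markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; src may be absent
            self.sources.append(dict(attrs).get("src"))


def image_sources(body, encoding="utf-8"):
    """Return the src value (or None) of each <img> in a raw response body."""
    parser = ImgCollector()
    parser.feed(body.decode(encoding, errors="replace"))
    parser.close()
    return parser.sources


# Hypothetical body, standing in for response.body inside the spider.
sample_body = b'<html><body><img src="/a.png" alt="A"><img alt="no src"></body></html>'
print(image_sources(sample_body))
```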

Try rewriting the loop:

def parse(self, response):
    print(response.body)
    for img in response.xpath('//img'):
        title = img.xpath('./@alt').get()
        link = img.xpath('./@src').get()
        item = {}
        item['title'] = title
        item['link'] = link
        print(item)
        yield item
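
One reason to loop over each img element, rather than zipping two separately extracted lists, is that zip pairs values by position and stops at the shorter list. The sketch below uses made-up lists in place of the extract() results to illustrate this; whether it is the cause on your site depends on whether your img tags carry alt attributes.

```python
# Hypothetical values: a page with three <img> tags where the second has no alt.
titles = ["Logo", "Banner"]                          # stand-in for img::attr(alt) results
links = ["/logo.png", "/photo.png", "/banner.png"]   # stand-in for img::attr(src) results

# zip pairs by position, so "Banner" is attached to the wrong image
# and the third link is dropped.
paired = list(zip(titles, links))
print(paired)

# If no image has an alt attribute at all, titles is empty and zip yields nothing,
# even though the links were found.
nothing = list(zip([], links))
print(nothing)
```

In the second case the spider would print the marker line, yield no items, and still report a 200 response, which matches the symptom described in the question.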

