
Craigslist Scraper using Scrapy Spider not performing functions

2021-05-07 10:07:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/robots.txt> (referer: None)
2021-05-07 10:07:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa/> (referer: None)
2021-05-07 10:07:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa?s=120> (referer: https://tampa.craigslist.org/d/cell-phones/search/moa/)
2021-05-07 10:07:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa?s=240> (referer: https://tampa.craigslist.org/d/cell-phones/search/moa?s=120)

This is the output I get. It seems like the spider just moves from one page of results to the next, selecting the next button and issuing the pagination request at the bottom of `parse`, without ever scraping the listings:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy import Request  # Request lives in the top-level scrapy package, not scrapy.spiders
from scrapy.spiders import Rule
from craig.items import CraigItem
from scrapy.selector import Selector


class PhonesSpider(scrapy.Spider):
    name = 'phones'
    allowed_domains = ['tampa.craigslist.org']
    start_urls = ['https://tampa.craigslist.org/d/cell-phones/search/moa/']



    def parse(self, response):
        phones = response.xpath('//p[@class="result-info"]')
        for phone in phones:
            relative_url = phone.xpath('a/@href').extract_first()
            absolute_url = response.urljoin(relative_url)
            title = phone.xpath('a/text()').extract_first()
            price = phone.xpath('//*[@id="sortable-results"]/ul/li[3]/a/span').extract_first()
            yield Request(absolute_url, callback=self.parse_item, meta={'URL': absolute_url, 'Title': title, 'price': price})
            
        
        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        absolute_next_url = "https://tampa.craigslist.org" + relative_next_url
        yield Request(absolute_next_url, callback=self.parse)

            
            
    def parse_item(self, response):
        item = CraigItem()
        title = response.meta.get('Title')
        price = response.meta.get('price')
        absolute_url = response.meta.get('URL')
        item["cl_id"] = title
        item["price"] = price

        yield {'URL': absolute_url, 'Title': title, 'price': price}


It seems like the `for phone in phones` loop in my code never runs, which means `parse_item` is never called and the spider just keeps requesting the next URL. I am following some tutorials and reading the documentation, but I'm still having trouble grasping what I am doing wrong. I have some hobby experience coding Arduinos from when I was young, but no professional coding experience; this is my first foray into a project like this. I have an OK grasp of the basics of loops, functions, callbacks, etc.

Any help is greatly appreciated.

UPDATE

Current output:

2021-05-07 15:29:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/robots.txt> (referer: None)
2021-05-07 15:29:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa/> (referer: None)
2021-05-07 15:29:33 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-05-07 15:29:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html> (referer: https://tampa.craigslist.org/d/cell-phones/search/moa/)
2021-05-07 15:29:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html>
{'cl_id': 'postid_7309734640',
 'price': '$35',
 'title': 'Cut that high cable bill, switch to SPC TV and save. 1400 hd '
          'channels',
 'url': 'https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html'}
2021-05-07 15:29:36 [scrapy.core.engine] INFO: Closing spider (finished)

CURRENT CODE:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy import Request  # Request lives in the top-level scrapy package, not scrapy.spiders
from scrapy.spiders import Rule
from craig.items import CraigItem
from scrapy.selector import Selector


class PhonesSpider(scrapy.Spider):
    name = 'phones'
    allowed_domains = ['tampa.craigslist.org']
    start_urls = ['https://tampa.craigslist.org/d/cell-phones/search/moa/']
    base_url = 'https://tampa.craigslist.org'
    



    def parse(self, response):
        phones = response.xpath('//div[@class="result-info"]')
        
        for phone in phones:
        
            x = response.meta.get('x')
            n = -1
            
            url = response.xpath('//a[@class="result-title hdrlnk"]/@href').getall()
            relative_url = phone.xpath('//a[@class="result-title hdrlnk"]/@href').get()
            absolute_url = response.urljoin(relative_url)
            title = phone.xpath('//a[@class="result-title hdrlnk"]/text()').getall()
            price = phone.xpath('//span[@class="result-price"]/text()').getall()
            cl_id = phone.xpath('//a[@class="result-title hdrlnk"]/@id').getall()
            yield Request(absolute_url, callback=self.parse_item, meta={'absolute_url': absolute_url, 'url': url, 'title': title, 'price': price, 'cl_id': cl_id, 'n': n})

    def parse_item(self, response):
        n = response.meta.get('n')
        x = n + 1
        
        item = CraigItem()
        item["title"] = response.meta.get('title')[x]
        item["cl_id"] = response.meta.get('cl_id')[x]
        item["price"] = response.meta.get('price')[x]
        item["url"] = response.meta.get('url')[x]

        yield item
        
        absolute_next_url = response.meta.get('url')[x]
        absolute_url = response.meta.get('absolute_url')
        
        yield Request(absolute_next_url, callback=self.parse, meta={'x': x})


I am now able to retrieve the desired content for a posting (URL, price, title and craigslist id), but now my spider automatically closes after pulling just one result. I am having trouble understanding how to share variables (`x` and `n`) between the two functions. Logically, after pulling one listing's data in the format above:

cl_id, price, title, url

I would like to proceed back to the initial `parse` function and move on to the next item in the list of URLs retrieved by

response.xpath('//a[@class="result-title hdrlnk"]/@href').getall()

which, when run in the scrapy shell, successfully pulls all the URLs.

How do I go about implementing this logic: start with `[0]` in the list, run `parse`, run `parse_item`, output the item, then update a variable (`n`, which starts at 0 and needs to increase by 1 after each item), call `n` in `parse_item` with its updated value and use, for example, `item["title"] = response.meta.get('title')[x]` to pick the right entry from each list, then run `parse_item` again, outputting one item at a time, until all the values in the URL list have been output with their related price, cl_id, and title?

I know the code is messy as hell and I don't fully understand the basics yet, but I'm committed to getting this to work and learning it the hard way rather than starting Python from the ground up.

ANSWER

The class `result-info` is used on the `div` element, not on a `p`, so you should write:

phones = response.xpath('//div[@class="result-info"]')

That being said, I didn't check/fix your spider further (it seems there are only parsing errors, not functional ones). As a suggestion for the future, you can use the Scrapy shell for quickly debugging these issues:

scrapy shell "your-url-here"
