
Printing 'response' from a Scrapy request

I am trying to learn Scrapy, and while following a tutorial I am trying to make minor adjustments.

I simply want to get the response content from a request. I will then pass the response into the tutorial code, but I am unable to make a request and get the content of the response. Advice would be nice.

from scrapy.http import Response

url = "https://www.myUrl.com"
response = Response(url=url)
print(response)  # <200 https://www.myUrl.com>

# but I want the content, and I can't find the method

Scrapy is a somewhat complicated framework. You can't just create requests and responses the way you are trying to here.
Scrapy is split into several parts, such as the Downloader, which downloads the requests scheduled by the Scheduler. In short, you would need to start all of those parts in your code just to get a response like that.

Scrapy's documentation has an illustration and description of the whole architecture.


What you can do, though, is simply use the scrapy shell command, which downloads the URL content and lets you interact with it:

$ scrapy shell "http://stackoverflow.com"
....
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f14d9fef5f8>
[s]   item       {}
[s]   request    <GET http://stackoverflow.com>
[s]   response   <200 http://stackoverflow.com>
[s]   settings   <scrapy.settings.Settings object at 0x7f14d8d0f9e8>
[s]   spider     <DefaultSpider 'default' at 0x7f14d8af4f28>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: len(response.body)
Out[1]: 244649

Another alternative is to write a spider and call inspect_response() inside your parse function.

import scrapy 
from scrapy.shell import inspect_response

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://stackoverflow.com',]

    def parse(self, response):
        inspect_response(response, self)
        # shell will open up here just like from the first example

If you just want to print all the content:

print(response.text)

I agree with what granito.. pointed out. Summing up Scrapy is a tough bet, even if you only cover the framework itself. You will understand it better as you work through your tutorials; the best learning resources you have are your own logic, dedication, and Google. From your code snippet I can tell you are coming from bs4, which is great, and you can use it inside a Scrapy spider. I can also tell that you have only just started learning, since you have not defined or named a spider class. Nothing wrong with that!

As for your question about getting the content, that is covered by any Scrapy tutorial ever written. Data mining with Scrapy is 99.9% just this: selecting what data, and how?

Using the page's CSS elements in your spider: you define an item from them, using the page's response or your mutated (changed) version of it, and then export it with yield or return. Printing is usually done more for logging purposes. The element might be a link, just text, or a file.

Using XPath, in the same fashion as CSS, though the two are structured differently.

Using regular expressions, which will almost certainly become a must, but let's take baby steps.

The entirety of data mining is extracting your content. I feel as if I'd be robbing you of your own moment if I just told you. Do the tutorial from the official Scrapy docs, referred to as the quotes tutorial. If you still have ANY questions about what happened in that tutorial, I'll share my class course on this intro step (which I get paid for, but for you it's free). It basically takes a little knowledge of CSS and of how to use a web browser's inspect tools, or go old school and just view source. I really wish I could help; my nerd senses are tingling, but I can't take your moment of epiphany. I bet some will, but then you gain nothing, right?

PS:

As to your first question about getting the content: all of it? The entire HTML? The body, all the links, or just the links that contain X? Let's say we're talking about a simple blog page with an article title, a date, links, and images inside. I'm sure you know this, but when you say page content you are referring to the entirety of the page. The data you mine will only be as valuable as the format you can express it in and, more importantly, how you use it against other data to create an analysis, a conclusion based on data. If you just want the entire HTML source, then, like our friend Granito-Whachamacallhim said: response.body

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you reproduce them, please cite this site's URL or the original source.

 