
Printing the 'response' from a Scrapy request

I am trying to learn Scrapy and, while following a tutorial, I am making minor adjustments.

I simply want to get the response content for a request. I will then pass the response into the tutorial code, but I am unable to make a request and get the content of the response. Advice would be nice.

from scrapy.http import Response

url = "https://www.myUrl.com"
response = Response(url=url)
print(response)  # <200 myurl.com>

# but I want the content! and I can't find the method

Scrapy is a bit of a complicated framework. You can't just create requests and responses by hand the way you are trying to here.
Scrapy is split into several parts, such as the Downloader, which downloads the requests that the Scheduler schedules. In short, you would need to start all of those parts in your code as well just to download a single request like that.

You can see an illustration and description of the whole architecture here:

[Scrapy architecture diagram]
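If you really do want to start those parts from a plain script instead of the scrapy command, here is a minimal sketch using CrawlerProcess (a real Scrapy entry point); the spider name and the URL from the question are just placeholders:

import scrapy
from scrapy.crawler import CrawlerProcess

class PrintBodySpider(scrapy.Spider):
    name = "print_body"
    start_urls = ["https://www.myUrl.com"]  # URL from the question, purely illustrative

    def parse(self, response):
        # response arrives here already downloaded by Scrapy's Downloader
        print(response.text[:500])  # first 500 characters of the HTML

process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})
process.crawl(PrintBodySpider)
process.start()  # blocks until the crawl is finished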

What you can do more simply, though, is use the scrapy shell command, which downloads the URL's content and lets you interact with it:

$ scrapy shell "http://stackoverflow.com"
....
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f14d9fef5f8>
[s]   item       {}
[s]   request    <GET http://stackoverflow.com>
[s]   response   <200 http://stackoverflow.com>
[s]   settings   <scrapy.settings.Settings object at 0x7f14d8d0f9e8>
[s]   spider     <DefaultSpider 'default' at 0x7f14d8af4f28>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: len(response.body)
Out[1]: 244649
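Once the shell is open, the downloaded content is already on the response object. A few illustrative things you can type there (the selectors are just examples; on older Scrapy versions .get() is spelled .extract_first()):

# still inside the scrapy shell session started above
response.css("title::text").get()             # text of the <title> tag, via a CSS selector
response.xpath("//title/text()").get()        # the same thing with XPath
fetch("http://stackoverflow.com/questions")   # download another page; `response` is updated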

Another alternative is to just write a spider and call inspect_response() inside your parse method.

import scrapy 
from scrapy.shell import inspect_response

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://stackoverflow.com',]

    def parse(self, response):
        inspect_response(response, self)
        # shell will open up here just like from the first example
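To try it, save the spider to a single file (the name myspider.py below is just an assumption; no project scaffolding is needed) and run it with runspider; the interactive shell opens as soon as parse() is reached, with response already populated:

$ scrapy runspider myspider.py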

If you just want to print all the content:

print(response.text)
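Note that response.text is the decoded document, while response.body is the raw bytes; a quick sketch of the difference, assuming you are inside the shell or a parse callback:

# response.body -> raw bytes, exactly as received over the wire
# response.text -> the same document decoded to a string using the detected encoding
print(type(response.body))  # <class 'bytes'> (a byte string on Python 2)
print(type(response.text))  # <class 'str'>   (unicode on Python 2)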

I agree with the points granito made. Summing Scrapy up is still a tough bet, even if you only cover the framework itself. You will understand it better as you go through your tutorials; the best learning resources you have are your own logic, dedication, and Google. From your code snippet I can tell you are coming from bs4, which is great, and you can even use it inside a Scrapy spider. I can also tell that you really just started learning, since you haven't defined or named a spider class yet, and dude, nothing wrong with that!

As for your question about getting the content: again, it's covered in every Scrapy tutorial ever written. Data mining with Scrapy is 99.9% just this: selecting what data, and HOW?

Using the page's CSS elements in your spider, for which you define an item, and working with the page's response (or your mutated, newly changed version of it), you can then export the data with yield or return. Printing is usually done more for logging purposes. The element might be a link, just text... a file?

Using XPath works in the same fashion as CSS, but the two are structured differently.

Using regular expressions will become an almost certain must, but let's take baby steps (there is a short sketch of all three right below).
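To make those three concrete, here is a hedged sketch of a parse() method that selects the same piece of data with CSS, with XPath, and with a regular expression, and yields it as a plain dict so Scrapy can export it; the spider name and selectors are illustrative, not taken from any real page:

import scrapy

class BlogSpider(scrapy.Spider):
    name = "blog"
    start_urls = ["http://stackoverflow.com"]

    def parse(self, response):
        # CSS selector: text of the page title
        title_css = response.css("title::text").get()
        # The same thing with XPath, just structured differently
        title_xpath = response.xpath("//title/text()").get()
        # A regular expression applied on top of a selector
        digits = response.css("title::text").re_first(r"\d+")

        # Yield (rather than print) so Scrapy can export the data for you
        yield {
            "title_css": title_css,
            "title_xpath": title_xpath,
            "digits": digits,
        }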

... the entirety of data mining IS extracting your content. I feel as if I'd be robbing you of your own moment, so I'll tell you what: do the tutorial from the official Scrapy docs, usually referred to as the quotes tutorial. And if you still have ANY questions about what happened in that tutorial, I'll share my class course (which I get paid for... but not for you, for you it's free) on this intro step. But man, it's basically a little knowledge of CSS, knowing how to use a web browser's inspect tools, or old-schooling it and just viewing the source. I really wish I could help more, my nerd senses are tingling, but I can't take your moment of epiphany from you... I bet someone will... but then you gain nothing, right?

PS:

As to your first question about getting the content... like, all of it? The entire HTML? The body, all of the links, or just the links that contain X? Let's say we're talking about a simple blog page: it has the article title, date, links, and images inside. I'm sure you know this; it's just that when you say "page content" you are referring to the entirety of the page. The data that you mine will only be as valuable as the format you can then express it in and, more importantly, use against other data to create an analysis... a conclusion based on data, lol. If you want just the entire HTML source then, like our friend Granito-whatchamacallhim said: response.body
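For example, the difference between "all of it" and "just some of it" looks roughly like this in practice (the "X" filter is an arbitrary, illustrative condition; on older Scrapy versions .getall() is spelled .extract()):

# the entire raw HTML source, as bytes
whole_page = response.body

# every link href on the page
all_links = response.css("a::attr(href)").getall()

# only the links whose URL contains "X"
links_with_x = [href for href in all_links if "X" in href]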
