
Aiohttp async session requests

So I've been scraping a website's (www.cardsphere.com) protected pages with requests, using a session, like so:

import requests

payload = {
    'email': <enter-email-here>,
    'password': <enter-site-password-here>
}

with requests.Session() as request:
    request.get(<site-login-page>)
    request.post(<site-login-here>, data=payload)
    request.get(<site-protected-page1>)
    # save stuff from page 1
    request.get(<site-protected-page2>)
    # save stuff from page 2
    # ...
    request.get(<site-protected-pageN>)
    # save stuff from page N

Now, since it's quite a few pages, I wanted to speed it up with aiohttp + asyncio, but I'm missing something. I've been able to more or less use it to scrape unprotected pages, like so:

import asyncio
import aiohttp

async def get_cards(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            data = await resp.text()
            <do-stuff-with-data>

urls = [
    'https://www.<url1>.com',
    'https://www.<url2>.com',
    # ...
    'https://www.<urlN>.com'
]

loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(
        *(get_cards(url) for url in urls)
    )
)
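As an aside, on Python 3.7+ the `get_event_loop()`/`run_until_complete()` boilerplate above can be replaced with a single `asyncio.run()` call. A minimal self-contained sketch of the same fan-out pattern, with the actual aiohttp request stubbed out so it runs without any network access (the URLs are stand-ins):

```python
import asyncio

async def get_cards(url):
    # Stand-in for the real aiohttp request: just pretend we fetched the page.
    await asyncio.sleep(0)
    return f"data from {url}"

async def main():
    urls = ['https://example.com/1', 'https://example.com/2']
    # gather runs the coroutines concurrently and returns results
    # in the same order the coroutines were passed in.
    return await asyncio.gather(*(get_cards(url) for url in urls))

# One call creates the event loop, runs main() to completion, and closes the loop.
results = asyncio.run(main())
```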

That gave some results, but how do I do it for pages that require login? I tried adding session.post(<login-url>, data=payload) inside the async function, but that obviously didn't work out well; it would just keep logging in. Is there a way to "set up" an aiohttp ClientSession before the gather call? I need to log in first and then, on the same session, get data from a bunch of protected links with asyncio + aiohttp.

I'm still rather new to Python, and even newer to async, so I'm missing some key concept here. If anybody could point me in the right direction I'd greatly appreciate it.

This is the simplest I can come up with. Depending on what you do in <do-stuff-with-data> you may run into some other troubles regarding concurrency; down the rabbit hole you go... just kidding. It's a little more complicated to wrap your head around coroutines, futures, and tasks, but once you get it, it's as simple as sequential programming:

import asyncio
import aiohttp


async def get_cards(url, session, sem):
    # The semaphore caps how many requests run at once; the shared
    # session carries the login cookies into every request.
    async with sem, session.get(url) as resp:
        data = await resp.text()
        # <do-stuff-with-data>


urls = [
    'https://www.<url1>.com',
    'https://www.<url2>.com',
    'https://www.<urlN>.com'
]


async def main():
    sem = asyncio.Semaphore(100)  # at most 100 requests in flight
    async with aiohttp.ClientSession() as session:
        # Log in once; the session's cookie jar keeps the auth cookies
        # for all subsequent requests made through this session.
        await session.get('auth_url')
        await session.post('auth_url', data={'user': None, 'pass': None})
        tasks = [asyncio.create_task(get_cards(url, session, sem)) for url in urls]
        results = await asyncio.gather(*tasks)
        return results


asyncio.run(main())
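To see what the Semaphore actually buys you, here is a network-free sketch (the sleep is a stand-in for `session.get`) that counts how many tasks hold the semaphore at the same time; even with ten tasks launched at once, the peak never exceeds the semaphore's limit:

```python
import asyncio

async def worker(sem, state):
    async with sem:
        # Track how many workers are inside the semaphore right now.
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.01)  # stand-in for session.get(...)
        state['active'] -= 1

async def main():
    sem = asyncio.Semaphore(3)  # allow at most 3 concurrent "requests"
    state = {'active': 0, 'peak': 0}
    await asyncio.gather(*(worker(sem, state) for _ in range(10)))
    return state['peak']

peak = asyncio.run(main())  # peak concurrency, capped by the semaphore
```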
