
How to make Web Scraping faster?

I wrote this code to extract lyrics from a website, given the artist and the song name.

The code works; the problem is that I have a DataFrame (named years_1920_2020) with 10,000 songs, and retrieving all of their lyrics takes about an hour and a half.

Is there a way to do this faster?

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd

def url_lyric(music, artist):
    # Build the song page URL from the artist and track name
    url_list = ("https://www.letras.mus.br/", str(artist), "/", str(music), "/")
    url = ''.join(url_list)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        webpage = urlopen(req).read()
        bs = BeautifulSoup(webpage, 'html.parser')
        lines = bs.find('div', {'class': 'cnt-letra p402_premium'})
        final_lines = lines.find_all('p')
        return final_lines
    except Exception:
        return 0


final_lyric_series = pd.Series(name="lyrics")

for year in range(1920, 2021):
    lyrics_serie = lyrics_from_year(year)
    final_lyric_series = pd.concat([final_lyric_series, lyrics_serie])
    print(year)

The function lyrics_from_year(year) uses url_lyric, does some work with the re module, and returns a pd.Series containing all the lyrics of the chosen year.
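The body of lyrics_from_year is not shown in the question, so the following is only a minimal sketch of the shape it is assumed to have: a DataFrame named years_1920_2020 with 'year', 'music' and 'artist' columns (the column names are assumptions), filtered by year, with url_lyric called once per row.

import pandas as pd

def lyrics_from_year(year):
    # Assumed layout: years_1920_2020 has 'year', 'music' and 'artist' columns
    subset = years_1920_2020[years_1920_2020['year'] == year]
    lyrics = [url_lyric(row['music'], row['artist']) for _, row in subset.iterrows()]
    return pd.Series(lyrics, name="lyrics")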

We can get a solution using Python's asyncio module. Please refer to this post; it is not an exact solution, but it is similar to your problem.

import asyncio
from concurrent.futures import ThreadPoolExecutor
import pandas as pd


def url_lyric(music, artist):
    # Placeholder for the scraping function from the question
    pass


def lyrics_from_year(year):
    # Placeholder for the function that returns a pd.Series of lyrics for one year
    music = None
    artist = None
    return url_lyric(music, artist)


async def get_work_done():
    with ThreadPoolExecutor(max_workers=10) as executor:
        loop = asyncio.get_event_loop()
        tasks = [
            loop.run_in_executor(
                executor,
                lyrics_from_year,
                year  # Allows us to pass in arguments to `lyrics_from_year`
            )
            for year in range(1920, 2021)
        ]
        # Await inside the `with` block so the executor is not shut down
        # (blocking the event loop) before the results are gathered
        return await asyncio.gather(*tasks)


loop = asyncio.get_event_loop()
future = asyncio.ensure_future(get_work_done())
results = loop.run_until_complete(future)

final_lyric_series = pd.Series(name="lyrics")

for result in results:
    final_lyric_series = pd.concat([final_lyric_series, result])
    print(result)

Here is a simple example of how you could do it:

import aiohttp
import asyncio
import requests, bs4

async def main():
    async with aiohttp.ClientSession() as session:
        # Collect the song URLs from the artist's "most played" page
        listing = bs4.BeautifulSoup(
            requests.get('https://www.letras.mus.br/adele/mais-tocadas.html').content,
            'html.parser'
        )
        urls = [f"https://www.letras.mus.br{x['href']}"
                for x in listing.find_all('a', {'class': 'song-name'})]

        # Fetch each lyrics page and print the cleaned-up text
        for url in urls:
            async with session.get(url) as r:
                lyrics = bs4.BeautifulSoup(await r.text(), 'html.parser').find('div', {'class': 'cnt-letra'}).text
                print('\n'.join(x.strip() for x in lyrics.strip().split('\n')))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
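Note that the loop above still awaits each page one at a time. A possible further speed-up, sketched below under the assumption that every song page carries the same cnt-letra markup, is to issue all the requests concurrently with asyncio.gather; the fetch_lyrics helper is not part of the original answer and is only illustrative.

import aiohttp
import asyncio
import bs4

async def fetch_lyrics(session, url):
    # Hypothetical helper: download one page and extract the lyrics text
    async with session.get(url) as r:
        soup = bs4.BeautifulSoup(await r.text(), 'html.parser')
        block = soup.find('div', {'class': 'cnt-letra'})
        return block.text.strip() if block else None

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        # Schedule every download at once instead of awaiting them one by one
        return await asyncio.gather(*(fetch_lyrics(session, url) for url in urls))

# Example usage, with `urls` built as in the snippet above:
# all_lyrics = asyncio.get_event_loop().run_until_complete(fetch_all(urls))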
