
What would be the best way to scrape this website? (Not Selenium)

Before I begin: the TL;DR is at the bottom.

So I'm trying to scrape https://rarbgmirror.com/ for torrent magnet links and torrent titles based on user-inputted searches. I've already figured out how to do this using BeautifulSoup and Requests with this code:

from bs4 import BeautifulSoup
import requests
import re

query = input("Input a search: ")
link = 'https://rarbgmirror.com/torrents.php?search=' + query

magnets = []
titles = []
try:
    request = requests.get(link)
except requests.RequestException as e:
    raise SystemExit(f"ERROR: {e}")
source = request.text
soup = BeautifulSoup(source, 'lxml')
for page_link in soup.find_all('a', attrs={'href': re.compile("^/torrent/")}):
    # the detail pages live on the same site as the search page
    page_link = 'https://rarbgmirror.com' + page_link.get('href')
    try:
        page_request = requests.get(page_link)
    except requests.RequestException as e:
        print(f"ERROR: {e}")
        continue

    page_source = page_request.content
    page_soup = BeautifulSoup(page_source, 'lxml')
    magnet = page_soup.find('a', attrs={'href': re.compile("^magnet")})
    if magnet is not None:
        magnets.append(magnet.get('href'))
    title = page_soup.find('h1')
    if title is not None:
        titles.append(title.get_text(strip=True))

print(titles)
print(magnets)

I am almost certain this code has no errors in it, because it was originally made for https://1377x.to for the same purpose, and if you look through the HTML structure of both websites, they use the same tags for magnet links and title names. But if the code is faulty, please point that out to me!

After some research I found the issue to be that https://rarbgmirror.com/ uses JavaScript to dynamically load its pages. After some more research I found that Selenium is recommended for this purpose. But after some time using Selenium, I found some downsides to it, such as:

  • The slow speed of scraping
  • The system the app runs on must have a Selenium-driven browser installed (I'm planning to package the app with pyinstaller, which would make this an issue)

So I'm asking for an alternative to Selenium for scraping dynamically loaded web pages.
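To see why plain requests + BeautifulSoup come back empty on a JavaScript-rendered site, here is a toy sketch (the HTML shell below is invented, not the site's real markup): the initial response contains no torrent links at all, because they are only injected later by a client-side script.

```python
from bs4 import BeautifulSoup
import re

# Invented example: the initial HTML of a JS-rendered page is an empty
# shell; the result rows are filled in later by client-side JavaScript.
shell_html = """
<html><body>
  <div id="results"></div>
  <script src="/app.js"></script>
</body></html>
"""
soup = BeautifulSoup(shell_html, 'html.parser')
links = soup.find_all('a', attrs={'href': re.compile('^/torrent/')})
print(len(links))  # 0 -- there is nothing for requests/bs4 to find
```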

TL;DR: I want an alternative to Selenium for scraping a website that is dynamically loaded with JavaScript.

PS: GitHub repo: https://github.com/eliasbenb/MagnetMagnet

If you are only using Chrome, you can check out Puppeteer by Google. It is fast and integrates quite well with Chrome DevTools.

WORKING SOLUTION. DISCLAIMER FOR PEOPLE LOOKING FOR AN ANSWER: this method will NOT work for any website other than RARBG.

I posted this same question to reddit's r/learnpython, and someone there found a great answer which met all my requirements. You can find the original comment here.

What he found out was that rarbg gets its info from here.

You can change what is searched by changing "QUERY" in the link. That page contains all the information for each torrent, so using requests I pulled all the information out of it.
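If the link is built from arbitrary user input, it is safer to URL-encode the query. This is a sketch using only the standard library; the parameter values are copied from the answer's link, and "breaking bad" is a stand-in for the user's query.

```python
from urllib.parse import urlencode

# Parameter values taken from the answer's link; spaces and special
# characters in the user's query get percent-encoded safely.
params = {
    'mode': 'search',
    'search_string': 'breaking bad',  # stand-in for the user's query
    'token': 'lnjzy73ucv',
    'format': 'json_extended',
    'app_id': 'lol',
}
url = 'https://torrentapi.org/pubapi_v2.php?' + urlencode(params)
print(url)
```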

Here is the working code:

import requests

query = input("Input a search: ")
rarbg_link = ('https://torrentapi.org/pubapi_v2.php?mode=search&search_string='
              + query + '&token=lnjzy73ucv&format=json_extended&app_id=lol')
try:
    request = requests.get(rarbg_link, headers={'User-Agent': 'Mozilla/5.0'})
except requests.RequestException as e:
    raise SystemExit(f"ERROR: {e}")

# The endpoint returns JSON, so parse it directly instead of
# string-slicing the page text
titles = []
magnets = []
for result in request.json().get('torrent_results', []):
    titles.append(result['title'])
    magnets.append(result['download'])
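For reference, the json_extended response has roughly this shape. The sample payload below is invented for illustration; only the `torrent_results`, `title`, and `download` key names come from the scraping code above.

```python
import json

# Invented sample payload; key names match those used in the scraping code.
sample = '''{"torrent_results": [
  {"title": "Example.Torrent.1080p", "download": "magnet:?xt=urn:btih:aaa"},
  {"title": "Another.Example.720p", "download": "magnet:?xt=urn:btih:bbb"}
]}'''

data = json.loads(sample)
titles = [t['title'] for t in data['torrent_results']]
magnets = [t['download'] for t in data['torrent_results']]
print(titles)   # ['Example.Torrent.1080p', 'Another.Example.720p']
print(magnets)  # ['magnet:?xt=urn:btih:aaa', 'magnet:?xt=urn:btih:bbb']
```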
