What would be the best way to scrape this website? (Not Selenium)
Before I begin: the TLDR is at the bottom.
So I'm trying to scrape https://rarbgmirror.com/ for torrent magnet links and torrent title names based on user-inputted searches. I've already figured out how to do this using BeautifulSoup and Requests with this code:
from bs4 import BeautifulSoup
import requests
import re

query = input("Input a search: ")
link = 'https://rarbgmirror.com/torrents.php?search=' + query
magnets = []
titles = []

try:
    request = requests.get(link)
except requests.RequestException:
    print("ERROR")
    raise SystemExit

source = request.text
soup = BeautifulSoup(source, 'lxml')

# Follow every result link on the search page...
for page_link in soup.find_all('a', attrs={'href': re.compile("^/torrent/")}):
    page_link = 'https://rarbgmirror.com' + page_link.get('href')
    try:
        page_request = requests.get(page_link)
    except requests.RequestException:
        print("ERROR")
        continue
    page_source = page_request.content
    page_soup = BeautifulSoup(page_source, 'lxml')
    # ...and pull the magnet link and title text from each torrent page
    link = page_soup.find('a', attrs={'href': re.compile("^magnet")})
    magnets.append(link.get('href'))
    title = page_soup.find('h1')
    titles.append(title.get_text())

print(titles)
print(magnets)
I'm almost certain this code has no error in it, because it was originally written for https://1377x.to for the same purpose, and if you look through the HTML structure of both websites, they use the same tags for magnet links and title names. But if the code is faulty, please point that out!
After some research I found the issue: https://rarbgmirror.com/ uses JavaScript to load its pages dynamically. Further research suggested Selenium for this purpose. After spending some time with Selenium, though, I found it has some real downsides.
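To illustrate why the requests + BeautifulSoup approach comes back empty on a JavaScript-rendered page, here is a minimal sketch. The markup below is hypothetical (the real rarbgmirror response is much larger), but it shows the pattern: the server returns a shell document and the result rows are only injected later by a script, so the static HTML contains none of the anchors the scraper looks for.

```python
from bs4 import BeautifulSoup
import re

# Hypothetical stand-in for a JS-rendered page: an empty container
# plus a script that would fill it in at render time.
js_shell = """
<html><body>
  <div id="results"></div>
  <script>loadResults('/torrents.php?search=ubuntu');</script>
</body></html>
"""

soup = BeautifulSoup(js_shell, 'html.parser')
links = soup.find_all('a', attrs={'href': re.compile('^/torrent/')})
print(len(links))  # 0 -- there is nothing in the raw HTML for bs4 to find
```

Comparing the raw `request.text` against what the browser's inspector shows is a quick way to confirm a page is rendered client-side.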
So I'm asking for an alternative to Selenium for scraping dynamically loaded web pages.
TLDR: I want an alternative to Selenium for scraping a website that is dynamically loaded with JavaScript.
PS: GitHub Repo: https://github.com/eliasbenb/MagnetMagnet
WORKING SOLUTION. DISCLAIMER FOR PEOPLE LOOKING FOR AN ANSWER: this method WILL NOT work for any website other than RARBG.
I posted this same question to Reddit's r/learnpython, and someone there found a great answer which met all my requirements. You can find the original comment here.
What he found out was that rarbg gets its info from here.
You can change what is searched by changing "QUERY" in the link. That page contains all the information for each torrent, so using requests and bs4 I scraped all of it.
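Because the endpoint is called with `format=json_extended`, the response body is plain JSON, which means it can be parsed with the `json` module instead of treating it as HTML. The sample payload below is made up, but it follows the two field names the scraper relies on (`title` and `download`):

```python
import json

# Hypothetical sample of the json_extended response shape; only the
# two fields used by the scraper are shown.
sample = '''{"torrent_results": [
  {"title": "Ubuntu 20.04 ISO", "download": "magnet:?xt=urn:btih:aaa"},
  {"title": "Ubuntu 18.04 ISO", "download": "magnet:?xt=urn:btih:bbb"}
]}'''

data = json.loads(sample)
titles = [t['title'] for t in data['torrent_results']]
magnets = [t['download'] for t in data['torrent_results']]
print(titles)   # ['Ubuntu 20.04 ISO', 'Ubuntu 18.04 ISO']
print(magnets)  # ['magnet:?xt=urn:btih:aaa', 'magnet:?xt=urn:btih:bbb']
```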
Here is the working code:
import requests

query = input("Input a search: ")
rarbg_link = ('https://torrentapi.org/pubapi_v2.php?mode=search&search_string='
              + query + '&token=lnjzy73ucv&format=json_extended&app_id=lol')

titles = []
magnets = []

try:
    request = requests.get(rarbg_link, headers={'User-Agent': 'Mozilla/5.0'})
except requests.RequestException:
    print("ERROR")
    raise SystemExit

# The endpoint returns JSON, so parse it directly instead of
# string-splitting the response text.
for result in request.json().get('torrent_results', []):
    titles.append(result['title'])
    magnets.append(result['download'])

print(titles)
print(magnets)
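One caveat with the code above: the token is hard-coded into the URL, and torrentapi tokens expire after a short time (around 15 minutes, if I recall the public API docs correctly), so the script will eventually start getting empty or error responses. A hedged sketch of refreshing the token, assuming the `get_token` parameter works as documented:

```python
import requests

API = 'https://torrentapi.org/pubapi_v2.php'

def token_params(app_id):
    # Query parameters for a token request, per torrentapi's public API
    # convention (an assumption worth verifying against current docs).
    return {'get_token': 'get_token', 'app_id': app_id}

def fetch_token(app_id):
    # Request a fresh token; the response is JSON like {"token": "..."}.
    resp = requests.get(API, params=token_params(app_id),
                        headers={'User-Agent': 'Mozilla/5.0'})
    return resp.json()['token']
```

The fetched token would then replace the hard-coded `token=...` value in the search URL. Note that torrentapi also rate-limits requests, so back-to-back calls may need a short `time.sleep` between them.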