
What would be the best way to scrape this website? (Not Selenium)

Before I begin: the TLDR is at the bottom.

So I'm trying to scrape https://rarbgmirror.com/ for torrent magnet links and for their torrent title names based on user inputted searches. I've already figured out how to do this using BeautifulSoup and Requests through this code:

from bs4 import BeautifulSoup
import requests
import re

query = input("Input a search: ")
link = 'https://rarbgmirror.com/torrents.php?search=' + query

magnets = []
titles = []
try:
    request = requests.get(link)
except requests.RequestException:
    raise SystemExit("ERROR: could not reach the search page")
source = request.text
soup = BeautifulSoup(source, 'lxml')
for page_link in soup.find_all('a', attrs={'href': re.compile("^/torrent/")}):
    # Join the relative href to the site actually being scraped -- the
    # original code still prefixed 'https://www.1377x.to/', left over
    # from the 1337x version
    page_link = 'https://rarbgmirror.com' + page_link.get('href')
    try:
        page_request = requests.get(page_link)
    except requests.RequestException:
        continue

    page_source = page_request.content
    page_soup = BeautifulSoup(page_source, 'lxml')
    magnet = page_soup.find('a', attrs={'href': re.compile("^magnet")})
    if magnet is not None:
        magnets.append(magnet.get('href'))
    title = page_soup.find('h1')
    if title is not None:
        titles.append(title.get_text(strip=True))

print(titles)
print(magnets)
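As a side note, hardcoding the base URL when joining scraped hrefs is easy to get wrong (the original snippet still pointed at the other site). One way to avoid that is `urllib.parse.urljoin`, which resolves a relative href against the page it was scraped from. A minimal sketch, with a made-up torrent path:

```python
from urllib.parse import urljoin

# The page the href was scraped from
base = 'https://rarbgmirror.com/torrents.php?search=test'

# urljoin resolves the relative href against the base page's origin,
# so the domain never has to be repeated by hand
print(urljoin(base, '/torrent/abc123'))
# https://rarbgmirror.com/torrent/abc123
```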

I am almost certain that this code has no error in it because the code was originally made for https://1377x.to for the same purpose, and if you look through the HTML structure of both websites, they use the same tags for magnet links and title names. But if the code is faulty please point that out to me!

After some research I found that https://rarbgmirror.com/ uses JavaScript to load its pages dynamically, which is why plain requests + BeautifulSoup doesn't see the content. After some more research I found that Selenium is usually recommended for this. Well, after some time using Selenium I found some cons to it, such as:

  • The slow speed of scraping
  • The system the app runs on must have a browser and the matching Selenium WebDriver installed (I'm planning on using pyinstaller to pack the app, which would make this an issue)

So I'm asking for an alternative to Selenium for scraping dynamically loaded web pages.

TLDR : I want an alternative to selenium to scrape a website which is dynamically loaded using JavaScript.

PS: GitHub Repo : https://github.com/eliasbenb/MagnetMagnet

If you only need Chrome, you can check out Puppeteer by Google. It is fast and integrates well with Chrome DevTools.

WORKING SOLUTION DISCLAIMER FOR PEOPLE LOOKING FOR AN ANSWER: this method WILL NOT work for any website other than RARBG

I posted this same question to reddit's r/learnpython, and someone there found a great answer which met all my requirements. You can find the original comment here

What he found out was that rarbg gets its info from here

You can change what is searched by changing "QUERY" in the link. That page contains all the information for each torrent, so using requests and bs4 I scraped all of it.
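If you build that link yourself, it is worth percent-encoding the user's query so searches containing spaces or special characters don't produce a broken URL. A sketch using only the standard library (the token value below is just the placeholder from the link above, and torrentapi tokens are short-lived):

```python
from urllib.parse import urlencode

# Build the torrentapi search URL with the query percent-encoded
params = {
    'mode': 'search',
    'search_string': 'ubuntu 20.04',   # a query with a space, for illustration
    'token': 'lnjzy73ucv',             # placeholder token from the original link
    'format': 'json_extended',
    'app_id': 'lol',
}
url = 'https://torrentapi.org/pubapi_v2.php?' + urlencode(params)
print(url)
```

`urlencode` turns the space into `+` (`search_string=ubuntu+20.04`), which plain string concatenation would not do.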

Here is the working code:

import requests

query = input("Input a search: ")
rarbg_link = ('https://torrentapi.org/pubapi_v2.php?mode=search&search_string='
              + query + '&token=lnjzy73ucv&format=json_extended&app_id=lol')

titles = []
magnets = []
try:
    request = requests.get(rarbg_link, headers={'User-Agent': 'Mozilla/5.0'})
except requests.RequestException:
    raise SystemExit("ERROR: could not reach torrentapi")

# The endpoint returns JSON (format=json_extended), so the fields can be
# read directly instead of string-splitting the BeautifulSoup output.
# Note: torrentapi tokens are short-lived, so a stale token will return
# an error object instead of 'torrent_results'.
data = request.json()
for result in data.get('torrent_results', []):
    titles.append(result['title'])
    magnets.append(result['download'])
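Since the response is JSON, the titles and download links can be pulled out with ordinary dict access. An offline sketch with an illustrative, made-up payload shaped like the `torrent_results` array above:

```python
import json

# Made-up payload in the shape of torrentapi's json_extended response;
# the titles and magnet hashes here are illustrative only
sample = json.loads('''
{"torrent_results": [
  {"title": "Example.Torrent.1080p", "download": "magnet:?xt=urn:btih:AAAA"},
  {"title": "Another.Example.720p",  "download": "magnet:?xt=urn:btih:BBBB"}
]}
''')

titles = [r['title'] for r in sample['torrent_results']]
magnets = [r['download'] for r in sample['torrent_results']]
print(titles)
print(magnets)
```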
