Python - Unable to scrape data from a specific website (CourseHero)

Question

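I recently wanted to start a bot project that uses Python to interact with a website (https://www.coursehero.com/). However, the site appears to have a strong anti-bot security system that returns bogus HTML when I try to log in, so I think the only viable approach is to log in through a simulated human interface.

I'm in a study group, and the whole idea of the bot is to let the participants download document files through my premium account on that site without giving them my login details.

I'm already quite proficient in Python, but I'm new to web scraping.

Any help?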

Answer 1

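I'm not sure what you want to do on the site, but I currently scrape websites (including Google, Wikipedia, Quora...) through a proxy provider such as ScraperAPI.

Pros:

Thousands of proxies around the world.
High success rate.
The free plan includes 1,000 requests per month.

Cons:

Requests take quite a long time.
Some requests may fail and need to be retried a few times.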
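
The last point, retrying requests that fail, can be sketched with a small helper. This is a minimal illustration, not part of any provider's client: `retry_call` and its parameters are hypothetical names, and the exponential backoff with jitter is one common choice among several.

```python
import random
import time


def retry_call(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Waits roughly base_delay * 2**n seconds (plus up to 10% jitter)
    between tries. `sleep` is injectable so the helper can be tested
    without actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # narrow this to your client's error types
            last_error = error
            if attempt < attempts - 1:
                delay = base_delay * (2 ** attempt)
                sleep(delay + random.uniform(0, delay * 0.1))
    # Every attempt failed: surface the last error to the caller
    raise last_error
```

With `requests`, the wrapped function would typically call `raise_for_status()` on the response, since an HTTP error status does not raise an exception by itself.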

Answer 2

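The Course Hero front end sends a POST request to https://www.coursehero.com/api/v2/search and renders the search results with JavaScript.

Just fetch the JSON with an HTTP request. A complete example follows. I don't have a paid account, so the last part of the code is commented out because it is pseudocode.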

import requests

headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.3987.78 Safari/537.36'
}

data = {
    "client": "web",
    "query": "scrape",
    "view": "list_w",
    "filters": {
        "type": ["document"],
        "doc_type": [],
    },
    "sort": "relevancy",
    "limit": 20,
    "offset": 0,
    "callout_types": ["textbook"]
}

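response = requests.post(
    'https://www.coursehero.com/api/v2/search/', headers=headers, json=data,
    timeout=30)
response.raise_for_status()  # stop early on an HTTP error status

# Keep the parsed response separate from the request payload in `data`
search_results = response.json()

for result in search_results['results']:
    url = f"https://www.coursehero.com/file/{result['document']['db_filename']}"
    print(f"'{result['core']['title']}' URL: {url}")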

    # Log in and extract the download URL from the HTML
    # (needs `from bs4 import BeautifulSoup` at the top of the file)
    #
    # response = requests.get(url, headers=headers)
    # soup = BeautifulSoup(response.content, 'lxml')
    # download_url = soup.select('...')
    #
    # OR
    #
    # Download file via direct HTTP request if URL is returned via XHR request
    #
    # download_url = 'https://www.coursehero.com/...'
    # requests.get(download_url, headers=headers)

Output

'Week 6 - Web Scraping.pptx' URL: https://www.coursehero.com/file/38748386
'Python web_scraping train.docx' URL: https://www.coursehero.com/file/70193727
'ScrAPES Book' URL: https://www.coursehero.com/file/6219095
'scrape.py' URL: https://www.coursehero.com/file/43396377
'scrAPES - Rain didn't Boost Lakes' URL: https://www.coursehero.com/file/10042922
'orders cannot scrape.docx' URL: https://www.coursehero.com/file/75016027

...
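
The loop above indexes straight into each result, so one entry that lacks `document` or `core` would raise a `KeyError`. A more defensive version of the same extraction might look like the sketch below. It assumes only the response fields the answer already reads (`results`, `core.title`, `document.db_filename`); `extract_documents` and the sample payload are hypothetical and exist only to make the example runnable offline.

```python
FILE_URL_PREFIX = "https://www.coursehero.com/file/"


def extract_documents(payload):
    """Return (title, url) pairs from a parsed search response.

    Entries without a title or a db_filename are skipped instead of
    raising KeyError.
    """
    documents = []
    for result in payload.get("results", []):
        title = (result.get("core") or {}).get("title")
        filename = (result.get("document") or {}).get("db_filename")
        if title is None or filename is None:
            continue  # not a document entry, or fields are missing
        documents.append((title, f"{FILE_URL_PREFIX}{filename}"))
    return documents


# Hand-written sample shaped like the fields used above (not real API output)
sample = {
    "results": [
        {"core": {"title": "scrape.py"}, "document": {"db_filename": 43396377}},
        {"core": {"title": "Entry without a document"}},
    ]
}

for title, url in extract_documents(sample):
    print(f"'{title}' URL: {url}")
```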

Answer 1
0 2021-01-25 04:26:07

Answer 2
0 Accepted 2021-01-25 14:34:19
