Scraping a website does not return the correct source code

Question

I am trying to scrape a Quizlet match set with Python. I want to scrape all the <span> tags with the class TermText.

This is the URL: 'https://quizlet.com/291523268'

import requests

URL = 'https://quizlet.com/291523268'
raw = requests.get(URL).text

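raw ends up containing none of the tags or cards at all. When I inspect the page source in the browser, it shows all the TermText spans I need, which means they are not loaded by JavaScript. So I don't understand why the HTML I get back is wrong: it doesn't contain any of the markup I need.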

Answer 1

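To get the correct response from the server, send a browser-like User-Agent HTTP header. Without one, requests identifies itself with its default python-requests User-Agent, and the server may answer with a different page than the one a browser receives: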

import requests
from bs4 import BeautifulSoup


url = 'https://quizlet.com/291523268/python-flash-cards/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for span in soup.select('span.TermText'):
    print(span.get_text(strip=True))

Prints:

algorithm
A set of specific steps for solving a category of problems
token
basic elements of a language(letters, numbers, symbols)
high-level language
A programming language like Python that is designed to be easy for humans to read and write.
low-level langauge

...and so on.
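
If requests and BeautifulSoup are not available, the same idea can be sketched with the standard library only. This is a minimal sketch, not the method used above: the names BROWSER_UA, TermTextParser, fetch_html and extract_terms are hypothetical helpers introduced for illustration, and it assumes the cards still sit in <span> elements whose class list contains TermText.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Assumption: any common browser User-Agent string works; this one is
# copied from the answer above.
BROWSER_UA = ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) '
              'Gecko/20100101 Firefox/79.0')


class TermTextParser(HTMLParser):
    """Collect the text of every <span> whose class list contains 'TermText'."""

    def __init__(self):
        super().__init__()
        self.terms = []    # finished span texts
        self._depth = 0    # > 0 while inside a matching span
        self._chunks = []  # text pieces of the span being read

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        classes = (dict(attrs).get('class') or '').split()
        if self._depth:
            # A nested span inside a matching one: track it so the
            # matching span is only closed by its own end tag.
            self._depth += 1
        elif 'TermText' in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == 'span' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.terms.append(''.join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def fetch_html(url, user_agent=BROWSER_UA, timeout=10):
    """Download a page while sending an explicit User-Agent header."""
    request = Request(url, headers={'User-Agent': user_agent})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


def extract_terms(html):
    """Return the stripped text of all span.TermText elements in html."""
    parser = TermTextParser()
    parser.feed(html)
    parser.close()
    return parser.terms


# Offline demonstration on a small hand-written snippet (no network needed).
sample = ('<div><span class="TermText notranslate">algorithm</span>'
          '<span class="other">ignored</span></div>')
print(extract_terms(sample))  # ['algorithm']
```

Against the live page, the call would look like extract_terms(fetch_html('https://quizlet.com/291523268/python-flash-cards/')). Whether the server accepts the request still depends on the headers it checks, so an empty list is a hint to inspect the returned HTML before blaming the parser.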

Solution 1
2 Accepted 2020-07-30 22:40:29
