How to extract data from HTML

Question

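I'm trying to scrape a website using beautifulsoup4 and Python. However, when I look at the content returned for the URL, it only gives me the header section, not the body section that I want to use.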

import requests

URL = "url"
URL_page = requests.get(URL)
print(URL_page.text)

This gives me

<!DOCTYPE html>
<html>
 <head>
"Contents of Header"
 </head>
  <body>
   <div id='root'></div>
  </body>
</html>

There should be content inside the body tag, but it shows nothing. The page's HTML as shown in the browser looks like this:

<html xmlns:wb="http://open.weibo.com/wb" style> 
 ▶<head...</head>                     ← ONLY GIVES ME THIS
 ▶<body data-loaded="true">...</body> ← I NEED THIS PART
</html>

Answer 1

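Without a working URL it's hard to give a definitive answer, but your question does offer some clues.

For one thing, you say you received this in the response to the GET: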

<body>

But then you see this in the web browser:

<body data-loaded="true">

This suggests that the page runs JavaScript code that continues to load and build the page after the initial page load.
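One quick way to confirm this from Python is to check whether the body of the raw response contains any text at all. The following is a minimal sketch, not a definitive test: `looks_client_rendered` and `RAW_HTML` are hypothetical names, the sample HTML mirrors the response shown in the question (with a placeholder title standing in for the header contents), and an empty body is only a heuristic for script-built pages.

```python
from bs4 import BeautifulSoup

# Stand-in for URL_page.text; mirrors the response shown in the question.
RAW_HTML = """<!DOCTYPE html>
<html>
 <head><title>Example</title></head>
  <body>
   <div id='root'></div>
  </body>
</html>"""


def looks_client_rendered(html):
    """Return True when <body> has no visible text (heuristic only)."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body
    if body is None:
        # No <body> at all: nothing usable in the static HTML.
        return True
    # get_text(strip=True) drops whitespace-only strings, so an empty
    # mount point such as <div id='root'></div> yields "".
    return body.get_text(strip=True) == ""


print(looks_client_rendered(RAW_HTML))
```

If this reports an empty body while the browser shows content, the content is most likely being added by a script after the initial load.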

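There's no way to get around that with requests or bs4 alone, since neither executes JavaScript. What you can do is find out which request made after the initial page load carries the actual content (it could be another piece of HTML, some JSON, etc.) and call that request directly instead. To try this, open the developer tools in a good browser and watch the Network tab while the page loads; you'll see every request, and one of them probably contains the content you're after.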

But if you need the html after rendering, as rendered by the script, you can try using a JavaScript capable browser from Python, like Chrome driven through the Selenium Chrome webdriver:

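from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://your.url/here")
# find_element_by_tag_name was removed in Selenium 4; use find_element with a By locator
elem = driver.find_element(By.TAG_NAME, 'body')
print(elem.text)
driver.quit()  # close the browser when done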

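Note that you need to install Selenium and obtain a copy of the matching driver, such as chromedriver.exe. To add it to your virtual environment:

Install Selenium: pip install selenium
Download the appropriate browser driver, e.g. ChromeDriver, from here: https://sites.google.com/a/chromium.org/chromedriver/home
(put the executable in the Scripts folder)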

Answer 2

I think you should send a 'User-Agent' header. You can try this:

from bs4 import BeautifulSoup
import requests

headers =  {'User-Agent': 'Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0 '}
url = "https://www.pixiv.net/en/users/14792128"
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
print(soup.prettify())

Answer 3

I'm not sure exactly what you want as output, but you can access the JSON response from the page's AJAX request:

import pandas as pd
import requests

url='https://www.pixiv.net/ajax/user/14792128/profile/all?lang=en'

jsonData = requests.get(url).json()
data = jsonData['body']['mangaSeries']

df = pd.DataFrame(data)
