Table not scraped correctly with Python BeautifulSoup

Question

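I have the following code, which tries to scrape the main table on this page. I need the NORAD ID and the launch date, which are in columns 2 and 4. However, I cannot get BeautifulSoup to find the table by its ID.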

import requests
from bs4 import BeautifulSoup

data = []

URL = 'https://www.n2yo.com/satellites/?c=52&srt=2&dir=1'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find("table", id="categoriestab")
rows = table.find_all('tr')

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

print(data)

Answer 1

To get the NORAD ID and Launch date, you can try:

import pandas as pd

url = "https://www.n2yo.com/satellites/?c=52&srt=2&dir=0"
df = pd.read_html(url)

data = df[2].drop(["Name", "Int'l Code", "Period[minutes]", "Action"], axis=1)
print(data)

The output will be: (output not shown)
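As a variation on the snippet above, the two wanted columns can be selected by name instead of dropping the other four. This is only a sketch: the small DataFrame is made-up data standing in for df[2], and the column names "NORAD ID" and "Launch date" are assumed from the wording of this answer rather than checked against the live page.

```python
import pandas as pd

# Hypothetical rows standing in for the table returned by pd.read_html(url)[2].
# The column names are assumptions based on the names used in this answer.
table = pd.DataFrame({
    "Name": ["SAT-A", "SAT-B"],
    "NORAD ID": [11111, 22222],
    "Int'l Code": ["2020-001A", "2020-002B"],
    "Launch date": ["January 1, 2020", "February 2, 2020"],
    "Period[minutes]": [95.0, 96.1],
    "Action": ["", ""],
})

# Keep only the columns of interest; a missing name raises KeyError,
# which makes a change in the page layout easy to notice.
wanted = ["NORAD ID", "Launch date"]
data = table[wanted]
print(data)
```

Selecting by name keeps working if the site adds another column, whereas drop() would leave the new column in the result.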

Answer 2

Change

soup = BeautifulSoup(page.content, 'html.parser')

to

soup = BeautifulSoup(page.content, 'lxml')
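The 'lxml' parser is a separate package, and BeautifulSoup raises FeatureNotFound when it is not installed. A minimal sketch of a fallback, assuming a hypothetical helper name make_soup and a made-up HTML string in place of page.content:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else with the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; fall back to the standard-library parser.
        return BeautifulSoup(markup, "html.parser")


# Hypothetical stand-in for page.content.
sample = "<table id='categoriestab'><tr><td>SAT-A</td><td>11111</td></tr></table>"
soup = make_soup(sample)
table = soup.find("table", id="categoriestab")
print(table is not None)
```

The two parsers can build different trees from malformed HTML, so a lookup that fails with one may succeed with the other.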

Answer 3

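If you print the soup and search it, you will not find the id you are looking for in the output. This most likely means that the page is rendered with JavaScript. You could consider using PhantomJS or selenium. I use selenium to solve problems like this one. You need to download the Chrome driver: https://chromedriver.chromium.org/downloads . Here is the code I use.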

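from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(executable_path='YOUR PATH', options=options)  # path to chromedriver (Selenium 3 style)
driver.get('YOUR URL')
driver.implicitly_wait(1)  # implicit wait of 1 second
soup_file = BeautifulSoup(driver.page_source, 'html.parser')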

What this does is set up the driver to connect to the url, wait for it to load, get all of the page source, and put it into a BeautifulSoup object.
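Whichever way the HTML is obtained, soup.find returns None when the table is absent, so it helps to check for that before calling find_all. The following is a rough sketch, not tested against the real page: extract_columns is a hypothetical helper, the sample HTML is made up, and the indexes 1 and 3 assume that the NORAD ID and launch date are the second and fourth cells, as the question describes.

```python
from bs4 import BeautifulSoup


def extract_columns(html, table_id, col_indexes):
    """Return the chosen cells of each data row, or None if the table is absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # The id is not in this HTML, e.g. because the table is added by JavaScript.
        return None
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Skip header rows (no <td>) and rows that are too short.
        if len(cells) > max(col_indexes):
            rows.append([cells[i] for i in col_indexes])
    return rows


# Made-up HTML standing in for page.content or driver.page_source.
sample_html = """
<table id="categoriestab">
  <tr><th>Name</th><th>NORAD ID</th><th>Int'l Code</th><th>Launch date</th></tr>
  <tr><td>SAT-A</td><td>11111</td><td>2020-001A</td><td>January 1, 2020</td></tr>
  <tr><td>SAT-B</td><td>22222</td><td>2020-002B</td><td>February 2, 2020</td></tr>
</table>
"""
print(extract_columns(sample_html, "categoriestab", [1, 3]))
print(extract_columns("<p>no table here</p>", "categoriestab", [1, 3]))
```

A None result signals that the id is missing from the HTML that was parsed, which is the situation described at the start of this answer.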

Hope this helps!

3 answers

Answer 1
1 accepted 2020-06-20 05:53:24

Answer 2
0 2020-06-20 06:00:50

Answer 3
0 2020-06-20 06:05:53
