Table not scraped correctly with Python BeautifulSoup

Question

I have the following code, which tries to scrape the main table on this page. I need the NORAD ID and Launch date, which are the 2nd and 4th columns. However, I can't get BeautifulSoup to find the table by its ID.

import requests
from bs4 import BeautifulSoup

data = []

URL = 'https://www.n2yo.com/satellites/?c=52&srt=2&dir=1'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find("table", id="categoriestab")
rows = table.find_all('tr')

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

print(data)

Answer 1

To get the NORAD ID and Launch date, you can try this:

import pandas as pd

url = "https://www.n2yo.com/satellites/?c=52&srt=2&dir=0"
df = pd.read_html(url)

data = df[2].drop(["Name", "Int'l Code", "Period[minutes]", "Action"], axis=1)
print(data)

This prints the table with those four columns dropped, leaving the NORAD ID and Launch date columns.
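The hard-coded index `df[2]` depends on how many tables the page happens to contain. As a minimal sketch, one could instead pick the first parsed table that has the wanted columns and select them by name. The helper `pick_table`, the column names "NORAD ID" and "Launch date", and the placeholder frames below are assumptions for illustration; they stand in for the list that `pd.read_html(url)` returns and are not real data from the site.

```python
import pandas as pd

# Column names assumed from the question; check them against the real table header.
WANTED = ["NORAD ID", "Launch date"]


def pick_table(tables, wanted=WANTED):
    """Return the wanted columns from the first table that contains all of them."""
    for table in tables:
        if set(wanted).issubset(table.columns):
            return table[wanted]
    raise ValueError(f"no table with columns {wanted}")


# Placeholder frames standing in for the list returned by pd.read_html(url).
tables = [
    pd.DataFrame({"Menu": ["home"]}),
    pd.DataFrame({
        "Name": ["SAT-A", "SAT-B"],
        "NORAD ID": [11111, 22222],
        "Int'l Code": ["X", "Y"],
        "Launch date": ["2020-01-01", "2020-02-02"],
    }),
]

result = pick_table(tables)
print(result)
```

Selecting by name rather than dropping everything else also keeps working if the site adds another column later.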

Answer 2

Change

soup = BeautifulSoup(page.content, 'html.parser')

to

soup = BeautifulSoup(page.content, 'lxml')
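To see whether the parser choice actually matters for a given page, a rough check is to parse the same markup with each available parser and report whether the table is found. The helper name `find_table` and the inline `HTML` snippet are hypothetical; in practice you would pass `page.content` from the question. Parsers that are not installed are reported as `None` rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Tiny stand-in document; replace with page.content from the real request.
HTML = '<html><body><table id="categoriestab"><tr><td>1</td></tr></table></body></html>'


def find_table(html, table_id, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to True/False (table found or not) or None (parser missing)."""
    results = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            results[parser] = None  # this parser is not installed
            continue
        results[parser] = soup.find("table", id=table_id) is not None
    return results


results = find_table(HTML, "categoriestab")
print(results)
```

If the real page gives `False` for `html.parser` but `True` for `lxml`, the markup is probably malformed in a way the two parsers repair differently.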

Answer 3

If you print the soup and search the output, you will not find the id you are looking for. This most likely means the page is rendered with JavaScript. You can look into using PhantomJS or Selenium. I used Selenium to solve a similar problem that I ran into. You will need to download ChromeDriver: https://chromedriver.chromium.org/downloads. Here is the code that I used.

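from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()  # default options; add flags such as headless mode if needed
driver = webdriver.Chrome(executable_path=<YOUR PATH>, options=options)  # path to chromedriver (Selenium 3 style argument)
driver.get('YOUR URL')  # load the page in the browser
driver.implicitly_wait(1)  # timeout in seconds for element lookups
soup_file = BeautifulSoup(driver.page_source, 'html.parser')  # parse the rendered HTML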

What this does is set up the driver, connect to the URL, wait until the page has loaded, grab the page source, and put it into the BeautifulSoup object.
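Before reaching for a browser, the check described at the start of this answer can be done in code: look for the id in the raw HTML that `requests` returned. This is only a sketch; the helper name `table_in_raw_html` and the two sample strings are made up for illustration, and a plain substring test can miss ids written with unusual quoting or spacing.

```python
def table_in_raw_html(html: str, table_id: str) -> bool:
    """Return True if an id attribute with this value appears in the raw markup."""
    return f'id="{table_id}"' in html or f"id='{table_id}'" in html


# Sample strings standing in for page.text from requests.get(URL).
static_page = '<table id="categoriestab"><tr><td>1</td></tr></table>'
js_page = '<div id="app"></div><script>/* table built client-side */</script>'

print(table_in_raw_html(static_page, "categoriestab"))  # True
print(table_in_raw_html(js_page, "categoriestab"))      # False
```

If this returns `False` for the real response, no parser will find the table in it, and a rendering tool such as Selenium is the next thing to try. If it returns `True`, the markup is there and the parser is the more likely culprit.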

Hope this helps!


Answer 1: 1, accepted, 2020-06-20 05:53:24

Answer 2: 0, 2020-06-20 06:00:50

Answer 3: 0, 2020-06-20 06:05:53
