Table not scraping correctly with Python BeautifulSoup

I have the following code, which tries to scrape the main table on this page. I need the NORAD ID and Launch date, which are the 2nd and 4th columns. However, I can't get BeautifulSoup to find the table by its ID.

import requests
from bs4 import BeautifulSoup

data = []

URL = 'https://www.n2yo.com/satellites/?c=52&srt=2&dir=1'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find("table", id="categoriestab")
rows = table.find_all('tr')

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

print(data)

To get the NORAD ID and Launch date, you can try this:

import pandas as pd

url = "https://www.n2yo.com/satellites/?c=52&srt=2&dir=0"
df = pd.read_html(url)

data = df[2].drop(["Name", "Int'l Code", "Period[minutes]", "Action"], axis=1)
print(data)

The output should be a DataFrame containing only the NORAD ID and Launch date columns.

Change

soup = BeautifulSoup(page.content, 'html.parser')

to

soup = BeautifulSoup(page.content, 'lxml')
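
Whether a different parser helps depends on how each one handles the page's markup. A minimal sketch for checking this, assuming a hypothetical helper `table_found_by_parser` and a small inline HTML sample standing in for the real page (its markup is an assumption, not the site's actual HTML). Parsers that are not installed are reported as `None`:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Stand-in HTML for illustration; the real page's markup will differ.
SAMPLE_HTML = """
<html><body>
<table id="categoriestab">
  <tr><th>Name</th><th>NORAD ID</th></tr>
  <tr><td>SAT-A</td><td>11111</td></tr>
</table>
</body></html>
"""


def table_found_by_parser(html, table_id, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser: True/False/None}; None means the parser is not installed."""
    results = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is unavailable.
            results[parser] = None
            continue
        results[parser] = soup.find("table", id=table_id) is not None
    return results


print(table_found_by_parser(SAMPLE_HTML, "categoriestab"))
```

To try it on the real page, pass `page.content` in place of `SAMPLE_HTML`. If only some parsers find the table, they are probably repairing the markup differently; if none do, the table is likely not in the downloaded HTML at all.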

If you print the soup and search it, you will not find the id you are looking for in the output. This most likely means the page is rendered with JavaScript. You can look into using PhantomJS or Selenium. I used Selenium to solve a similar problem that I ran into. You will need to download ChromeDriver: https://chromedriver.chromium.org/downloads. Here is the code that I used.

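from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
# executable_path is the Selenium 3 form; Selenium 4 expects a Service object instead
driver = webdriver.Chrome(executable_path='YOUR PATH', options=options)
driver.get('YOUR URL')
driver.implicitly_wait(1)
soup_file = BeautifulSoup(driver.page_source, 'html.parser')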

This sets up the driver to connect to the URL, waits until the page has loaded, grabs the page source, and puts it into a BeautifulSoup object.
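
Once the HTML actually contains the table (whether it came from requests or from `driver.page_source`), the two columns can be pulled out by position. A minimal sketch, assuming a hypothetical helper `extract_norad_and_launch` and made-up sample rows that are not real data from the site. It returns an empty list instead of raising when the table is missing:

```python
from bs4 import BeautifulSoup

# Made-up rows for illustration only; not real data from the site.
SAMPLE_HTML = """
<table id="categoriestab">
  <tr><th>Name</th><th>NORAD ID</th><th>Int'l Code</th><th>Launch date</th></tr>
  <tr><td>SAT-A</td><td>11111</td><td>2020-001A</td><td>January 1, 2020</td></tr>
  <tr><td>SAT-B</td><td>22222</td><td>2021-002B</td><td>February 2, 2021</td></tr>
</table>
"""


def extract_norad_and_launch(html, table_id="categoriestab"):
    """Return (NORAD ID, launch date) pairs from the 2nd and 4th columns."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # Table not present in this HTML (e.g. not rendered yet).
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 4:  # skips the header row, which uses <th> cells
            rows.append((cells[1], cells[3]))
    return rows


print(extract_norad_and_launch(SAMPLE_HTML))
```

Checking for `None` before calling `find_all` avoids the `AttributeError` that the original snippet would raise when the table is not found.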

Hope this helps!
