
Table not scraping correctly python BeautifulSoup

I have the following code, which tries to scrape the main table on this page. I need the NORAD ID and launch date, which are the 2nd and 4th columns. However, I can't get BeautifulSoup to find the table by its ID.

import requests
from bs4 import BeautifulSoup

data = []

URL = 'https://www.n2yo.com/satellites/?c=52&srt=2&dir=1'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find("table", id="categoriestab")
rows = table.find_all('tr')

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

print(data)

To get the NORAD ID and launch date, you can try this:

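import pandas as pd

url = "https://www.n2yo.com/satellites/?c=52&srt=2&dir=0"
# read_html returns a list with one DataFrame per <table> found on the page
tables = pd.read_html(url)

# Take the third table and drop the columns that are not needed
data = tables[2].drop(["Name", "Int'l Code", "Period[minutes]", "Action"], axis=1)
print(data)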

The output should contain only the NORAD ID and launch date columns (the original answer showed it as an image).
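Picking the table by position (index 2) can break if the page layout changes. As a rough sketch, pandas can also match on the table's id through the attrs argument of read_html and keep columns by name. The sample markup, the helper name read_table_by_id, and the values in it are made up for illustration, and the column names are assumed to match the real page.

```python
from io import StringIO

import pandas as pd

# Made-up sample markup standing in for the real page (assumption).
SAMPLE_HTML = """
<table id="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table id="categoriestab">
  <tr><th>Name</th><th>NORAD ID</th><th>Int'l Code</th><th>Launch date</th></tr>
  <tr><td>SAT-A</td><td>11111</td><td>2020-001A</td><td>2020-01-01</td></tr>
  <tr><td>SAT-B</td><td>22222</td><td>2020-002A</td><td>2020-02-02</td></tr>
</table>
"""


def read_table_by_id(html, table_id, columns):
    """Hypothetical helper: parse the table with the given id, keep only `columns`."""
    # attrs restricts read_html to tables whose attributes match
    tables = pd.read_html(StringIO(html), attrs={"id": table_id})
    return tables[0][columns]


result = read_table_by_id(SAMPLE_HTML, "categoriestab", ["NORAD ID", "Launch date"])
print(result)
```

For a live page, the same call would take the downloaded HTML text instead of SAMPLE_HTML. read_html needs an HTML parser such as lxml installed.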

Change

soup = BeautifulSoup(page.content, 'html.parser')

to

soup = BeautifulSoup(page.content, 'lxml')
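Whether a different parser helps depends on whether the table is in the downloaded HTML at all. A minimal sketch for checking this, assuming a hypothetical helper find_id_with_parsers and made-up sample markup, could try each parser and report whether the id was found:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up sample markup (assumption); for a real check, pass page.content instead.
SAMPLE_HTML = '<html><body><table id="categoriestab"><tr><td>1</td></tr></table></body></html>'


def find_id_with_parsers(html, tag, element_id, parsers=("html.parser", "lxml")):
    """Hypothetical helper: map parser name -> True/False, or None if not installed."""
    results = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # The requested parser (for example lxml) is not installed
            results[parser] = None
            continue
        results[parser] = soup.find(tag, id=element_id) is not None
    return results


found = find_id_with_parsers(SAMPLE_HTML, "table", "categoriestab")
missing = find_id_with_parsers(SAMPLE_HTML, "table", "no-such-id")
print(found)
print(missing)
```

If every installed parser reports False on the real response, the table is probably not in the raw HTML, and changing the parser alone will not help.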

If you print the soup and search the output, you will not find the ID you are looking for. This most likely means the page is rendered with JavaScript. You can look into using PhantomJS or Selenium. I used Selenium to solve a similar problem that I ran into. You will need to download ChromeDriver: https://chromedriver.chromium.org/downloads . Here is the code that I used.

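from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
# Selenium 3 style; Selenium 4 expects service=Service('<YOUR PATH>') instead of executable_path
driver = webdriver.Chrome(executable_path='<YOUR PATH>', options=options)
driver.get('YOUR URL')
driver.implicitly_wait(1)
soup_file = BeautifulSoup(driver.page_source, 'html.parser')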

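This sets up the driver, loads the URL (driver.get returns once the page has finished loading), grabs the rendered page source, and puts it into a BeautifulSoup object. Note that implicitly_wait(1) only sets how long element lookups wait, so content that JavaScript adds later may need an explicit wait.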
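Once the rendered HTML is available, the 2nd and 4th columns can be pulled out with a guard for the case where the table is missing. The following is only a sketch: extract_columns is a hypothetical helper, and SAMPLE_PAGE_SOURCE is made-up markup standing in for driver.page_source.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for driver.page_source (assumption).
SAMPLE_PAGE_SOURCE = """
<table id="categoriestab">
  <tr><th>Name</th><th>NORAD ID</th><th>Int'l Code</th><th>Launch date</th></tr>
  <tr><td>SAT-A</td><td>11111</td><td>2020-001A</td><td>2020-01-01</td></tr>
  <tr><td>SAT-B</td><td>22222</td><td>2020-002A</td><td>2020-02-02</td></tr>
</table>
"""


def extract_columns(page_source, table_id, column_indexes):
    """Hypothetical helper: return the chosen (0-based) columns of each data row."""
    soup = BeautifulSoup(page_source, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # find() returns None when nothing matches, e.g. a JavaScript-built table
        raise ValueError(f"no <table id={table_id!r}> in the page source")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Skip header rows (no <td>) and rows that are too short
        if len(cells) > max(column_indexes):
            rows.append([cells[i] for i in column_indexes])
    return rows


rows = extract_columns(SAMPLE_PAGE_SOURCE, "categoriestab", (1, 3))
print(rows)
```

Raising an explicit error when the table is missing makes the failure easier to read than the AttributeError that table.find_all would raise on None.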

Hope this helps!
