
Unable to parse HTML table with Beautiful Soup

I am very new to Beautiful Soup and I'm trying to import the data from the URL below into a pandas DataFrame. The result has the correct column names, but the rows contain no values. What should I be doing instead?

Here is my code:

from bs4 import BeautifulSoup
import pandas as pd
import requests

def get_tables(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')
    return pd.read_html(str(table))[0]

url = 'https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html'
html = requests.get(url).content
get_tables(html)

The data you see in the table is loaded from another URL via JavaScript. You can request that URL directly and save the data to CSV, as in this example:

import json
import requests 
import pandas as pd

data = requests.get('https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/1/G').json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

df = pd.json_normalize(data['quotes'])
df.to_csv('data.csv')

This saves the quotes to data.csv.
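As a rough sketch of what pd.json_normalize does with a nested response, and of how to keep only a few columns before writing the file, here is a small offline example. The payload and every field name in it are invented for illustration; they are not taken from the real CME response, so inspect the actual JSON before choosing columns.

```python
import pandas as pd

# Hypothetical payload: field names and values are made up for illustration
# and do NOT come from the real endpoint.
sample = {
    "quotes": [
        {"expirationMonth": "JUN 2021", "last": "99.815",
         "priceChart": {"code": "GEM1"}},
        {"expirationMonth": "SEP 2021", "last": "99.790",
         "priceChart": {"code": "GEU1"}},
    ]
}

# json_normalize flattens nested dicts into dotted column names,
# so {"priceChart": {"code": ...}} becomes the column "priceChart.code".
df = pd.json_normalize(sample["quotes"])

# Keep only the columns of interest, in a chosen order.
slim = df[["expirationMonth", "priceChart.code", "last"]]

# index=False keeps the row index out of the CSV output.
csv_text = slim.to_csv(index=False)
print(csv_text)
```

Passing a path instead, for example slim.to_csv('data.csv', index=False), writes the same content to disk.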

The website you're trying to scrape renders the table values dynamically, and requests.get returns only the HTML the server sends before any JavaScript runs. You will have to find another way of accessing the data, or render the page's JavaScript.
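One way to confirm this diagnosis is to check whether the raw HTML contains header cells but no populated data cells. The sketch below is a heuristic, not a general detector; it uses only the standard library's html.parser, and the names CellCounter and looks_js_rendered are hypothetical helpers introduced here. The two HTML strings are made-up stand-ins for a server response and a rendered page.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Count <th> cells and non-empty <td> cells in an HTML document."""

    def __init__(self):
        super().__init__()
        self.header_cells = 0
        self.filled_data_cells = 0
        self._in_td = False
        self._td_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self.header_cells += 1
        elif tag == "td":
            # Start collecting the text of a new data cell.
            self._in_td = True
            self._td_text = []

    def handle_data(self, data):
        if self._in_td:
            self._td_text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            # Only cells with visible text count as populated.
            if "".join(self._td_text).strip():
                self.filled_data_cells += 1
            self._in_td = False


def looks_js_rendered(html):
    """Return True if there are table headers but no populated data cells."""
    counter = CellCounter()
    counter.feed(html)
    return counter.header_cells > 0 and counter.filled_data_cells == 0


# Made-up examples: headers only, then the same table with one filled row.
static_html = (
    "<table><thead><tr><th>Month</th><th>Last</th></tr></thead>"
    "<tbody></tbody></table>"
)
rendered_html = static_html.replace(
    "<tbody></tbody>",
    "<tbody><tr><td>JUN 2021</td><td>99.815</td></tr></tbody>",
)

print(looks_js_rendered(static_html))
print(looks_js_rendered(rendered_html))
```

Calling looks_js_rendered(requests.get(url).text) on the real page gives a quick hint. If it returns True, the values have to come either from the underlying data endpoint or from a page rendered in a browser.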

A common way of doing this is to use Selenium to automate a browser, which renders the JavaScript and lets you read the resulting page source.

Here is a quick example:

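import time
from io import StringIO

import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service

# Request the dynamically loaded page source.
# Selenium 4 takes the driver path through a Service object.
c = Chrome(service=Service(r'/path/to/webdriver.exe'))
c.get('https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html')

# Wait for it to render in the browser
time.sleep(5)
html_data = c.page_source
c.quit()  # Close the browser once the source has been captured

# Load into pd.DataFrame (recent pandas versions expect a file-like object)
tables = pd.read_html(StringIO(html_data))
df = tables[0]
df.columns = df.columns.droplevel()    # Convert the MultiIndex to an Index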
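A fixed time.sleep(5) can be too short on a slow connection and wasteful on a fast one. Selenium ships WebDriverWait for element-based waits; the sketch below shows the same idea in a dependency-free form, polling page_source until a caller-supplied check passes. The helper name wait_for_source and the readiness check are assumptions made for illustration, not part of Selenium.

```python
import time


def wait_for_source(driver, is_ready, timeout=15.0, poll=0.5):
    """Poll driver.page_source until is_ready(html) is true.

    Returns the page source that passed the check, or raises
    TimeoutError once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"page not ready after {timeout} seconds")
        time.sleep(poll)
```

With a real driver this could replace the sleep and the page_source read, for example html_data = wait_for_source(c, lambda html: '<td' in html). The '<td' check is only a guess at what signals a populated table; a stricter condition may be needed for this page.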

Note that I didn't use BeautifulSoup; you can pass the HTML directly to pd.read_html. You'll have to do some more cleaning from there, but that's the gist.
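As a sketch of what that extra cleaning might look like: columns that contain placeholders such as "-" tend to come back from pd.read_html as text, so they need an explicit numeric conversion. The DataFrame below is a hypothetical stand-in for tables[0]; its column names and values are invented and will differ from the real page.

```python
import pandas as pd

# Hypothetical stand-in for tables[0]: names and values are invented.
raw = pd.DataFrame(
    [["JUN 2021", "99.815", "1,204"], ["SEP 2021", "-", "0"]],
    columns=pd.MultiIndex.from_tuples(
        [("Quotes", "Month"), ("Quotes", "Last"), ("Quotes", "Volume")]
    ),
)

df = raw.copy()
# Drop the top header level and keep the inner labels.
df.columns = df.columns.droplevel(0)

for col in ["Last", "Volume"]:
    # Remove thousands separators, then convert to numbers.
    # errors="coerce" turns placeholders such as "-" into NaN.
    cleaned = df[col].str.replace(",", "", regex=False)
    df[col] = pd.to_numeric(cleaned, errors="coerce")

print(df)
```

Using errors="coerce" keeps the rows with missing quotes instead of raising, at the cost of silently turning any unexpected text into NaN.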

Alternatively, you can take a peek at requests-html, a library that offers JavaScript rendering and might be able to help, or look for a way to access the data as JSON or CSV from elsewhere and use that.


 