
Unable to parse HTML table with Beautiful Soup

I am very new to Beautiful Soup and I'm trying to import the data from the URL below as a pandas DataFrame. However, the final result has the correct column names but no numbers in the rows. What should I be doing instead?

Here is my code:

from bs4 import BeautifulSoup
import pandas as pd
import requests

def get_tables(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')
    return pd.read_html(str(table))[0]

url = 'https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html'
html = requests.get(url).content
get_tables(html)

The data you see in the table is loaded from another URL via JavaScript. You can use this example to save the data to CSV:

import json
import requests 
import pandas as pd

data = requests.get('https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/1/G').json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

df = pd.json_normalize(data['quotes'])
df.to_csv('data.csv')

This saves data.csv (screenshot from LibreOffice):

[Screenshot: data.csv opened in LibreOffice]
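To make the `pd.json_normalize` step less opaque, here is a rough standard-library sketch of the flattening it performs on nested dictionaries (nested keys become dot-separated column names). The `flatten` helper and the sample payload are hypothetical; the field names are invented for illustration and are not the real fields returned by the CME endpoint.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts so their keys become dotted columns
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical payload shaped like {"quotes": [...]}; field names are made up
sample = {
    "quotes": [
        {"code": "AAA", "last": "99.5", "priceChart": {"code": "AAA", "enabled": True}},
        {"code": "BBB", "last": "99.4", "priceChart": {"code": "BBB", "enabled": False}},
    ]
}

rows = [flatten(quote) for quote in sample["quotes"]]
columns = list(rows[0])
print(columns)  # ['code', 'last', 'priceChart.code', 'priceChart.enabled']
```

Each flattened dict corresponds to one row of the resulting DataFrame, which is why the CSV ends up with one column per (possibly nested) field.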

The website you're trying to scrape renders the table values dynamically, and requests.get only returns the HTML the server sends before any JavaScript has run. You will have to find an alternative way of accessing the data or render the page's JavaScript (see this example).
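One quick way to confirm this diagnosis is to count the cells in the HTML that the server actually sends. The sketch below uses only the standard library; `TableCellCounter`, `count_cells`, and the sample HTML are hypothetical stand-ins (the real page's markup will differ). It shows why you get column names but no rows: the header cells are present, the data cells are not.

```python
from html.parser import HTMLParser


class TableCellCounter(HTMLParser):
    """Count header (<th>) and data (<td>) cells in an HTML document."""

    def __init__(self):
        super().__init__()
        self.header_cells = 0
        self.data_cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self.header_cells += 1
        elif tag == "td":
            self.data_cells += 1


def count_cells(html):
    """Return (number of <th> cells, number of <td> cells) found in html."""
    counter = TableCellCounter()
    counter.feed(html)
    counter.close()
    return counter.header_cells, counter.data_cells


# Hypothetical static HTML: the header is there, the body is filled in later by JavaScript
static_html = """
<table>
  <thead><tr><th>Month</th><th>Last</th><th>Change</th></tr></thead>
  <tbody></tbody>
</table>
"""

print(count_cells(static_html))  # (3, 0)
```

Against the real page you would pass `requests.get(url).text` to `count_cells`; a result with header cells but zero data cells suggests the rows are filled in client-side.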

A common way to render the JavaScript is to use selenium to automate a browser and then read the rendered page source from it.

Here is a quick example:

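import time
from io import StringIO

import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service

# Request the dynamically loaded page source.
# Selenium 4 takes the driver path via Service; older versions accepted Chrome(r'/path/to/webdriver.exe').
c = Chrome(service=Service(r'/path/to/webdriver.exe'))
c.get('https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html')

# Wait for it to render in the browser
time.sleep(5)
html_data = c.page_source
c.quit()    # Close the browser once the source has been captured

# Load into pd.DataFrame (StringIO because newer pandas versions deprecate passing literal HTML strings)
tables = pd.read_html(StringIO(html_data))
df = tables[0]
df.columns = df.columns.droplevel()    # Convert the MultiIndex to an Index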

Note that I didn't use BeautifulSoup; you can pass the HTML directly to pd.read_html. You'll have to do some more cleaning from there, but that's the gist.
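A fixed `time.sleep(5)` can be too short on a slow connection and wasteful on a fast one. A common alternative is to poll until the rows have actually appeared. The sketch below shows the idea with a hypothetical `wait_until` helper and a fake page object, so it runs without a browser; Selenium ships its own version of this pattern as `WebDriverWait`.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


class FakePage:
    """Stand-in for a browser page whose table rows appear on the third check."""

    def __init__(self):
        self.checks = 0

    def rows(self):
        self.checks += 1
        return ["row 1", "row 2"] if self.checks >= 3 else []


page = FakePage()
rows = wait_until(page.rows, timeout=5, interval=0.01)
print(rows)  # ['row 1', 'row 2']
```

With a real driver, the condition could be something like `lambda: c.find_elements("css selector", "table tbody tr")`; that selector is an assumption about the page's markup, not something taken from the site.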

Alternatively, you can take a peek at requests-html, a library that offers JavaScript rendering and might be able to help, or search for a way to access the data as JSON or .csv from elsewhere and use that instead.
