Unable to parse an HTML table with Beautiful Soup

Question

I am very new to Beautiful Soup and I'm trying to import the data from the URL below into a pandas DataFrame. However, the final result has the correct column names but no data in the rows. What should I be doing instead?

Here is my code:

from bs4 import BeautifulSoup
import pandas as pd
import requests

def get_tables(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')
    return pd.read_html(str(table))[0]

url = 'https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html'
html = requests.get(url).content
get_tables(html)

Answer 1

The data you see in the table is loaded from another URL via JavaScript. You can use this example to save the data to CSV:

import json
import requests 
import pandas as pd

data = requests.get('https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/1/G').json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

df = pd.json_normalize(data['quotes'])
df.to_csv('data.csv')

This writes the quotes to data.csv, which can then be opened in a spreadsheet program such as LibreOffice.
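To see what pd.json_normalize does before pointing it at the live endpoint, here is a minimal offline sketch. Only the top-level "quotes" key comes from the answer above; the field names and values in the sample payload are made up for illustration and are not the real CME response.

```python
import pandas as pd

# Hypothetical payload: the field names and values are placeholders,
# only the top-level "quotes" key is taken from the answer above.
sample = {
    "quotes": [
        {"expirationMonth": "DEC 2020", "last": "99.755", "priceChart": {"code": "GEZ0"}},
        {"expirationMonth": "MAR 2021", "last": "99.800", "priceChart": {"code": "GEH1"}},
    ]
}

# One row per quote; nested dicts become dotted column names.
df = pd.json_normalize(sample["quotes"])

# Values arrive as strings in this sample, so convert before doing arithmetic.
df["last"] = pd.to_numeric(df["last"], errors="coerce")

# Keep only the columns of interest.
subset = df[["expirationMonth", "last", "priceChart.code"]]
print(subset)
```

Printing df.columns on the real response is a quick way to find out which fields are actually available.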

Answer 2

The website you're trying to scrape renders the table values dynamically, and requests.get only returns the HTML the server sends before any JavaScript runs. You will have to find another way of accessing the data or render the page's JavaScript yourself.
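A quick way to confirm this diagnosis is to count header and data cells in the HTML the server actually returned. The sketch below uses only the standard library; the HTML string is a hypothetical stand-in for such a response, and CellCounter and count_cells are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Count header (<th>) and data (<td>) cells in an HTML string."""

    def __init__(self):
        super().__init__()
        self.th = 0
        self.td = 0

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self.th += 1
        elif tag == "td":
            self.td += 1


def count_cells(html):
    """Return (number of <th> cells, number of <td> cells)."""
    parser = CellCounter()
    parser.feed(html)
    return parser.th, parser.td


# Hypothetical server response: headers are present, but the body is
# left empty for JavaScript to fill in after the page loads.
server_html = """
<table id="quotes">
  <thead><tr><th>Month</th><th>Last</th><th>Change</th></tr></thead>
  <tbody></tbody>
</table>
<script>/* rows are fetched and inserted here at runtime */</script>
"""

headers, cells = count_cells(server_html)
print(headers, cells)  # 3 0
```

Headers without data cells, as here, match the symptom in the question: correct column names and no rows.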

A common way of doing this is to use selenium to automate a browser, which renders the JavaScript so that you can read the resulting page source.

Here is a quick example:

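import time
from io import StringIO

import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service

# Request the dynamically loaded page source.
# Selenium 4 takes the driver path through a Service object.
c = Chrome(service=Service(r'/path/to/webdriver.exe'))
c.get('https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html')

# Wait for it to render in the browser
time.sleep(5)
html_data = c.page_source
c.quit()    # Close the browser once the source is captured

# Load into a pd.DataFrame (newer pandas expects a file-like object)
tables = pd.read_html(StringIO(html_data))
df = tables[0]
df.columns = df.columns.droplevel()    # Convert the MultiIndex to an Index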

Note that I didn't use BeautifulSoup; you can pass the HTML directly to pd.read_html. You'll have to do some more cleaning from there, but that's the gist.
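The droplevel step and the follow-up cleaning can be tried without a browser. This is a minimal sketch on a hand-built frame: the two-level header and the values are hypothetical, chosen only to mimic what pd.read_html returns for a table with a grouped header row.

```python
import pandas as pd

# Hypothetical two-level header, like a table whose first header row
# spans all columns.
columns = pd.MultiIndex.from_tuples(
    [("Quotes", "Month"), ("Quotes", "Last"), ("Quotes", "Change")]
)
df = pd.DataFrame(
    [["DEC 2020", "99.755", "+0.005"], ["MAR 2021", "99.800", "-"]],
    columns=columns,
)

# droplevel() removes level 0 by default, leaving a flat Index.
df.columns = df.columns.droplevel()

# Typical cleaning: turn text into numbers; unparseable cells become NaN.
df["Last"] = pd.to_numeric(df["Last"], errors="coerce")
df["Change"] = pd.to_numeric(df["Change"], errors="coerce")

print(df.dtypes)
```

Using errors="coerce" keeps the conversion from failing on placeholder cells such as "-", at the cost of silently turning them into NaN.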

Alternatively, you can take a peek at requests-html, a library that offers JavaScript rendering and might be able to help, or look for a way to get the data as JSON or CSV from elsewhere and use that.

Answer 1: 3 votes, accepted, 2020-10-04 19:15:21

Answer 2: 1 vote, 2020-10-04 19:25:15
