Unable to parse HTML table with Beautiful Soup
I am very new to using Beautiful Soup and I'm trying to import data from the URL below as a pandas DataFrame. However, the final result has the correct column names but no values in the rows. What should I be doing instead?
Here is my code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_tables(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')
    return pd.read_html(str(table))[0]

url = 'https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html'
html = requests.get(url).content
get_tables(html)
The data you see in the table is loaded from another URL via JavaScript. You can use this example to save the data to CSV:
import json
import requests
import pandas as pd
data = requests.get('https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/1/G').json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
df = pd.json_normalize(data['quotes'])
df.to_csv('data.csv')
This saves data.csv.
[Screenshot: data.csv opened in LibreOffice]
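To make the `json_normalize` step concrete, here is a small sketch on a made-up payload. Only the top-level `quotes` key comes from the code above; the helper `quotes_to_frame`, the field names (`month`, `last`, `info.code`) and the values are hypothetical and are not the real shape of the CME response.

```python
import pandas as pd

# Hypothetical payload: only the "quotes" key is taken from the example above.
sample = {
    "quotes": [
        {"month": "JUN 2021", "last": "99.8250", "info": {"code": "GEM1"}},
        {"month": "SEP 2021", "last": "99.8100", "info": {"code": "GEU1"}},
    ]
}


def quotes_to_frame(payload):
    """Flatten payload['quotes'] into a DataFrame, one row per quote."""
    quotes = payload.get("quotes")
    if not quotes:
        raise ValueError("payload has no 'quotes' entries")
    return pd.json_normalize(quotes)


df = quotes_to_frame(sample)
print(df.columns.tolist())  # nested keys are joined with a dot, e.g. 'info.code'
print(len(df))
```

Checking for an empty `quotes` list before normalizing gives a clearer error than an empty CSV if the endpoint ever returns something unexpected.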
The website you're trying to scrape data from renders the table values dynamically, and requests.get will only return the HTML the server sends prior to JavaScript rendering. You will have to find an alternative way of accessing the data or render the webpage's JavaScript (see this example).
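One way to confirm this before reaching for a browser is to count the data rows in the HTML the server actually sent. The sketch below uses only the standard library; the `RowCounter` class, the `count_data_rows` helper, and the `static_html` sample are made up for illustration, with the sample standing in for `requests.get(url).text`.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> elements that contain at least one <td> cell."""

    def __init__(self):
        super().__init__()
        self.data_rows = 0
        self._in_row = False
        self._row_has_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._in_row = True
            self._row_has_cell = False
        elif tag == "td" and self._in_row:
            self._row_has_cell = True

    def handle_endtag(self, tag):
        if tag == "tr":
            if self._row_has_cell:
                self.data_rows += 1
            self._in_row = False


def count_data_rows(html):
    """Return the number of table rows that hold data cells."""
    parser = RowCounter()
    parser.feed(html)
    return parser.data_rows


# Hypothetical sample: header row present, body filled in later by JavaScript.
static_html = """
<table>
  <thead><tr><th>Month</th><th>Last</th></tr></thead>
  <tbody></tbody>
</table>
"""

print(count_data_rows(static_html))  # 0 data rows: only the header was sent
```

If the count is zero for the real page source, the column names come from the static header and the values are being filled in afterwards, which matches the symptom in the question.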
A common way of doing this is to use selenium to automate a browser, which allows you to render the JavaScript and get the source code that way.
Here is a quick example:
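import time
import pandas as pd
from selenium.webdriver import Chrome

# Request the dynamically loaded page source.
# Note: passing the driver path positionally is Selenium 3 style; Selenium 4
# expects Chrome(service=Service(r'/path/to/webdriver.exe')) instead, with
# Service imported from selenium.webdriver.chrome.service.
c = Chrome(r'/path/to/webdriver.exe')
c.get('https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html')

# Wait for it to render in the browser
time.sleep(5)
html_data = c.page_source
c.quit()  # close the browser once the source has been captured

# Load into pd.DataFrame
tables = pd.read_html(html_data)
df = tables[0]
df.columns = df.columns.droplevel()  # Convert the MultiIndex to an Index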
Note that I didn't use BeautifulSoup; you can pass the HTML directly to pd.read_html. You'll have to do some more cleaning from there, but that's the gist.
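As an illustration of the kind of cleaning meant here, the sketch below builds a small frame with two-level column headers, similar in shape to what `pd.read_html` returns for a table with a grouped header, then flattens the columns and converts text columns to numbers. The column names and values are invented for the example and do not come from the CME page.

```python
import pandas as pd

# Invented stand-in for tables[0]: a two-level header, values read as text.
columns = pd.MultiIndex.from_tuples(
    [("Quotes", "Month"), ("Quotes", "Last"), ("Quotes", "Volume")]
)
raw = pd.DataFrame(
    [["JUN 2021", "99.825", "1,204"], ["SEP 2021", "-", "0"]],
    columns=columns,
)

clean = raw.copy()
# Drop the top header level ("Quotes"), keeping "Month", "Last", "Volume".
clean.columns = clean.columns.droplevel(0)
# Placeholders such as "-" become NaN instead of raising.
clean["Last"] = pd.to_numeric(clean["Last"], errors="coerce")
# Strip thousands separators before converting to integers.
clean["Volume"] = clean["Volume"].str.replace(",", "", regex=False).astype(int)

print(clean.dtypes)
```

Using `errors="coerce"` is a judgment call: it keeps the load from failing on placeholder cells, at the cost of silently turning any unparseable value into NaN.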
Alternatively, you can take a peek at requests-html, a library that offers JavaScript rendering and might be able to help, or search for a way to access the data as JSON or .csv from elsewhere and use that.