Web scraping with BeautifulSoup - when trying to find table the content is not returned
I am trying to scrape a table from a website, but only the header is returned.
I am new to Python and web scraping and have followed this material, which was very helpful: https://medium.com/analytics-vidhya/how-to-scrape-a-table-from-website-using-python-ce90d0cfb607 .
However, the following code returns only the header, not the body of the table.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Create a URL object
url = 'https://www.dividendmax.com/dividends/declared'

# Fetch the page with a browser-like User-Agent
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

# Obtain information from the <table> tags
table1 = soup.find_all('table')
table1
Output:
[<table aria-label="Declared Dividends" class="mdc-data-table__table">
<thead>
<tr class="mdc-data-table__header-row">
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Company</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Ticker</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Country</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Exchange</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Share Price</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Prev. Dividend</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Next Dividend</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Next Ex-date</th>
</tr>
</thead>
<tbody></tbody>
</table>]
I need to retrieve the tbody contents (found when expanding the second-to-last line of the output).
FYI, the following code will then be used to create the dataframe.
import pandas as pd

# Obtain every column title from the <th> tags
# (table1 is a ResultSet, so index the first table before calling find_all)
headers = []
for i in table1[0].find_all('th'):
    title = i.text
    headers.append(title)

# Create a dataframe
mydata = pd.DataFrame(columns=headers)

# Fill mydata row by row
for j in table1[0].find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
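The header/row extraction logic above can be checked against a small static HTML snippet (a hypothetical two-row table embedded as a string, so it runs without any network access):

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML standing in for the scraped page
html = """
<table>
  <tr><th>Company</th><th>Ticker</th></tr>
  <tr><td>3i Group plc</td><td>III</td></tr>
  <tr><td>Yougov</td><td>YOU</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # a single Tag, not a ResultSet

# Column titles from <th>, then one list of cell texts per data row
headers = [th.text for th in table.find_all("th")]
rows = [[td.text for td in tr.find_all("td")]
        for tr in table.find_all("tr")[1:]]

print(headers)  # ['Company', 'Ticker']
print(rows)     # [['3i Group plc', 'III'], ['Yougov', 'YOU']]
```

With a static table like this the body rows come through fine, which points to the real page's `<tbody>` being filled in later by JavaScript rather than a parsing problem.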
The page you are accessing is different from the one in the tutorial. If you are trying to learn/practice BeautifulSoup, this is probably not the best site for it. For me, though, the data comes back in a nice JSON format.
import requests
import pandas as pd
# Create an URL object
url = 'https://www.dividendmax.com/dividends/declared'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
jsonData = requests.get(url, headers=headers).json()
df = pd.DataFrame(jsonData)
Output:
print(df)
name ... ind
0 3i Group plc ... [22, 25, 23, 3, 5]
1 3I Infrastructure Plc ... [4, 5]
2 AB Dynamics plc ... []
3 Aberdeen Smaller Companies Income Trust plc ... []
4 Aberdeen Standard Equity Income Trust plc ... []
.. ... ... ...
146 Workspace Group ... [25, 4, 24, 5]
147 Wynnstay Properties ... []
148 XP Power Ltd ... [5, 4]
149 Yew Grove REIT Plc ... []
150 Yougov ... []
[151 rows x 11 columns]
The website you are scraping has an API that serves the data used to populate the table. Your request is getting back the HTML skeleton of the page before the table data has been filled in.
If you inspect the page and go to your browser's network tab, look at the fetch/XHR requests. You should see a request going to https://www.dividendmax.com/dividends/declared.json?region=1 and you can query that data directly by sending a request to that URL:
page = requests.get("https://www.dividendmax.com/dividends/declared.json?region=1")
page.json()
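The decoded JSON can then be loaded into a DataFrame just as in the first answer. A sketch with a hand-made sample payload (the field names here are hypothetical; the real endpoint may use different keys):

```python
import pandas as pd

# Hypothetical sample mimicking the JSON array the endpoint returns
jsonData = [
    {"name": "3i Group plc", "ticker": "III"},
    {"name": "Yougov", "ticker": "YOU"},
]

# pandas builds one row per object, one column per key
df = pd.DataFrame(jsonData)
print(df.shape)  # (2, 2)
```

Against the live endpoint you would replace `jsonData` with `page.json()` from the request above.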