Web scraping with BeautifulSoup - when trying to find table the content is not returned

I am trying to scrape a table from a website, but only the header is returned.

I am new to Python and web scraping and have followed this material, which was very helpful: https://medium.com/analytics-vidhya/how-to-scrape-a-table-from-website-using-python-ce90d0cfb607 .

However, the following code returns only the header and not the body of the table.

from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Create a URL object
url = 'https://www.dividendmax.com/dividends/declared'

# Request the page with a browser-like User-Agent and parse the HTML
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

# Obtain information from tag <table>
table1 = soup.find_all('table')
table1

Output:

[<table aria-label="Declared Dividends" class="mdc-data-table__table">
 <thead>
 <tr class="mdc-data-table__header-row">
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Company</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Ticker</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Country</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Exchange</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Share Price</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Prev. Dividend</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Next Dividend</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Next Ex-date</th>
 </tr>
 </thead>
 <tbody></tbody>
 </table>]

I need to retrieve the tbody contents (they should appear where the second-to-last line of the output is expanded).

FYI, the following code will then be used to create the dataframe.

import pandas as pd

# find_all() returns a list-like ResultSet, so take the first table
table = table1[0]

# Obtain every title of columns with tag <th>
headers = [th.text for th in table.find_all('th')]

# Create a dataframe
mydata = pd.DataFrame(columns=headers)

# Create a for loop to fill mydata (skip the header row)
for tr in table.find_all('tr')[1:]:
    row = [td.text for td in tr.find_all('td')]
    mydata.loc[len(mydata)] = row

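The page you are accessing is different from the one in the tutorial. If you are trying to learn or practice BeautifulSoup, this may not be the best site for it. For me, though, the data comes back in a nice JSON format.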

import requests
import pandas as pd

# Create an URL object
url = 'https://www.dividendmax.com/dividends/declared'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}

jsonData = requests.get(url, headers=headers).json()
df = pd.DataFrame(jsonData)

Output:

print(df)
                                            name  ...                 ind
0                                   3i Group plc  ...  [22, 25, 23, 3, 5]
1                          3I Infrastructure Plc  ...              [4, 5]
2                                AB Dynamics plc  ...                  []
3    Aberdeen Smaller Companies Income Trust plc  ...                  []
4      Aberdeen Standard Equity Income Trust plc  ...                  []
..                                           ...  ...                 ...
146                              Workspace Group  ...      [25, 4, 24, 5]
147                          Wynnstay Properties  ...                  []
148                                 XP Power Ltd  ...              [5, 4]
149                           Yew Grove REIT Plc  ...                  []
150                                       Yougov  ...                  []

[151 rows x 11 columns]

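The website you are scraping has an API that supplies the data used to populate the table. Your request gets back only the HTML skeleton of the page, before the table data has been filled in.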
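A quick way to confirm this is to check whether the parsed table has any body rows. The sketch below runs on a trimmed-down HTML string modelled on the output in the question, not on the live page; the helper name `table_has_rows` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the HTML the server returns (hypothetical sample,
# modelled on the output shown in the question).
html = """
<table aria-label="Declared Dividends" class="mdc-data-table__table">
  <thead>
    <tr><th>Company</th><th>Ticker</th></tr>
  </thead>
  <tbody></tbody>
</table>
"""


def table_has_rows(table):
    """Return True if the table's <tbody> contains at least one <tr>."""
    body = table.find("tbody")
    return body is not None and body.find("tr") is not None


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# The header cells are present in the static HTML...
header_cells = [th.get_text(strip=True) for th in table.find_all("th")]
# ...but the body is empty, which suggests the rows are loaded separately.
has_rows = table_has_rows(table)

print(header_cells)
print(has_rows)
```

If `has_rows` is `False` for the HTML you downloaded while the browser shows a full table, the rows are most likely being added by a separate request after the page loads.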

If you inspect the page and open the Network tab of your browser's developer tools, look at the fetch/XHR requests. You should see a request going to https://www.dividendmax.com/dividends/declared.json?region=1 . You can query that data directly by sending a request to that URL:

import requests

page = requests.get("https://www.dividendmax.com/dividends/declared.json?region=1")
page.json()
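To get from that JSON to a dataframe, something along these lines may work. It is a minimal sketch: the `User-Agent` header, the timeout, and the helper names `fetch_records` and `records_to_frame` are assumptions, and it assumes the endpoint returns a list of records. The sample records reuse the `name` and `ind` values visible in the output of the other answer, so nothing is fetched when the block runs.

```python
import pandas as pd
import requests

# Endpoint taken from the answer; header value and timeout are assumptions.
JSON_URL = "https://www.dividendmax.com/dividends/declared.json?region=1"


def fetch_records(url=JSON_URL, timeout=10):
    """Fetch the JSON payload and return it as parsed Python data."""
    response = requests.get(
        url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()


def records_to_frame(records):
    """Build a DataFrame from a list of dict records (one dict per row)."""
    return pd.DataFrame(records)


# Hypothetical two-row sample, only to show the shape records_to_frame handles.
sample = [
    {"name": "3i Group plc", "ind": [22, 25, 23, 3, 5]},
    {"name": "AB Dynamics plc", "ind": []},
]
df = records_to_frame(sample)
print(df.shape)
```

For live data the call would be `df = records_to_frame(fetch_records())`. If the payload turns out to be nested rather than a flat list of records, `pd.json_normalize` may be a better fit than the plain constructor.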
