Python: Extracting data from the HTML <main> tag using BeautifulSoup

Question

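I am currently learning data scraping with the BeautifulSoup package. Right now I am trying to get the list of movie franchises from the Box Office Mojo website ( https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab ).

The main problem is that I cannot seem to access or extract the data inside the <main> tag. Below is the code I am using.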

import requests
from bs4 import BeautifulSoup

listOfFranchiseLink = "https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab"

r = requests.get(listOfFranchiseLink)
soup = BeautifulSoup(r.content, 'html.parser')

s0 = soup.find('div', id='a-page')
s1 = s0.find(id='')
s2 = s1.find('div', id='a-section mojo-body aok-relative')

assert s1 is not None
assert s2 is not None

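Although the script does find something for s1, it does not look like what I expected (it should contain, at the top, a div with the class "a-section mojo-body aok-relative"). As a result, I get nothing for s2.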

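My questions are:

What exactly am I doing wrong? How do I extract the data inside the <main> tag?
I feel that creating a soup object for each layer is not very efficient. What is the more standard way to extract data buried under several layers of HTML tags?

Edit: I meant to write s0.find('main') instead of s0.find(id=''). However, the former returns the same result as the latter, so it makes no difference.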

Answer 1

This happens because s2 is actually None, since s1 returns:

<script data-a-state='{"key":"a-wlab-states"}' type="a-state">{}</script>

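Searching inside that element for id='a-section mojo-body aok-relative' cannot produce any results, which is why the second assertion fails. Note also that "a-section mojo-body aok-relative" is a list of class names, not an id, so an id lookup would not match that div in any case.

If you just want to scrape the table, you can do it with only pandas and requests, like this: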
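To illustrate the difference between an id lookup and a class lookup, here is a minimal sketch. The SAMPLE_HTML markup is a hypothetical, heavily simplified stand-in for the real page (whose structure may differ), and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page is much larger
# and its structure is an assumption here.
SAMPLE_HTML = """
<div id="a-page">
  <script type="a-state">{}</script>
  <main>
    <div class="a-section mojo-body aok-relative">
      <p>franchise table goes here</p>
    </div>
  </main>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
page = soup.find("div", id="a-page")

# No element has this string as its id, so the lookup returns None.
by_id = page.find("div", id="a-section mojo-body aok-relative")

# Matching on one of the element's CSS classes finds the div.
by_class = page.find("div", class_="mojo-body")

# A single CSS selector can walk several nesting levels in one call.
by_selector = soup.select_one("div#a-page main div.a-section.mojo-body.aok-relative")

print(by_id)
print(by_class.p.get_text())
print(by_selector.p.get_text())
```

A CSS selector such as the one above is one common way to avoid chaining a separate find call per nesting level. Whether it matches on the live page depends on the HTML that requests actually receives.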

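import requests
import pandas as pd
from io import StringIO

url = "https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab"
html = requests.get(url).text

# Wrap the HTML in StringIO: recent pandas versions deprecate
# passing a literal HTML string to read_html.
# read_html returns one DataFrame per table; take the first.
df = pd.read_html(StringIO(html), flavor="lxml")[0]
print(df)

This gives: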

                           Franchise  ... Lifetime Gross
0          Marvel Cinematic Universe  ...   $858,373,000
1                          Star Wars  ...   $936,662,225
2    Disney Live Action Reimaginings  ...   $543,638,043
3                         Spider-Man  ...   $804,789,334
4     J.K. Rowling's Wizarding World  ...   $381,011,219
..                               ...  ...            ...
287                 Ip Man Franchise  ...     $2,679,437
288                   Chal Mera Putt  ...       $644,000
289                           Shiloh  ...     $1,007,822
290                       Evangelion  ...       $174,945
291                            V/H/S  ...       $100,345

[292 rows x 5 columns]
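If you would rather stay with BeautifulSoup, the same idea can be sketched by selecting the table rows under <main> and reading the cells. This is only a sketch under assumptions: SAMPLE_HTML is a hypothetical two-column excerpt (the real table has five columns), and table_to_records is an illustrative helper, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified excerpt; the values are copied from the
# output above, but the real markup and column set are assumptions.
SAMPLE_HTML = """
<main>
  <table>
    <tr><th>Franchise</th><th>Lifetime Gross</th></tr>
    <tr><td>Marvel Cinematic Universe</td><td>$858,373,000</td></tr>
    <tr><td>Star Wars</td><td>$936,662,225</td></tr>
  </table>
</main>
"""


def table_to_records(html):
    """Return the rows of the first-level table under <main> as dicts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("main table tr")
    if not rows:
        return []
    # The first row holds the column headers.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Skip rows whose cell count does not match the header.
        if len(cells) == len(header):
            records.append(dict(zip(header, cells)))
    return records


records = table_to_records(SAMPLE_HTML)
for record in records:
    print(record)
```

The pandas version is shorter when all you need is the table; a row-by-row approach like this one may be more convenient when you also need attributes such as the link behind each franchise name.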

Solution 1
2 (accepted) 2022-06-22 14:49:40
