Extract all text from an HTML file while checking for bold (Python)

Question

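Input: any HTML file containing bold and non-bold text, spread across different kinds of tags (e.g. <div>, <td>, etc.).

Desired output: a data structure (e.g. a DataFrame or a dictionary) that collects all text elements of the HTML file, together with whether the text inside a given tag is bold. For example: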

import pandas as pd

data = {'Text': ['bold text (1)', "text (2)", "text (3)", "bold text (4)"], 'Bold': ["yes", "no", "no", "yes"]}
df = pd.DataFrame(data)

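Note: as far as I know, bold text can either sit between <b>...</b> tags, or be in any tag that carries the attribute style="font-weight:700;" or style="font-weight:bold;", e.g. <span style="font-weight:bold;">...</span>.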

Reproducible example: here is my sample HTML file, which contains 15 text elements, 4 of which are bold:

<html><head><title>text (1)</title></head><body><div>text (2)</div><div>text (3)</div><div><span>text (4)</span></div><div>text (5)</div><div><span>text (6)</span><span>text (7)</span></div><div><span style="font-weight:bold;">bold text (8)</span></div><div><span>text (9)</span></div><div><span style="font-weight:700;">bold text (10)</span></div><div><span>text (11)</span></div><div><span><b>bold text (12)</b></span></div><div><span>text (13)</span><span><a href="www.google.de"><b>bold text (14)</b></a></span></div><div><span>text (15)</span></div></body></html>

I figured out how to get all the text elements with BeautifulSoup...

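from bs4 import BeautifulSoup

# html_file is the path to the sample .html file shown above
with open(html_file, 'r') as f:
    # create a soup object from the .html file
    soup = BeautifulSoup(f, 'html.parser')
    # collect every text node in the document
    texts = soup.find_all(string=True, recursive=True)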

# output: ['text (1)', 'text (2)', 'text (3)', 'text (4)', 'text (5)', 'text (6)', 'text (7)', 'bold text (8)', 'text (9)', 'bold text (10)', 'text (11)', 'bold text (12)', 'text (13)', 'bold text (14)', 'text (15)']

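...but I can't figure out how to get the information about the tag's attribute (font-weight), nor how to check whether the tag is <b>...</b>. Could you give me a hint?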

Answer 1

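You can get one step closer by checking the parent of each text node: whether its name is b, or whether it has a style attribute that mentions font-weight: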

for e in soup.find_all(string=True, recursive=True):
    style = e.parent.get('style') or ''
    data.append({
        'text': e,
        # True when the text sits directly inside a <b> tag
        'isBoldTag': e.parent.name == 'b',
        # True when the parent's inline style mentions font-weight at all
        'isBoldStyle': 'font-weight' in style
    })
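One caveat with the snippet above: testing only for the substring 'font-weight' would also flag style="font-weight:normal;" as bold. A stricter check could parse the declared value instead. The following is a minimal sketch, not part of the original answer: the helper name is_bold_style and the threshold of 700 for numeric weights are assumptions made for illustration.

```python
def is_bold_style(style, threshold=700):
    """Return True if an inline style string declares a bold font weight.

    `threshold` (an assumption) is the smallest numeric weight counted as bold.
    """
    bold = False
    # An inline style is a ';'-separated list of "name: value" declarations.
    for declaration in (style or '').split(';'):
        name, _, value = declaration.partition(':')
        if name.strip().lower() != 'font-weight':
            continue
        value = value.lower().replace('!important', '').strip()
        # A later font-weight declaration overrides an earlier one.
        bold = value in ('bold', 'bolder') or (
            value.isdigit() and int(value) >= threshold
        )
    return bold


print(is_bold_style('font-weight:bold;'))
print(is_bold_style('font-weight:normal;'))
```

With such a helper, the 'isBoldStyle' entry could be computed as is_bold_style(e.parent.get('style')).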

Example

from bs4 import BeautifulSoup

html='''<html><head><title>text (1)</title></head><body><div>text (2)</div><div>text (3)</div><div><span>text (4)</span></div><div>text (5)</div><div><span>text (6)</span><span>text (7)</span></div><div><span style="font-weight:bold;">bold text (8)</span></div><div><span>text (9)</span></div><div><span style="font-weight:700;">bold text (10)</span></div><div><span>text (11)</span></div><div><span><b>bold text (12)</b></span></div><div><span>text (13)</span><span><a href="www.google.de"><b>bold text (14)</b></a></span></div><div><span>text (15)</span></div></body></html>'''

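# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

data = []

for e in soup.find_all(string=True, recursive=True):
    style = e.parent.get('style') or ''
    data.append({
        'text': e,
        # True when the text sits directly inside a <b> tag
        'isBoldTag': e.parent.name == 'b',
        # True when the parent's inline style mentions font-weight at all
        'isBoldStyle': 'font-weight' in style
    })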

data

Output

[{'text': 'text (1)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (2)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (3)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (4)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (5)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (6)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (7)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (8)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (9)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (10)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (11)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (12)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (13)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (14)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (15)', 'isBoldTag': False, 'isBoldStyle': False}]

Or as a DataFrame, via pd.DataFrame(data):

	text	isBoldTag	isBoldStyle
0	text (1)	False	False
1	text (2)	False	False
2	text (3)	False	False
3	text (4)	False	False
4	text (5)	False	False
5	text (6)	False	False
6	text (7)	False	False
7	bold text (8)	False	True
8	text (9)	False	False
9	bold text (10)	False	True
10	text (11)	False	False
11	bold text (12)	True	False
12	text (13)	False	False
13	bold text (14)	True	False
14	text (15)	False	False
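The check above only looks at the direct parent, so text nested one level deeper (for instance <b><i>...</i></b>) would not be flagged. A possible extension, sketched here under assumptions and not taken from the original answer, walks all ancestors of each text node and builds the Text/Bold structure the question asks for. The names BOLD_TAGS, BOLD_STYLE and is_bold are hypothetical, counting <strong> as bold is an assumption, and the sketch ignores the case where an inner tag resets the weight with font-weight:normal.

```python
import re
from bs4 import BeautifulSoup

BOLD_TAGS = {'b', 'strong'}  # assumption: treat <strong> as bold as well
# Matches font-weight: bold, bolder, 700, 800 or 900 in an inline style.
BOLD_STYLE = re.compile(r'font-weight\s*:\s*(bold|bolder|[7-9]00)\b', re.I)


def is_bold(text_node):
    """Report whether any ancestor of the text node makes it bold."""
    for tag in text_node.parents:
        if tag.name in BOLD_TAGS:
            return True
        if BOLD_STYLE.search(tag.get('style') or ''):
            return True
    return False


html = ('<div><span style="font-weight:700;">bold text (1)</span>'
        '<span>text (2)</span><b><i>bold text (3)</i></b></div>')
soup = BeautifulSoup(html, 'html.parser')

# Same shape as the dictionary in the question: parallel Text / Bold lists.
data = {'Text': [], 'Bold': []}
for node in soup.find_all(string=True):
    data['Text'].append(str(node))
    data['Bold'].append('yes' if is_bold(node) else 'no')

print(data)
```

The resulting dictionary can be passed to pd.DataFrame(data) as in the question.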

Solution 1
1 accepted 2022-06-27 18:32:33
