How to get the <text> tag from an HTML document using Beautiful Soup

Question

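How can I use Beautiful Soup to get the <text> tag from an HTML document, in this case the Abbott Laboratories 10-K filing?

I want to extract the tag names of all children of the <text></text> tag using the code below: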

from bs4 import BeautifulSoup
import urllib.request
url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'
htmlpage = urllib.request.urlopen(url)
soup = BeautifulSoup(htmlpage, "html.parser")
all_text = soup.find('text')
all_tags = all_text.contents
all_tags = [x.name for x in all_tags if x.name is not None]
print(all_tags)

But the output I get from the code above is ['html'].

Expected output:
['p','p','p','p','p','p','div','div','font','font', etc......]

Answer 1

You can use a CSS selector (this prints the name of every descendant of the <text> tag):

for child in all_text.select('text *'):
    print(child.name, end=' ')

Prints:

br p font font b p font b br p font b div div ...

Edit: to print only the direct children of the <text> tag, you can use:

from bs4 import BeautifulSoup
import requests

url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'

htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")

for child in soup.select('text > *'):
    print(child.name, end=' ')
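To see the difference between the two selectors without downloading the filing, here is a minimal sketch on a small inline document. The `SAMPLE` markup and the two helper functions are made up for illustration and are not taken from the 10-K. It uses `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the filing: a <text> element wrapping a few tags.
SAMPLE = (
    "<document><text>"
    "<p>one <b>bold</b></p>"
    "<div><font>two</font></div>"
    "</text></document>"
)

def descendant_names(markup):
    """Names of every tag nested anywhere inside <text>."""
    soup = BeautifulSoup(markup, "html.parser")
    return [tag.name for tag in soup.select("text *")]

def direct_child_names(markup):
    """Names of only the tags that sit directly under <text>."""
    soup = BeautifulSoup(markup, "html.parser")
    return [tag.name for tag in soup.select("text > *")]

print(descendant_names(SAMPLE))    # ['p', 'b', 'div', 'font']
print(direct_child_names(SAMPLE))  # ['p', 'div']
```

The descendant combinator (a space) walks the whole subtree, while the child combinator (`>`) stops one level down.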

Answer 2

Replace this part of your code:

all_tags = all_text.contents
all_tags = [x.name for x in all_tags if x.name is not None]
print(all_tags)

with:

all_tags = [x.name for x in all_text.findChildren() if x.name is not None]
print(all_tags)

More details on findChildren()
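Note that findChildren() is an older alias of find_all(), so by default it returns every descendant, not just the direct children. If you only want the direct children, passing recursive=False should do it. A minimal sketch on a made-up snippet (the `SAMPLE` string is an assumption for illustration, not the real filing):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup standing in for the contents of the filing's <text> tag.
SAMPLE = "<text><p>one <b>bold</b></p>\n<div><font>two</font></div></text>"

soup = BeautifulSoup(SAMPLE, "html.parser")
text_tag = soup.find("text")

# Every tag nested anywhere below <text> (what findChildren() returns too).
all_descendants = [t.name for t in text_tag.find_all()]

# Only the tags directly under <text>.
direct_children = [t.name for t in text_tag.find_all(recursive=False)]

# Same result via .contents, skipping the whitespace string between tags.
via_contents = [c.name for c in text_tag.contents if isinstance(c, Tag)]

print(all_descendants)  # ['p', 'b', 'div', 'font']
print(direct_children)  # ['p', 'div']
print(via_contents)     # ['p', 'div']
```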


Answer 1
1 (accepted), 2019-06-26 05:49:46

Answer 2
0, 2019-06-26 05:47:20
