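How to get the <text> tag from an HTML document using Beautiful Soup
How can I use Beautiful Soup to get the <text> tag from an HTML document, in this case the Abbott Laboratories 10-K filing?
I want to extract the tag names of all children of the <text></text> tag, using the code below: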
from bs4 import BeautifulSoup
import urllib.request
url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'
htmlpage = urllib.request.urlopen(url)
soup = BeautifulSoup(htmlpage, "html.parser")
all_text = soup.find('text')
all_tags = all_text.contents
all_tags = [x.name for x in all_tags if x.name is not None]
print(all_tags)
However, the output I get from the code above is ['html'].
Expected output:
['p','p','p','p','p','p','div','div','font','font', etc......]
You can use a CSS selector (this prints every descendant of the <text> tag, not only its direct children):
for child in all_text.select('text *'):
    print(child.name, end=' ')
Prints:
br p font font b p font b br p font b div div ...
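The ['html'] result in the question is consistent with .contents returning only the direct children of a tag. Here is a minimal offline sketch of that behaviour. The SAMPLE markup is a hypothetical stand-in and assumes the filing nests a full <html> element inside <text>, which is what the question's output suggests. It is not the real document.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for the filing: a complete <html> element nested in <text>.
SAMPLE = (
    "<document><text><html><body>"
    "<p>One</p><div>Two</div>"
    "</body></html></text></document>"
)

soup = BeautifulSoup(SAMPLE, "html.parser")
text_tag = soup.find("text")

# .contents lists direct children only, so just the wrapping <html> shows up.
direct = [node.name for node in text_tag.contents if isinstance(node, Tag)]

# The descendant selector walks the whole subtree below <text>.
nested = [tag.name for tag in soup.select("text *")]

print(direct)
print(nested)
```

With html.parser the nested <html> stays in place as the single child of <text>, so anything deeper has to be reached through descendants.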
Edit: To print only the direct children of the <text> tag, you can use:
from bs4 import BeautifulSoup
import requests
url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
for child in soup.select('text > *'):
    print(child.name, end=' ')
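To see the difference between the two selectors without downloading the filing, here is a small sketch on an inline snippet. The SAMPLE markup and the tag_names helper are made up for illustration. Note that the snippet above also switches the parser from html.parser to lxml, and a different parser may build a different tree from the same file.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, much smaller than the real filing.
SAMPLE = (
    "<document><text>"
    "<p><font><b>Title</b></font></p>"
    "<div><p>Body</p></div>"
    "</text></document>"
)


def tag_names(markup, selector):
    """Return the names of all tags matched by a CSS selector (illustrative helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    return [tag.name for tag in soup.select(selector)]


# 'text *' matches every descendant of <text>, in document order.
descendants = tag_names(SAMPLE, "text *")

# 'text > *' matches only the tags sitting directly under <text>.
direct_children = tag_names(SAMPLE, "text > *")

print(descendants)      # ['p', 'font', 'b', 'div', 'p']
print(direct_children)  # ['p', 'div']
```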
Replace this part of your code:
all_tags = all_text.contents
all_tags = [x.name for x in all_tags if x.name is not None]
print(all_tags)
with:
all_tags = [x.name for x in all_text.findChildren() if x.name is not None]
print(all_tags)
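Keep in mind that findChildren() is an older alias for find_all(), and by default it searches recursively, so it returns every descendant tag rather than only the direct children. If you only want the direct children, passing recursive=False is one option. A rough sketch on a made-up snippet (SAMPLE is hypothetical, not the filing):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only to compare the two calls.
SAMPLE = "<text><p><font>Title</font></p><div><p>Body</p></div></text>"

soup = BeautifulSoup(SAMPLE, "html.parser")
text_tag = soup.find("text")

# Default search is recursive: every tag below <text> is returned.
all_descendants = [tag.name for tag in text_tag.find_all(True)]

# recursive=False limits the search to direct children of <text>.
direct_only = [tag.name for tag in text_tag.find_all(True, recursive=False)]

print(all_descendants)  # ['p', 'font', 'div', 'p']
print(direct_only)      # ['p', 'div']
```

Since find_all() only returns tags, the "if x.name is not None" filter is not needed with it.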