Extracting text with parent tag type from HTML using Python
I'm looking to extract text and element type from some HTML. For example:
<div>
some text
<h1>some header</h1>
some more text
</div>
This should give:
[{'tag':'div', 'text':'some text'}, {'tag':'h1', 'text':'some header'}, {'tag':'div', 'text':'some more text'}]
How can I parse through the HTML to extract this information?
I've tried using BeautifulSoup and am able to extract the information for one level in the HTML, like this:
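from bs4 import BeautifulSoup

# `html` is the markup string to parse.
soup = BeautifulSoup(html, features='html.parser')
for child in soup.findChildren(recursive=False):
    print(child.name)
    for c in child.contents:
        # Bare text nodes have no tag name, so c.name is None for them.
        print(c.name)
        print(c.text)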
Which gives the following output:
div
None
text here
h1
some header
None
more text here
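The None lines in this output come from bare text nodes, which have no tag name of their own. As a minimal sketch (the one-line `html` string and the `pairs` list are made up for illustration), each text node can be attributed to its enclosing tag through `.parent`:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Illustrative input, written on one line to keep whitespace simple.
html = "<div>text here<h1>some header</h1>more text here</div>"
soup = BeautifulSoup(html, features="html.parser")
div = soup.find("div")

pairs = []
for node in div.contents:
    if isinstance(node, Tag):
        # Child element: report its text under its own tag name.
        pairs.append((node.name, node.get_text(strip=True)))
    elif isinstance(node, NavigableString):
        # Bare text node: .name is None, so use the parent's tag name instead.
        pairs.append((node.parent.name, str(node).strip()))

print(pairs)
# [('div', 'text here'), ('h1', 'some header'), ('div', 'more text here')]
```

This only looks at the direct contents of one element; it does not descend into deeper nesting.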
Using lxml and recursion, you can do the following:
import lxml.html

text = '''<div>
some text
<h1>some header</h1>
some more text
</div>
'''

def display(item):
    print('item:', item)
    print('tag :', item.tag)
    # .text and .tail are None when there is no text, so fall back to ''.
    print('text:', (item.text or '').strip())
    tail = (item.tail or '').strip()
    if tail:
        # Text after an element's closing tag belongs to the parent element.
        print('tail:', tail, '| parent:', item.getparent().tag)
    print('---')
    for child in item.getchildren():
        display(child)

soup = lxml.html.fromstring(text)
display(soup)
Which gives:
item: <Element div at 0x7f2b0ed4b6d0>
tag : div
text: some text
---
item: <Element h1 at 0x7f2b0ed3cef0>
tag : h1
text: some header
tail: some more text | parent: div
---
It treats "some more text" as the tail of h1, but you can use getparent() to assign it to div.
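To make the text/tail split concrete, here is a small sketch (the one-line input string is an assumption chosen to avoid stray whitespace) that inspects the attributes directly:

```python
import lxml.html

# A single top-level <div>, so fromstring() returns the div element itself.
div = lxml.html.fromstring(
    "<div>some text<h1>some header</h1>some more text</div>"
)
h1 = div[0]  # first child element of the div

print(div.text)            # some text       (text before the first child)
print(h1.text)             # some header     (text inside the h1)
print(h1.tail)             # some more text  (text after </h1>, stored on h1)
print(h1.getparent().tag)  # div             (the element the tail belongs to)
```

In other words, lxml stores trailing text on the preceding sibling element, and `getparent()` recovers the tag that visually contains it.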
After a small modification:
import lxml.html

text = '''<div>
some text
<h1>some header</h1>
some more text
</div>
'''

results = []

def convert(item):
    # .text and .tail are None when there is no text, so fall back to ''.
    results.append({'tag': item.tag, 'text': (item.text or '').strip()})
    tail = (item.tail or '').strip()
    if tail:
        # The tail sits after the closing tag, so it belongs to the parent.
        results.append({'tag': item.getparent().tag, 'text': tail})
    for child in item.getchildren():
        convert(child)

soup = lxml.html.fromstring(text)
convert(soup)
print(results)
it gives the result:
[
{'tag': 'div', 'text': 'some text'},
{'tag': 'h1', 'text': 'some header'},
{'tag': 'div', 'text': 'some more text'}
]
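One possible variation, offered as a sketch rather than a definitive implementation: a generator that avoids the global list, skips empty text, and emits each tail after the element's children so that deeper nesting stays in document order. The name `iter_text` and the nested `<b>` input are invented for illustration, and the sketch assumes the markup contains no comments or processing instructions.

```python
import lxml.html

def iter_text(element):
    """Yield {'tag', 'text'} dicts in document order, skipping empty text."""
    text = (element.text or "").strip()
    if text:
        yield {"tag": element.tag, "text": text}
    for child in element:
        # Everything inside the child comes first...
        yield from iter_text(child)
        # ...then the text that follows it, which belongs to `element`.
        tail = (child.tail or "").strip()
        if tail:
            yield {"tag": element.tag, "text": tail}

root = lxml.html.fromstring(
    "<div>some text<h1>some <b>bold</b> header</h1>some more text</div>"
)
results = list(iter_text(root))
for row in results:
    print(row)
# {'tag': 'div', 'text': 'some text'}
# {'tag': 'h1', 'text': 'some'}
# {'tag': 'b', 'text': 'bold'}
# {'tag': 'h1', 'text': 'header'}
# {'tag': 'div', 'text': 'some more text'}
```

Handling the tail in the parent's loop also means `getparent()` is not needed.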
I managed to get it working now using BeautifulSoup as well:
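from bs4 import BeautifulSoup

def sanitize(element):
    # Collapse newlines and runs of spaces into single spaces.
    element = element.replace('\n', ' ')
    while '  ' in element:
        element = element.replace('  ', ' ')
    return element.strip()

def parse(soup, tag):
    # `tag` is kept for the original call signature; it is not used below.
    for child in soup.findChildren(recursive=False):
        name = child.name
        for content in child.contents:
            if not content.name:
                # Bare text node: attribute it to the enclosing tag.
                yield sanitize(content.text), name
            else:
                # Child tag: report its text under its own name. Deeper nesting
                # is not descended into (the original bare parse(content, name)
                # call built a generator that was never consumed).
                yield sanitize(content.text), content.name

html = """
<div>
text here
<h1>some header</h1>
more text here
</div>
"""
soup = BeautifulSoup(html, features='html.parser')
print(list(parse(soup, 'html')))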
which gives:
[('text here', 'div'), ('some header', 'h1'), ('more text here', 'div')]
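For markup nested more than one level deep, an alternative sketch is to walk every text node and read the tag name from its parent. This is a different technique from the generator above, not a fix to it; the nested `<b>` in the sample input and the `pairs` name are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div>
  text here
  <h1>some <b>bold</b> header</h1>
  more text here
</div>
"""
soup = BeautifulSoup(html, features="html.parser")

pairs = []
# find_all(string=True) returns every text node in document order.
for node in soup.find_all(string=True):
    # Collapse all whitespace runs; whitespace-only nodes become ''.
    text = " ".join(node.split())
    if text:
        pairs.append((text, node.parent.name))

print(pairs)
# [('text here', 'div'), ('some', 'h1'), ('bold', 'b'),
#  ('header', 'h1'), ('more text here', 'div')]
```

Because each string is paired with its immediate parent, no recursion is needed; whitespace-only nodes between tags are skipped by the `if text` check.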