Extract text with its parent tag type from HTML using Python

Question

I'm looking to extract text and element type from some HTML. For example:

<div>
    some text
    <h1>some header</h1>
    some more text
</div>

Should give:

[{'tag':'div', 'text':'some text'}, {'tag':'h1', 'text':'some header'}, {'tag':'div', 'text':'some more text'}]

How can I parse through the HTML to extract this information?

I've tried using BeautifulSoup and am able to extract the information for one level in the HTML, like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, features='html.parser')

for child in soup.findChildren(recursive=False):
    print(child.name)
    for c in child.contents:
        print(c.name)
        print(c.text)

Which gives the following output:

div
None
   text here

h1
some header
None
  more text here

Answer 1

Using lxml and recursion I can do this:

text = '''<div>
    some text
    <h1>some header</h1>
    some more text
</div>
'''

import lxml.html

def display(item):
    print('item:', item)
    print('tag :', item.tag)
    # .text and .tail are None when there is no text, so fall back to ''
    print('text:', (item.text or '').strip())
    tail = (item.tail or '').strip()
    if tail:
        print('tail:', tail, '| parent:', item.getparent().tag)

    print('---')

    # iterating an element yields its children (getchildren() is deprecated)
    for child in item:
        display(child)

soup = lxml.html.fromstring(text)

display(soup)

Which gives:

item: <Element div at 0x7f2b0ed4b6d0>
tag : div
text: some text
---
item: <Element h1 at 0x7f2b0ed3cef0>
tag : h1
text: some header
tail: some more text | parent: div
---

It treats some more text as the tail of h1, but you can use getparent() to assign it to div.
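To make the text/tail split concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml (my substitution, so it runs without extra packages); lxml follows the same ElementTree model for .text and .tail. Note that ElementTree has no getparent(), which is an lxml addition.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring('<div>some text<h1>some header</h1>some more text</div>')
h1 = div[0]

print(repr(div.text))  # 'some text': text before the first child
print(repr(h1.text))   # 'some header'
print(repr(h1.tail))   # 'some more text': follows </h1> but sits inside <div>
print(repr(div.tail))  # None: nothing follows the root element
```

So any text that comes after a closing tag is stored on that element as its tail, even though it logically belongs to the parent.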

After a small modification:

text = '''<div>
    some text
    <h1>some header</h1>
    some more text
</div>
'''

import lxml.html

results = []

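def convert(item):
    # .text and .tail are None when there is no text, so fall back to ''
    results.append({'tag': item.tag, 'text': (item.text or '').strip()})

    # iterating an element yields its children (getchildren() is deprecated)
    for child in item:
        convert(child)

    # the tail follows the closing tag, so add it after the children
    # to keep the results in document order
    tail = (item.tail or '').strip()

    if tail:
        results.append({'tag': item.getparent().tag, 'text': tail})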
        
soup = lxml.html.fromstring(text)

convert(soup)

print(results)

It gives the result:

[
   {'tag': 'div', 'text': 'some text'}, 
   {'tag': 'h1', 'text': 'some header'}, 
   {'tag': 'div', 'text': 'some more text'}
]
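If you would rather not rely on a module-level list or on getparent(), a variant along these lines might work. This is only a sketch: extract_text is a name I made up, and I use the standard library's xml.etree.ElementTree so it runs without lxml, which means the input has to be well-formed XML. It only touches .tag, .text, .tail and iteration, so it should accept lxml elements too as long as the tree contains no comments.

```python
import xml.etree.ElementTree as ET

def extract_text(element):
    """Return [{'tag': ..., 'text': ...}] entries in document order."""
    results = []
    text = (element.text or '').strip()
    if text:
        results.append({'tag': element.tag, 'text': text})
    for child in element:
        results.extend(extract_text(child))
        # A child's tail is text that belongs to *this* element.
        tail = (child.tail or '').strip()
        if tail:
            results.append({'tag': element.tag, 'text': tail})
    return results

root = ET.fromstring('''<div>
    some text
    <h1>some header</h1>
    some more text
</div>''')

print(extract_text(root))
```

Handling each tail inside the parent's loop means the parent tag is already known, and empty text nodes are skipped instead of producing entries with an empty string.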

Answer 2

I managed to get it working now using BeautifulSoup as well:

from bs4 import BeautifulSoup

def sanitize(element):
    element = element.replace('\n',' ')
    while '  ' in element:
        element = element.replace('  ', ' ')
    return element.strip()

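def parse(soup, tag):
    # `tag` is unused; it is kept so the call below stays unchanged
    for child in soup.findChildren(recursive=False):
        name = child.name
        for content in child.contents:
            if not content.name:
                # bare text node: attribute it to the enclosing tag
                yield sanitize(content.text), name
            else:
                # child tag: content.text also includes the text of any
                # nested descendants, so deeper levels are flattened
                yield sanitize(content.text), content.name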

html = """
<div>
    text here
    <h1>some header</h1>
    more text here
</div>
"""

soup = BeautifulSoup(html, features='html.parser')
list(parse(soup, 'html'))

which gives:

[('text here', 'div'), ('some header', 'h1'), ('more text here', 'div')]
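For arbitrarily deep nesting, one option that needs no third-party package is to keep a stack of open tags with the standard library's html.parser. This is a rough sketch of a different technique, not the BeautifulSoup approach above; TextWithParent and VOID_TAGS are names I invented, and the void-tag set is simply the elements I assume never get a closing tag.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not be pushed onto the stack.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'}

class TextWithParent(HTMLParser):
    """Collect (text, tag) pairs, where tag is the innermost open element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = ' '.join(data.split())  # collapse whitespace, like sanitize()
        if text and self.stack:
            self.pairs.append((text, self.stack[-1]))

html = """
<div>
    text here
    <h1>some header</h1>
    more text here
</div>
"""

parser = TextWithParent()
parser.feed(html)
parser.close()
print(parser.pairs)
```

Because the top of the stack is always the innermost open element, each run of text is attributed to its direct parent no matter how deep it sits.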

Answer 1: 1, accepted, 2022-04-20 20:38:11

Answer 2: 1, 2022-04-20 21:41:16
