How to extract HTML between two different tags in Python?
I have the following HTML:
<h2>blah</h2>
html content to extract
(here can come tags, nested structures too, but no top-level h2)
<h2>other blah</h2>
Can I extract the content without using string.split("<h2>") in Python?
(Say, with BeautifulSoup or some other library?)
With BeautifulSoup, use the .next_siblings iterable to get the text following a tag:
>>> from bs4 import BeautifulSoup, NavigableString
>>> from itertools import takewhile
>>> sample = '<h2>blah</h2>\nhtml content to extract\n<h2>other blah</h2>'
>>> soup = BeautifulSoup(sample, 'html.parser')
>>> print(''.join(takewhile(lambda e: isinstance(e, NavigableString), soup.h2.next_siblings)))
html content to extract
This collects the text nodes that directly follow the soup.h2 element, stopping at the first tag it meets, and joins them into one string.
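If the content between the headings can itself contain tags, the isinstance check above ends the run at the first nested element. Here is a minimal sketch of a variant, assuming the goal is to keep that nested markup as well: it stops only at the next h2 and serializes every node in between. The helper name content_until_next_h2 and the sample string are illustrative and not part of the original answer.

```python
from itertools import takewhile

from bs4 import BeautifulSoup, Tag


def content_until_next_h2(html):
    """Return the markup between the first <h2> and the next sibling <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    first = soup.h2
    if first is None:
        return ""
    # Keep text nodes and any tag that is not an <h2>; stop at the next <h2>.
    nodes = takewhile(
        lambda e: not (isinstance(e, Tag) and e.name == "h2"),
        first.next_siblings,
    )
    return "".join(str(node) for node in nodes)


sample = "<h2>blah</h2>\ntext <div>nested <b>tag</b></div>\n<h2>other blah</h2>"
print(content_until_next_h2(sample))
```

Note that str() on a text node returns the parsed text, so character entities in the source are not re-escaped in the result.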
Here is some test code using HTQL from http://htql.net:
sample="""<h2>blah</h2>
html content to extract
<div>test</div>
<h2>other blah<h2>
"""
import htql
htql.query(sample, "<h2 sep excl>2")
# [('\n html content to extract \n <div>test</div>\n ',)]
htql.query(sample, "<h2 sep> {a=<h2>:tx; b=<h2 sep excl>2 | a='blah'} ")
# [('blah', '\n html content to extract \n <div>test</div>\n ')]
Let me share a somewhat more robust solution:
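import bs4

def get_chunk_after_tag(tag):
    """Return the markup following `tag` up to the next sibling tag of the same name.

    `tag` is a tag element in a bs4 soup.
    """
    result = ''
    for elem in tag.next_siblings:
        # Stop at the next sibling tag with the same name (e.g. the next <h2>).
        if isinstance(elem, bs4.Tag) and elem.name == tag.name:
            break
        result += str(elem)
    return result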
This extracts the content from one <hX> tag up to the next <hX> tag. It is easily modified to extract the content between two different tags.
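As one possible way to make that modification, the sketch below takes the name of the closing tag as a parameter instead of reusing tag.name. The function name get_chunk_between, the stop_name parameter, and the sample markup are assumptions made for illustration.

```python
import bs4


def get_chunk_between(tag, stop_name):
    """Serialize the siblings after `tag` up to the first sibling tag named `stop_name`."""
    parts = []
    for elem in tag.next_siblings:
        # Stop as soon as a sibling tag with the requested name appears.
        if isinstance(elem, bs4.Tag) and elem.name == stop_name:
            break
        parts.append(str(elem))
    return "".join(parts)


html = "<h2>title</h2><p>first</p><p>second</p><hr/><p>footer</p>"
soup = bs4.BeautifulSoup(html, "html.parser")
chunk = get_chunk_between(soup.h2, "hr")
print(chunk)
```

If no sibling named stop_name exists, the loop simply runs to the end of the parent, so everything after the starting tag is returned.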