How to find an HTML element using Beautiful Soup and regex strings
I am trying to find the following <li> element in an HTML document using Python 3, Beautiful Soup and regex strings.
<li style="text-indent:0pt; margin-top:0pt; margin-bottom:0pt;" value="394">KEANE J.
The plaintiff is a Sri Lankan national of Tamil ethnicity. While he was a
passenger on a vessel travelling from India to
Australia, that vessel ("the
Indian vessel") was intercepted by an Australian border protection vessel ("the
Commonwealth ship")
in Australia's contiguous
zone<span class="sup"><b><a name="fnB313" href="http://www.austlii.edu.au/au/cases/cth/HCA/2015/1.html#fn313">[313]</a></b></span>.
</li>
I have tried using the following find_all call, which returns an empty list.
html.find_all('li', string='KEANE J.')
I have also tried the find function with a regex, which returns None:
html.find('li', string=re.compile(r'^KEANE\sJ\.\s'))
How would I find this element in the HTML document?
Does it have something to do with the element present?
Absolutely. In this case, aside from the text node, the li element has other children. This is documented in the .string paragraph:
If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None
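The quoted rule can be checked with a small sketch. The markup below is a hypothetical, shortened stand-in for the document in the question, and the built-in "html.parser" backend is assumed; it only illustrates how .string behaves for a tag with one child versus several.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup (not the real page): the first <li> holds
# only text, the second holds text plus a <span>, like the one in the question.
markup = (
    '<ul><li>KEANE J.</li>'
    '<li>KEANE J. zone<span class="sup">[313]</span>.</li></ul>'
)
soup = BeautifulSoup(markup, "html.parser")
plain, mixed = soup.find_all("li")

plain_string = plain.string  # a single text child, so .string is that text
mixed_string = mixed.string  # several children, so .string is None

print(repr(plain_string), repr(mixed_string))
```

Because a string filter combined with a tag name is compared against the tag's .string, the second kind of <li> cannot be matched that way.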
What you can do is to locate the text node itself and then get its parent:
li = html.find(string=re.compile(r'^KEANE\sJ\.\s')).parent
print(li)
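For a version that can be run on its own, here is a minimal sketch of the same idea. The markup is an abbreviated stand-in for the <li> in the question, the "html.parser" backend is an assumption, and the None check is an addition for the case where nothing matches.

```python
import re
from bs4 import BeautifulSoup

# Abbreviated stand-in for the <li> in the question (an assumption for illustration).
markup = (
    '<ol><li value="394">KEANE J.\n'
    'The plaintiff is a Sri Lankan national of Tamil ethnicity.'
    '<span class="sup"><b><a name="fnB313">[313]</a></b></span>.\n'
    '</li></ol>'
)
html = BeautifulSoup(markup, "html.parser")
pattern = re.compile(r"^KEANE\sJ\.\s")

# Filtering <li> tags by string compares against li.string, which is None here.
by_tag = html.find("li", string=pattern)

# Searching for the text node alone succeeds; .parent is the enclosing <li>.
text_node = html.find(string=pattern)
li = text_node.parent if text_node is not None else None

print(by_tag)
print(li.name, li["value"])
```

Guarding against None before reading .parent avoids an AttributeError when the pattern does not occur anywhere in the document.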