How to find an HTML element using Beautiful Soup and regex strings

I am trying to find the following <li> element in an HTML document using Python 3, Beautiful Soup, and regex strings.

<li style="text-indent:0pt; margin-top:0pt; margin-bottom:0pt;" value="394">KEANE J.
The plaintiff is a Sri Lankan national of Tamil ethnicity.  While he was a
passenger on a vessel travelling from India to
Australia, that vessel ("the
Indian vessel") was intercepted by an Australian border protection vessel ("the
Commonwealth ship")
in Australia's contiguous
zone<span class="sup"><b><a name="fnB313" href="http://www.austlii.edu.au/au/cases/cth/HCA/2015/1.html#fn313">[313]</a></b></span>. 
</li>

I have tried the following find_all call, which returns an empty list.

html.find_all('li', string='KEANE J.')

I have also tried the find function with a regex, which returns None:

html.find('li', string=re.compile(r'^KEANE\sJ\.\s'))

How would I find this element in the HTML document?

Does it have something to do with the nested element being present?

Absolutely. In this case, aside from the text node, the li element has other children. This is documented in the .string section of the Beautiful Soup documentation:

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None.
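To see the quoted rule in action, here is a minimal sketch. The two markup strings are shortened, hypothetical stand-ins for the <li> in the question, and the names single and mixed are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup standing in for the question's <li>.
single = BeautifulSoup("<li>KEANE J.</li>", "html.parser").li
mixed = BeautifulSoup(
    '<li>KEANE J.\nThe plaintiff<span class="sup">[313]</span>.</li>',
    "html.parser",
).li

# One child (a single text node): .string is that text.
print(repr(single.string))

# Text + <span> + text: more than one child, so .string is None.
print(repr(mixed.string))
print(len(mixed.contents))
```

Because the string argument is compared against a tag's .string when a tag name is also given, a tag like mixed is not matched by find('li', string=...).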

What you can do is locate the text node itself and then get its parent:

li = html.find(string=re.compile(r'^KEANE\sJ\.\s')).parent
print(li)
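For a version that runs end to end, the sketch below uses the same approach on a shortened copy of the markup. The markup string, the surrounding <ol>, the extra <li>, and the variable names are assumptions made for illustration. A None check is added because find returns None when nothing matches.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the document in the question (assumed markup).
markup = """<ol>
<li value="393">Another paragraph.</li>
<li value="394">KEANE J.
The plaintiff is a Sri Lankan national of Tamil ethnicity<span class="sup"><b><a name="fnB313" href="#fn313">[313]</a></b></span>.
</li>
</ol>"""

html = BeautifulSoup(markup, "html.parser")
pattern = re.compile(r"^KEANE\sJ\.\s")

# Tag-level search, as in the question: the target <li> has several
# children, so its .string is None and the pattern is never matched.
tag_level = html.find("li", string=pattern)

# Search the text nodes instead, then step up to the enclosing tag.
text_node = html.find(string=pattern)
li = text_node.parent if text_node is not None else None

print(tag_level)
print(li["value"] if li is not None else "not found")
```

The regex still anchors at the start of the text node, so it only matches an <li> whose first child is text beginning with "KEANE J." followed by whitespace.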
