
Searching for HTML elements based on a specific word in the element string

I'm trying to write a program that finds and replaces tags in certain specified elements using the Beautiful Soup module. However, I'm having trouble figuring out how to find these elements by searching for a specific word in the element's string. Assuming I can get my code to find the elements by that word, I would then unwrap each element's "p" tag and wrap its contents in a new "h1" tag.

Here's some example HTML as the input:

<p> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </p>
<p> Example#2  this element ignored </p>
<p> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </p>

Here's my code so far (searching by “ExampleStringWord#1”):

for h1_tag in soup.find_all(string="ExampleStringWord#1"):
    soup.p.wrap(soup.h1_tag("h1"))

Given the example HTML input above, I want the output to look like this:

<h1> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </h1>
<p> Example#2  this element ignored </p>
<h1> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </h1>

However, my code only finds elements whose string is exactly “ExampleStringWord#1”, and it excludes elements whose string contains any wording past that. I'm convinced that I will need regular expressions to find the elements that contain my specified word plus whatever wording follows it. However, I'm not very familiar with regular expressions, so I'm not sure how to approach this in conjunction with the BeautifulSoup module.
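The difference between an exact string match and a regular-expression match can be sketched with the standard `re` module alone. This is only an illustration: the `texts` list and the names `WORD`, `exact`, `first_word` and `by_regex` are made up for the example, and the pattern assumes the search word should appear at the start of the text, after optional whitespace.

```python
import re

WORD = "ExampleStringWord#1"  # the search word from the question

# Hypothetical element strings, similar to the ones in the example HTML.
texts = [
    " ExampleStringWord#1 needs to find this entire element ",
    " Example#2  this element ignored ",
    "ExampleStringWord#1",
]

# Exact comparison: only a string that equals the word is selected.
exact = [t for t in texts if t == WORD]

# Regular expression: optional leading whitespace, then the literal word.
# re.escape() protects characters in the word that could have a special
# meaning inside a pattern.
first_word = re.compile(r"^\s*" + re.escape(WORD))
by_regex = [t for t in texts if first_word.search(t)]

print(len(exact))     # 1
print(len(by_regex))  # 2
```

The exact comparison keeps only the last string, while the pattern also keeps the first one, which has more wording after the search word.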

Also, I've reviewed the Beautiful Soup documentation on passing in a regular expression as a filter ( https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-regular-expression ), but I've not been able to get it to work in my case. I've also reviewed other posts here about using regular expressions with Beautiful Soup, but I've not found anything that adequately addresses my issue. Any help appreciated!

What if you locate the p elements that contain a specified substring (note the re.compile() part) and then change each element's name to h1:

import re

from bs4 import BeautifulSoup

data = """
<body>
    <p> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </p>
    <p> Example#2  this element ignored </p>
    <p> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </p>
</body>
"""

soup = BeautifulSoup(data, "html.parser")
for p in soup.find_all("p", string=re.compile("ExampleStringWord#1")):
    p.name = 'h1'
print(soup)

Prints:

<body>
    <h1> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </h1>
    <p> Example#2  this element ignored </p>
    <h1> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </h1>
</body>
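One caveat, as far as I understand Beautiful Soup's behavior: the string argument is compared against a tag's .string, which is None when the paragraph contains nested tags, so such paragraphs would be skipped. The pattern above also matches the word anywhere in the text, not only at the start. A minimal sketch that handles both, assuming a function filter is acceptable (the names `SEARCH_WORD`, `FIRST_WORD` and `starts_with_word`, and the sample HTML, are made up for this example):

```python
import re

from bs4 import BeautifulSoup

SEARCH_WORD = "ExampleStringWord#1"
# Optional leading whitespace, then the literal search word.
FIRST_WORD = re.compile(r"^\s*" + re.escape(SEARCH_WORD))


def starts_with_word(tag):
    """Return True for <p> tags whose combined text begins with SEARCH_WORD."""
    return tag.name == "p" and FIRST_WORD.search(tag.get_text()) is not None


html = """
<body>
<p> ExampleStringWord#1 with <b>nested</b> markup </p>
<p> Example#2  this element ignored </p>
<p> mentions ExampleStringWord#1 later in the text </p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all(starts_with_word):
    p.name = "h1"  # rename in place; children and attributes are kept

# Names of the direct children of <body> after the rename.
print([tag.name for tag in soup.body.find_all(True, recursive=False)])
# ['h1', 'p', 'p']
```

Only the first paragraph is renamed: the second does not contain the word, and the third contains it but does not start with it.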
