Replace multiple strings with elements in an HTML document

I have multiple strings that I want to wrap in HTML tags within an HTML document. I want to leave the text the same, but replace each string with an HTML element containing that string.

Furthermore, some of the strings I want to replace contain other strings I want to replace. In these cases, I want to apply the substitution for the larger string and ignore that of the smaller string.

In addition, I only want to perform this substitution when those strings are contained fully within the same element.

Here's my replacement list.

replacement_list = [
    ('foo', '<span title="foo" class="customclass34">foo</span>'),
    ('foo bar', '<span id="id21" class="customclass79">foo bar</span>')
]

Given the following HTML:

<html>
<body>
<p>Paragraph contains foo</p>
<p>Paragraph contains foo bar</p>
</body>
</html>

I would want it transformed into this:

<html>
<body>
<p>Paragraph contains <span title="foo" class="customclass34">foo</span></p>
<p>Paragraph contains <span id="id21" class="customclass79">foo bar</span></p>
</body>
</html>

So far I've tried using the Beautiful Soup library and looping through my replacement list in order of decreasing string length. I can find my strings and replace them with other strings, but I can't work out how to insert HTML at those points, or whether there's a better way entirely. Trying to perform string substitution with a soup.new_tag object fails whether I convert it to a string or not.

EDIT: Realised the example I gave didn't even conform to my own rules; modified the example.

I think this is very close to what you are looking for. You can use soup.find_all(string=True) to get only the NavigableString elements and then do the replacement.

from bs4 import BeautifulSoup
html="""
<html>
<body>
<p>Paragraph contains foo</p>
<p>Paragraph contains foo bar</p>
</body>
</html>
"""
replacement_list = [
    ('foo', '<span title="foo" class="customclass34">foo</span>'),
    ('foo bar', '<span id="id21" class="customclass79">foo bar</span>')
]
soup=BeautifulSoup(html,'html.parser')
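for s in soup.find_all(string=True):
    # Try the longest keys first so that 'foo bar' wins over 'foo'
    for key, val in sorted(replacement_list, key=lambda item: len(item[0]), reverse=True):
        if key in s:
            new_s = s.replace(key, val)
            # Re-parse the substituted text so the markup becomes real nodes;
            # the built-in 'html.parser' does not add <html>/<body> wrappers
            s.replace_with(BeautifulSoup(new_s, 'html.parser'))
            break  # stop at the first (longest) match for this text node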
print(soup)

# Re-parse the whole document if you want the inserted spans treated as separate tags
soup=BeautifulSoup(str(soup),'html.parser')
print(soup.find_all('span'))

Outputs:

<html>
<body>
<p>Paragraph contains <span class="customclass34" title="foo">foo</span></p>
<p>Paragraph contains <span class="customclass79" id="id21">foo bar</span></p>
</body>
</html>

[<span class="customclass34" title="foo">foo</span>, <span class="customclass79" id="id21">foo bar</span>]
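The question mentions that substituting a soup.new_tag object into a string fails. That is expected, since a Tag has to be inserted into the tree rather than into a string. As a rough sketch of that alternative (the helper name `wrap_phrase` and its `attrs` argument are made up for illustration, and it handles only one phrase), the text node can be split around each match and a real span placed between the pieces:

```python
from bs4 import BeautifulSoup

def wrap_phrase(soup, phrase, attrs):
    """Hypothetical helper: wrap each occurrence of `phrase` that lies
    inside a single text node in a newly created <span>."""
    for node in soup.find_all(string=True):
        text = str(node)
        if phrase not in text:
            continue
        pieces = text.split(phrase)
        for i, piece in enumerate(pieces):
            if piece:
                # Plain text before or after a match stays plain text
                node.insert_before(piece)
            if i < len(pieces) - 1:
                span = soup.new_tag("span")
                for name, value in attrs.items():
                    span[name] = value
                span.string = phrase
                node.insert_before(span)
        # The original text node has been rebuilt piece by piece; remove it
        node.extract()

soup = BeautifulSoup("<p>Paragraph contains foo bar</p>", "html.parser")
wrap_phrase(soup, "foo bar", {"id": "id21", "class": "customclass79"})
print(soup)
```

Because the spans are created as nodes, no second parse of the document is needed afterwards, and text containing characters such as `<` is never re-interpreted as markup.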

I've found a solution for this.

I have to iterate through the HTML once for each different string I want to wrap HTML tags around. This seems inefficient, but I can't find a better way of doing it.

I've added a class to all the tags I'm inserting, which I use to check whether the string I'm trying to replace is part of a larger string that was already replaced.

This solution is also case-insensitive (it will wrap tags around the string 'fOo'), while preserving the case of the original text.

def html_update(input_html):
    from bs4 import BeautifulSoup
    import re

    soup = BeautifulSoup(input_html, 'html.parser')

    replacement_list = [
        ('foo', '<span title="foo" class="customclass34 replace">', '</span>'),
        ('foo bar', '<span id="id21" class="customclass79 replace">', '</span>')
    ]
    # Go through list in order of decreasing length
    replacement_list = sorted(replacement_list, key = lambda k: -len(k[0]))

    for item in replacement_list:
        replace_regex = re.compile(re.escape(item[0]), re.IGNORECASE)
        target = soup.find_all(string=replace_regex)
        for v in target:
            # You can use other conditions here, like (v.parent.name == 'a')
            # to not wrap the tags around strings within links
            if v.parent.has_attr('class') and 'replace' in v.parent['class']:
                # The match must be part of a larger string that was already replaced, so do nothing
                continue

            def replace(match):
                return '{0}{1}{2}'.format(item[1], match.group(0), item[2])

            new_v = replace_regex.sub(replace, v)
            v.replace_with(BeautifulSoup(new_v, 'html.parser'))
    return str(soup)
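One possible way to avoid a separate pass per string is to combine all the strings into a single regular expression, with the longest alternatives listed first so that 'foo bar' is preferred over 'foo' at the same position. The sketch below is only an illustration under some assumptions: the name `html_update_single_pass` and the dictionary layout are invented here, the keys must be lower-case, and, like the code above, it re-parses the substituted text, so it assumes the text nodes contain no literal `<` or `&`.

```python
import re
from bs4 import BeautifulSoup

# Assumed layout: lower-cased phrase -> (opening tag, closing tag)
REPLACEMENTS = {
    "foo": ('<span title="foo" class="customclass34">', "</span>"),
    "foo bar": ('<span id="id21" class="customclass79">', "</span>"),
}

def html_update_single_pass(input_html, replacements=REPLACEMENTS):
    # Longest phrases first: the regex engine tries alternatives left to right
    phrases = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)

    def wrap(match):
        # Look up the tags by the lower-cased match, but keep the original case
        open_tag, close_tag = replacements[match.group(0).lower()]
        return open_tag + match.group(0) + close_tag

    soup = BeautifulSoup(input_html, "html.parser")
    # find_all returns a list, so replacing nodes inside the loop is safe
    for node in soup.find_all(string=pattern):
        node.replace_with(BeautifulSoup(pattern.sub(wrap, node), "html.parser"))
    return str(soup)

print(html_update_single_pass("<p>fOo and foo bar</p>"))
```

Since each text node is rewritten exactly once, the marker class used to detect earlier replacements is not needed in this variant.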

When you are dealing with small files, a simple option is to read the file line by line, replace what you want in each line, and then write everything to a new file.

Assuming your input file is called test.html and the result is written to output.html:

replacement_list = {'foo': '<span title="foo" class="customclass34">foo</span>', 'foo bar': '<span id="id21" class="customclass79">foo bar</span>'}

with open('output.html', 'w') as dest:
    with open('test.html', 'r') as src:
        for line in src:  # read the source file line by line
            str_possible = []
            for string in replacement_list.keys():  # loop over all the strings you are looking for
                if string in line:  # check whether this string is in the line
                    str_possible.append(string)
            if len(str_possible) > 0:
                str_final = max(str_possible, key=len)  # take the longest matching string
                line = line.replace(str_final, replacement_list[str_final])

            dest.write(line)

I also suggest you read up on dictionaries in Python, which is the type I use for replacement_list.

Finally, this code works as long as each line contains at most one of the strings. If a line contains two, it needs to be adapted a bit, but this gives you the overall idea.
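For lines that contain more than one of the strings, one possible adaptation is to let a single regular expression pick the longest match at each position. This is a sketch rather than a drop-in fix: the function name `replace_line` is made up, and, like the line-based code above, it works on raw text, so it would also match inside tag names or attribute values.

```python
import re

replacement_list = {
    'foo': '<span title="foo" class="customclass34">foo</span>',
    'foo bar': '<span id="id21" class="customclass79">foo bar</span>',
}

# Longest keys first, so 'foo bar' is tried before 'foo' at each position
pattern = re.compile(
    '|'.join(re.escape(key) for key in sorted(replacement_list, key=len, reverse=True))
)

def replace_line(line):
    # Each match is replaced once; already substituted text is not rescanned
    return pattern.sub(lambda match: replacement_list[match.group(0)], line)

print(replace_line('Paragraph contains foo and foo bar\n'))
```

Calling replace_line(line) in place of the max(...) and line.replace(...) steps keeps the rest of the file loop unchanged.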
