Remove all occurrences in a string except the first occurrence

In Python, I'm looking to remove every occurrence of <html> from a string except the first one.

I also want to remove every occurrence of </html> from the string except the last one.

The tag can be uppercase (<HTML>), so the match needs to be case insensitive.

What is my best approach?

To remove all but the first occurrence of <html> from the string s, you can use the following code:

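substr = "<html>"
try:
    # Index just past the end of the first occurrence.
    first_occurrence = s.index(substr) + len(substr)
except ValueError:
    # No occurrence at all: leave s unchanged.
    pass
else:
    # Keep everything up to and including the first tag; strip the rest.
    s = s[:first_occurrence] + s[first_occurrence:].replace(substr, "")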

All but the last occurrence of </html> can be removed in a similar manner:

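substr = "</html>"
try:
    # Index where the last occurrence starts.
    last_occurrence = s.rindex(substr)
except ValueError:
    # No occurrence at all: leave s unchanged.
    pass
else:
    # Strip earlier occurrences; keep the last tag and what follows it.
    s = s[:last_occurrence].replace(substr, "") + s[last_occurrence:]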

You might want to replace the occurrences with a space rather than the empty string.
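
The two snippets above compare case-sensitively, so <HTML> would be left alone. One way to meet the case-insensitive requirement is to locate the tags with the re module instead. The following is a minimal sketch of that idea; the helper names keep_first and keep_last and the repl parameter are illustrative and not part of the original answer.

```python
import re

def keep_first(s, tag="<html>", repl=""):
    """Replace every case-insensitive match of `tag` except the first."""
    pattern = re.compile(re.escape(tag), re.IGNORECASE)
    m = pattern.search(s)
    if m is None:
        return s  # tag not present: nothing to do
    # Keep the text through the first match, clean up the remainder.
    # A function is used as the replacement so `repl` is taken literally.
    return s[:m.end()] + pattern.sub(lambda _: repl, s[m.end():])

def keep_last(s, tag="</html>", repl=""):
    """Replace every case-insensitive match of `tag` except the last."""
    pattern = re.compile(re.escape(tag), re.IGNORECASE)
    matches = list(pattern.finditer(s))
    if not matches:
        return s  # tag not present: nothing to do
    last = matches[-1]
    # Clean up everything before the last match, keep the rest as-is.
    return pattern.sub(lambda _: repl, s[:last.start()]) + s[last.start():]

s = "<HTML>a<html>b</Html>c</HTML>"
s = keep_last(keep_first(s))
print(s)  # <HTML>abc</HTML>
```

Passing repl=" " gives the space-instead-of-empty-string behaviour mentioned above.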

This solution uses two regexes. The first regex splits the entire file/string into three chunks:

  1. The first chunk (captured into group $1) is everything from the start of the string up to and including the first HTML start tag.
  2. The second chunk (captured into group $2) is everything after the first HTML start tag up to the start of the last HTML end tag.
  3. The third chunk (captured into group $3) is the last HTML end tag and everything that follows, up to the end of the file/string.
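
To make the three-way split concrete, here is a small sketch that applies a one-line form of the outer pattern to a made-up sample string and prints each captured group. The sample text and the variable names head, middle and tail are assumptions for illustration only.

```python
import re

# One-line form of the three-group outer pattern (no verbose comments).
p_outer = re.compile(r"^(.*?<html[^>]*>)(.*)(</html\s*>.*)$",
                     re.DOTALL | re.IGNORECASE)

# Hypothetical input containing a redundant, nested <html> element.
sample = "x<HTML lang=en>A<html>B</html>C</HTML>y"

m = p_outer.match(sample)
head, middle, tail = m.groups()
print(repr(head))    # 'x<HTML lang=en>'
print(repr(middle))  # 'A<html>B</html>C'
print(repr(tail))    # '</HTML>y'
```

The lazy .*? in the first group stops at the first start tag, while the greedy .* in the second group runs as far as it can, so the third group begins at the last end tag.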

The function first attempts to match the regex to the input text. If this matches, the contents of the outermost HTML element (which was previously captured in group 2) are then stripped of any HTML start and end tags using the second regex. The string is then reassembled using the three chunks (with the middle chunk having been stripped of HTML tags).

import re

def stripInnermostHTMLtags(text):
    '''Strip all but the outermost HTML start and end tags.'''
    # Regex to match outermost HTML element and its contents.
    p_outer = re.compile(r"""
        ^                 # Anchor to start of string.
        (.*?<html[^>]*>)  # $1: Outer HTML start tag.
        (.*)              # $2: Outer HTML element contents.
        (</html\s*>.*)    # $3: Outer HTML end tag.
        $                 # Anchor to end of string.
        """, re.DOTALL | re.VERBOSE | re.IGNORECASE)
    # Split text into outermost HTML tags and its contents.
    m = p_outer.match(text)
    if m:
        # Regex to match HTML element start or end tag.
        p_inner = re.compile("</?html[^>]*>", re.IGNORECASE)
        # Strip contents of any/all HTML start and end tags.
        contents = p_inner.sub("", m.group(2))
        # Put string back together stripped of inner HTML tags.
        text = m.group(1) + contents + m.group(3)
    return text

Note that this solution correctly handles any attributes that may be in the HTML start tags. Note also that this solution does NOT handle HTML tags having attributes with values containing the > character (but this should be very rare).
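
The limitation with > inside an attribute value can be seen on a single tag. The sketch below compares the tag pattern used in the function with a hypothetical variant that skips over quoted attribute values. That variant is not part of the original answer and is only one possible way to tighten the match, assuming attribute values are properly quoted.

```python
import re

# Made-up start tag with a > character inside an attribute value.
tag = '<html data-x="a>b">'

# Pattern from the answer: stops at the first > it encounters.
naive = re.compile(r"<html[^>]*>", re.IGNORECASE)

# Hypothetical variant: consume quoted values as whole units.
quoted = re.compile(r"""<html(?:[^>"']|"[^"]*"|'[^']*')*>""", re.IGNORECASE)

print(naive.match(tag).group())   # <html data-x="a>
print(quoted.match(tag).group())  # <html data-x="a>b">
```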
