How to split HTML text ignoring spaces in tags

I have HTML text like this, which I split on spaces:

myHTML = 'I like <a class="thing1 thing2">this thing</a>'
myHTMLarray = myHTML.split(' ')
>>>['I','like','<a','class="thing1','thing2">this','thing</a>']

I need to ignore the spaces in tags (anything between '<' and '>'). My desired result would be:

>>>['I','like','<a class="thing1 thing2">this','thing</a>']

Ideally, each element of the list would contain exactly one word of the text. Tags that carry no text of their own, such as a break tag or a span containing only a space, would then be attached to the previous word.

Basically, you want to ignore spaces inside tags. To do that, keep track of the opening and closing angle brackets of each tag and treat a space as a separator only when it occurs outside the brackets.

Once we have only significant spaces, we can detect space/word and word/space boundaries and extract all words using slices.

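def mysplit(html):
    """Yield whitespace-separated chunks of html, ignoring whitespace inside tags."""
    in_tag = False   # True while between '<' and '>'
    in_word = False  # True while inside a chunk that is being collected
    for i, ch in enumerate(html):
        if ch == '<':
            in_tag = True
        elif ch == '>':
            in_tag = False
        # Whitespace counts as a separator only outside of tags.
        space = ch.isspace() and not in_tag
        if not in_word and not space:
            # separator/word boundary: a new chunk starts here
            in_word = True
            begin = i
        elif in_word and space:
            # word/separator boundary: the chunk ends just before i
            in_word = False
            yield html[begin:i]
    # Emit the last chunk if the string does not end with a separator.
    if in_word:
        yield html[begin:]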

testhtml = 'I like <a class="thing1 thing2">this thing</a>'
print(list(mysplit(testhtml)))
# prints: ['I', 'like', '<a class="thing1 thing2">this', 'thing</a>']

Edit: I made a small change to the code posted originally to increase readability a little bit.
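The question also asks that tags without any text of their own (a lone break tag, or a span holding only a space) end up attached to the previous word. The splitter above does not do that; it yields such tags as separate elements. One possible post-processing step is sketched below. The names TAG, has_text and merge_tag_only are hypothetical helpers introduced here, not part of the code above, and rejoining with a single space is an assumption, since the original whitespace is no longer available once the string has been split.

```python
import re

# Hypothetical helpers for illustration; not part of the splitter above.
TAG = re.compile(r'<[^>]*>')  # anything between '<' and '>'

def has_text(token):
    """Return True if the token contains anything besides tags."""
    return bool(TAG.sub('', token).strip())

def merge_tag_only(tokens):
    """Attach tokens that consist only of tags to the previous token."""
    merged = []
    for tok in tokens:
        if merged and not has_text(tok):
            # Assumption: rejoin with a single space, because the
            # original whitespace was discarded by the split.
            merged[-1] += ' ' + tok
        else:
            merged.append(tok)
    return merged

# A token list of the kind the splitter yields for
# 'one <br> two <span> </span> three'
tokens = ['one', '<br>', 'two', '<span>', '</span>', 'three']
print(merge_tag_only(tokens))
# prints: ['one <br>', 'two <span> </span>', 'three']
```

A tag-only token at the very start of the input has no previous word to join, so this sketch leaves it as its own element.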

