Python's regex, ascii escape character to tags

Question

I have the following output from xterm:

text = '\x1b[0m\x1b[01;32mattr\x1b[0m\n\x1b[01;36mawk\x1b[0m\n\x1b[01;32mbasename\x1b[0m\n\x1b[01;32mbash\n\x1b[0many text'

I know that \x1b[0m removes all text attributes, \x1b[01m is for bold text, \x1b[32m is green text, and \x1b[01;32m is bold green text. How can I convert those escape sequences to my own tags? Like this:

\x1b[0m\x1b[01;32mattr --> <bold><green>attr</bold></green>

I want my text variable to become this:

text = '<bold><green>attr</bold></green>\n<bold><cyan>awk</bold></cyan>\n<bold><green>basename</bold></green>\n<bold><green>bash</bold></green>\nanytext'

Answer 1

import re

text = '\x1b[0m\x1b[01;32mattr\x1b[0m\n\x1b[01;36mawk\x1b[0m\n\x1b[01;32mbasename\x1b[0m\n\x1b[01;32mbash\n\x1b[0many text'

# dictionary mapping text attributes to tag names
fmt = {'01':'bold', '32m':'green', '36m': 'cyan'}
# regex that gets all text attributes, the text and any potential newline
# (raw string, so that \[ is not treated as an invalid string escape)
groups = re.findall(r'(\n?)\x1b\[((?:(?:0m|32m|01|36m);?)+)([a-zA-Z ]+)', text)
# iterate through the groups and build your new string
xml = []
for group in groups:
    g_text = group[2] # the text itself
    for tag in group[1].split(';'): # the text attributes 
        if tag in fmt:
            tag = fmt[tag]
        else:
            continue
        g_text = '<%s>%s</%s>' %(tag,g_text,tag)
    g_text = group[0] + g_text # add a newline if necessary
    xml.append(g_text)
xml_text = ''.join(xml)

print(xml_text)

<green><bold>attr</bold></green>
<cyan><bold>awk</bold></cyan>
<green><bold>basename</bold></green>
<green><bold>bash</bold></green>
any text

For a demo on the regex see this link: Debuggex Demo

Currently the regex assumes that the actual text contains only letters and spaces. Change the final group ([a-zA-Z ]+) to include any other characters your text may contain.
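One way to widen that group, sketched here under the assumption that the visible text never contains an escape character or a newline, is to match everything up to the next escape or line break. The helper name find_groups and the sample string are made up for this illustration.

```python
import re

# Assumption: the visible text never contains ESC (\x1b) or a newline,
# so "anything except those two characters" is a safe definition of the text.
PATTERN = re.compile(r'(\n?)\x1b\[((?:(?:0m|32m|01|36m);?)+)([^\x1b\n]+)')

def find_groups(text):
    """Hypothetical helper: return (newline, attributes, text) tuples."""
    return PATTERN.findall(text)

# Hypothetical sample with digits, dots, dashes and slashes in the names.
sample = '\x1b[01;32mfoo-bar_2.sh\x1b[0m\n\x1b[01;36m/usr/bin/awk\x1b[0m'
for newline, attrs, body in find_groups(sample):
    print(repr(newline), attrs, body)
```

The rest of the loop from the answer can stay as it is, since only the last capture group changed.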

Also, I'm assuming you have more text attributes than bold, green, and cyan. You will need to update the fmt dictionary with your other attributes and their mappings.
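As a rough sketch of what a larger mapping could look like: the numbers below are the standard SGR codes (1 bold, 4 underline, 7 reverse video, 30 to 37 foreground colors), while the tag names, the SGR_NAMES table and the wrap helper are choices made up for this example. Codes are converted to integers so that '1' and '01' hit the same entry.

```python
# Standard SGR codes mapped to tag names (the tag names are arbitrary).
SGR_NAMES = {
    1: 'bold', 4: 'underline', 7: 'reverse',
    30: 'black', 31: 'red', 32: 'green', 33: 'yellow',
    34: 'blue', 35: 'magenta', 36: 'cyan', 37: 'white',
}

def wrap(attrs, body):
    """Hypothetical helper: wrap body in one tag per known code in attrs.

    attrs is the parameter part of an escape sequence, e.g. '01;32m'.
    Unknown codes (including the reset code 0) are skipped.
    """
    for code in attrs.rstrip('m').split(';'):
        name = SGR_NAMES.get(int(code)) if code.isdigit() else None
        if name:
            body = '<%s>%s</%s>' % (name, body, name)
    return body

print(wrap('01;32m', 'attr'))   # <green><bold>attr</bold></green>
print(wrap('4;31m', 'error'))   # <red><underline>error</underline></red>
print(wrap('0m', 'plain'))      # plain
```

The alternation in the regex would also have to accept whichever new codes are added to the table.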

EDIT

@Caaarlos asked in the comments (below) to keep any ANSI code that doesn't appear in the fmt dictionary as is in the output:

import re

text = '\x1b[0m\x1b[01;32;35mattr\x1b[0;7m\n\x1b[01;36mawk\x1b[0m\n\x1b[01;32;47mbasename\x1b[0m\n\x1b[01;32mbash\n\x1b[0many text'

fmt = {'01':'bold', '32':'green', '36': 'cyan'}

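xml = []
active_tags = []  # tags opened so far and not yet closed
# split on the escape introducer; each non-empty chunk is "<codes>m<text>"
for group in re.split(r'\x1b\[', text):
    if group.strip():
        # maxsplit=1 so that digits followed by "m" inside the text are left alone
        codes, chunk = re.split(r'((?:\d+;?)+)m', group, maxsplit=1)[1:]
        not_found = []
        for tag in codes.split(';'):
            if tag in fmt:
                tag = fmt[tag]
                chunk = '<%s>%s' % (tag, chunk)
                active_tags.append(tag)
            elif tag == '0':
                # reset: close every tag that is still open
                for a_tag in active_tags[::-1]:
                    chunk = '</%s>%s' % (a_tag, chunk)
                active_tags = []
            else:
                not_found.append(tag)
        if not_found:
            # keep unknown codes in the output as a raw escape sequence
            chunk = '\x1b[%sm%s' % (';'.join(not_found), chunk)
        xml.append(chunk)
xml_text = ''.join(xml)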

print(repr(xml_text))

'\x1b[35m<green><bold>attr\x1b[7m</bold></green>\n<cyan><bold>awk</bold></cyan>\n\x1b[47m<green><bold>basename</bold></green>\n<green><bold>bash\n</bold></green>any text'

Note that the edited code above also handles cases where the tag isn't closed directly after the text.
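The same idea can also be written with re.sub and a callback, which leaves the text between escapes untouched. This is only a sketch of a variation, not the code above: the name ansi_to_tags is made up, tags come out in the order the codes appear (so bold opens before green), an empty parameter list is treated like the reset code 0, and anything still open at the end of the input gets closed.

```python
import re

SGR = re.compile(r'\x1b\[([\d;]*)m')

def ansi_to_tags(text, fmt):
    """Hypothetical helper: replace SGR escapes with opening/closing tags."""
    active = []  # tags opened and not yet closed, in opening order

    def replace(match):
        out = []
        unknown = []
        for code in match.group(1).split(';'):
            if code in fmt:
                out.append('<%s>' % fmt[code])
                active.append(fmt[code])
            elif code in ('', '0'):
                # reset: close everything, innermost tag first
                out.extend('</%s>' % tag for tag in reversed(active))
                active.clear()
            else:
                unknown.append(code)
        if unknown:
            # keep codes missing from fmt as a raw escape sequence
            out.insert(0, '\x1b[%sm' % ';'.join(unknown))
        return ''.join(out)

    result = SGR.sub(replace, text)
    # close anything still open at the end of the input
    return result + ''.join('</%s>' % tag for tag in reversed(active))

fmt = {'01': 'bold', '32': 'green', '36': 'cyan'}
# The reset arrives two lines after the attributes were set.
print(repr(ansi_to_tags('\x1b[01;32mbash\nmore\x1b[0m done', fmt)))
# '<bold><green>bash\nmore</green></bold> done'
```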

1 answer

solution1
1 ACCEPTED 2016-12-14 15:51:12
