Disallow regular expression conversion between some tags

Question

Data

I try to write converter Markdown to BBCode for Sublime Text.
My full code — https://github.com/Kristinita/1Sasha1MarkdownNoBBCode/blob/master/MarkdownToBBCode.py :

#
#   Base on codes:
#   https://gist.github.com/sma/1513929
#   https://github.com/m3mnoch/MarkdownToBBCode/blob/master/MarkdownToBBCode.py
#


import sublime, sublime_plugin
import re, sys

class MarkdowntobbcodeCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        allcontent = sublime.Region(0, self.view.size())

        regionString = self.markdown_to_bbcode(str(self.view.substr(allcontent)))
        self.view.replace(edit, allcontent, regionString)

    def markdown_to_bbcode(self,s):

        def translate(p="%s", g=1):
            def inline(m):
                s = m.group(g)
                #
                # Headers
                #
                s = re.sub(r"^#\s+(.*?)\s*$", "[h1]\\1[/h1]", s)   # # Header first level
                s = re.sub(r"^##\s+(.*?)\s*$", "[h2]\\1[/h2]", s)  # ## Header second level
                s = re.sub(r"^###\s+(.*?)\s*$", "[h3]\\1[/h3]", s) # ### Header third level
                s = re.sub(r"^####\s+(.*?)\s*$", "[h4]\\1[/h4]", s)# #### Header fourth level
                #
                # Lists
                #
                s = re.sub(r"(?m)^[-+*]\s+(.*)$", translate("№[list]\n[*]%s\n[/list]"), s) # + Marked + List
                s = re.sub(r"(?m)^\d+\.\s+(.*)$", translate("№[list=1]\n[*]%s\n[/list]"), s) # 1. Numbered 2. List
                #
                # Quote
                #
                s = re.sub(r"^> (.*)$", "[quote]\\1[/quote]", s) # > Quote
                #
                # Thematic break
                #
                s = re.sub(r"^-{3}(\s*)$", "[hr]", s)
                return p % s
            return inline

        #
        # URL and images
        #
        s = re.sub(r"!\[(.*?)\]\((.*?)\)", "[img]\\2[/img]", s)   # ![IMG description](URL of image), alt attribute not supported in many forums
        s = re.sub(r"\[(.*?)\]\((.*?)\)", "[url=\\2]\\1[/url]", s)# [URL description](URL link)
        s = re.sub(r"<(https?:\S+)>", "[url]\\1[/url]", s)        # <URL>
        #
        # Code
        #
        s = re.sub(r"`{3}([^`]+)`{3}", "[code]\\1[/code]", s)# ```Multiline\n\code```
        s = re.sub(r"`([^`]+)`", "[code]\\1[/code]", s)      # `Code`
        s = re.sub(r"(?m)^ {4}(.*)$", "№[code]\\1[/code]", s)# Code fragment after 4 spaces
        s = re.sub(r"(?m)^\t(.*)$", "№[code]\\1[/code]", s)  # Code fragment after tab
        #
        # Bold and italic
        #
        s = re.sub(r"_{2}([\s\S]+?)_{2}", "[b]\\1[/b]", s)  # __Bold__
        s = re.sub(r"_([^_]+?)_", "[i]\\1[/i]", s)          # _Italic_ Needs hack (?<=\s), because _ symbol often use in URLs
        s = re.sub(r"\*{2}([\s\S]+?)\*{2}", "[b]\\1[/b]", s)# **Bold**
        s = re.sub(r"\*([^\*]+?)\*", "[i]\\1[/i]", s)       # *Italic*.
        #
        # Strikethrough text
        #
        s = re.sub(r"~{2}([\s\S]+?)~{2}", "[s]\\1[\s]", s)
        #
        # Dependencies. Not delete these lines!
        #
        s = re.sub(r"(?m)^((?!№).*)$", translate(), s)
        s = re.sub(r"(?m)^№\[", "[", s)
        s = re.sub(r"\[/code]\n\[code(=.*?)?]", "\n", s)
        s = re.sub(r"\[/list]\n\[list(=1)?]\n", "", s)
        s = re.sub(r"\[/quote]\n\[quote]", "\n", s)

        return s

Examples of conversation: https://github.com/Kristinita/1Sasha1MarkdownNoBBCode .
'Sasha' convert to [code]Sasha[/code] , *inaccessible height* convert to [i]inaccessible height[/i] and so on.

(For technical reasons I replace backtick to apostrophe — ' in some places of this question.)

Problem

Example markdown text

[Sasha fantastic](http://example.com#example_url)

```
Example* of code
```

*Sasha* the _best_!

Expected conversion

[url=http://example.com#example_url]Sasha fantastic[/url]

[code]
Example* of code
[/code]

[i]Sasha[/i] the [i]best[/i]!

Actual conversion

[url=http://example.com#example[i]url]Sasha fantastic[/url]

[code]
Example[i] of code
[/code]

[/i]Sasha* the [/i]best_!

If I have my URL or code symbols _ or * , conversion bad for me.

Expected behavior

Run URL conversion and code conversion.

s = re.sub(r"'{3}([^']+)'{3}", "[code]\\\\1[/code]", s)

and

s = re.sub(r"\\[(.*?)\\]\\((.*?)\\)", "[url=\\\\2]\\\\1[/url]", s)

Between [code][/code] and [url][/url] no more conversion any symbols.
If my markdown text no between [code][/code] and [url][/url] its convert to BBCode.

Question

How I can realize my expected behavior? I'm sorry, I'm not Python programmer, but I needed this plugin.

Thanks.

Answer 1

I fixed the mentioned problem and added some documentation. On GitHub, you may see the commit history on the pull request into your repository:

https://github.com/Kristinita/1Sasha1MarkdownNoBBCode/pull/2

I think there are more problems you do not mentioned here, but I am not sure anymore. If there are other problems, they could be noticed on the testing files test_*.md on the GitHub repository: https://github.com/evandrocoan/SublimeTextMarkdowToBBCode/

#
#   Base on codes:
#   https://gist.github.com/sma/1513929
#   https://github.com/m3mnoch/MarkdownToBBCode/blob/master/MarkdownToBBCode.py
#


import sublime, sublime_plugin
import re, sys

class MarkdowntobbcodeCommand( sublime_plugin.TextCommand ):
    """
        You cannot parse the whole conversion from Markdown to BB Code only with regex.

        http://stackoverflow.com/a/1732454/4934640
        http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

        Changelog:

        2.0.1
        Fixed several conversion bugs involving *Italic*, **Bold**, ***Bold-Italic***, etc. Issues:
        https://github.com/Kristinita/1Sasha1MarkdownNoBBCode/issues/1
        https://github.com/evandrocoan/SublimeTextStudio/issues/21

        2.0.0
        Fix issue `Disallow regular expression conversion between some tags`
        https://github.com/Kristinita/1Sasha1MarkdownNoBBCode/issues/1

        1.0.0
        Initial release by `Kristinita`.
    """

    codeBlock = []
    codeBlock.append( "```([^`]+)```"  ) # ```Multiline\n\code```
    codeBlock.append( "`([^`]+)`"      ) # `Code`
    codeBlock.append( "(?m)^\t(.*)$"   ) # Code fragment after tab
    codeBlock.append( "(?m)^ {4}(.*)$" ) # Code fragment after 4 spaces

    #
    # Notice: `alt` attribute not supported in many forums.
    #
    imageRegex      = "!\[(.*?)\]\((.*?)\)" # ![IMG description](URL of image)
    urlRegex        = "<(https?:\S+)>"      # <URL>
    urlAndLinkRegex = "\[(.*?)\]\((.*?)\)"  # [URL description](URL link)

    def run( self, edit ):

        allcontent      = sublime.Region( 0, self.view.size() )
        self.sourceCode = str( self.view.substr( allcontent ) )

        self.view.replace( edit, allcontent, self.markdown_to_bbcode() )

    def markdown_to_bbcode( self ):

        def translate( formattingRule = "%s", targetGroup = 1 ):
            """
                # Here is a closure in python:
                def makeInc(x):
                  def inc(y):
                     # x is "closed" in the definition of inc
                     return y + x

                return inc

                inc5 = makeInc(5)
                inc10 = makeInc(10)

                inc5 (5) # returns 10
                inc10(5) # returns 15

                @param formattingRule, is a string format for the `%` parsing.
                @param targetGroup   , is the group to retrieve from the `matchObject`.

                @return a function pointer.
            """

            # print( "( translate ) targetGroup: {0}, sourceCode: {1}".format( targetGroup, self.sourceCode ) )

            def inline( matchObject ):
                """
                    Recursive function called by the `re.sub` for every non-overlapping occurrence of the pattern.

                    @param matchObject, an `sre.SRE_Match object` match object.
                """

                self.sourceCode = matchObject.group( targetGroup )
                # print( "( inline ) Receiving sourceCode as: `{0}`".format( self.sourceCode ) )

                #
                # Headers
                #
                self.sourceCode = re.sub( r"^#\s+(.*?)\s*$"   , "[h1]\\1[/h1]", self.sourceCode ) # Header first level
                self.sourceCode = re.sub( r"^##\s+(.*?)\s*$"  , "[h2]\\1[/h2]", self.sourceCode ) ## Header second level
                self.sourceCode = re.sub( r"^###\s+(.*?)\s*$" , "[h3]\\1[/h3]", self.sourceCode ) ### Header third level
                self.sourceCode = re.sub( r"^####\s+(.*?)\s*$", "[h4]\\1[/h4]", self.sourceCode ) #### Header fourth level

                # + Marked + List
                self.sourceCode = re.sub( r"(?m)^[-+*]\s+(.*)$", translate("№[list]\n[*]%s\n[/list]")  , self.sourceCode )

                # 1. Numbered 2. List
                self.sourceCode = re.sub( r"(?m)^\d+\.\s+(.*)$", translate("№[list=1]\n[*]%s\n[/list]"), self.sourceCode )

                # > Quote
                self.sourceCode = re.sub(r"^> (.*)$", "[quote]\\1[/quote]", self.sourceCode)

                # Thematic break
                self.sourceCode = re.sub(r"^-{3}(\s*)$", "[hr]", self.sourceCode)

                # Format the `sourceCode` into to `formattingRule` with the `%` formatting rule.
                # print( "( inline ) Leaving   sourceCode as: `{0}`".format( formattingRule % self.sourceCode ) )
                return formattingRule % self.sourceCode

            #
            # The `inline` function is effectively an anonymous function here, since the variable scope in which
            # it was created is no longer accessible to the caller.
            #
            # See:
            # http://ynniv.com/blog/2007/08/closures-in-python.html
            # http://stackoverflow.com/questions/6629876/how-to-make-an-anonymous-function-in-python-without-christening-it
            #
            return inline

        #
        # It re-uses several regexes, then it is needed to parse this before all the others,
        # because it needs several hacks (?<=\s), as `_` symbol often use in URLs.
        #
        self.singleTagContextParser( r"\*\*[^ \n]([^\*]+?)[^ \n](?=\*\*)"              , "b", 2 ) # **Bold**
        self.singleTagContextParser( r"{0}[^ \n]([^{0}]+?)[^ \n](?={0})".format( "\*" ), "i", 1 ) # *Italic*
        self.singleTagContextParser( r"__[^ \n]([^_]+?)[^ \n](?=__)"                   , "b", 2 ) # __Bold__
        self.singleTagContextParser( r"{0}[^ \n]([^{0}]+?)[^ \n](?={0})".format( "_"  ), "i", 1 ) # _Italic_
        self.singleTagContextParser( r"~~[^ \n]([\s\S]+?)[^ \n](?=~~)"                 , "s", 2 ) # ~Strikethrough~

        #
        # Code
        #
        self.sourceCode = re.sub( r"{0}".format( self.codeBlock[ 0 ] ), "[code]\\1[/code]" , self.sourceCode )
        self.sourceCode = re.sub( r"{0}".format( self.codeBlock[ 1 ] ), "[code]\\1[/code]" , self.sourceCode )
        self.sourceCode = re.sub( r"{0}".format( self.codeBlock[ 2 ] ), "№[code]\\1[/code]", self.sourceCode )
        self.sourceCode = re.sub( r"{0}".format( self.codeBlock[ 3 ] ), "№[code]\\1[/code]", self.sourceCode )

        #
        # URL and images.
        #
        self.sourceCode = re.sub( r"{0}".format( self.imageRegex      ), "[img]\\2[/img]"    , self.sourceCode )
        self.sourceCode = re.sub( r"{0}".format( self.urlRegex        ), "[url]\\1[/url]"    , self.sourceCode )
        self.sourceCode = re.sub( r"{0}".format( self.urlAndLinkRegex ), "[url=\"\\2\"]\\1[/url]", self.sourceCode )

        #
        # Dependencies. Not delete these lines!
        # Clean the file using the flag symbol `№` indications.
        #

        # Enters on multiline mode `(?m)`, and asserts a line start `^` with `№` 0 or more times
        # not following it, until the end of the line `$`, i.e., passes a function pointer to
        # `inline`, and call it for every line which does not start with `№`.
        self.sourceCode = re.sub( r"(?m)^((?!№).*)$", translate(), self.sourceCode )

        self.sourceCode = re.sub( r"(?m)^№\["                , "[" , self.sourceCode )
        self.sourceCode = re.sub( r"\[/code]\n\[code(=.*?)?]", "\n", self.sourceCode )
        self.sourceCode = re.sub( r"\[/list]\n\[list(=1)?]\n", ""  , self.sourceCode )
        self.sourceCode = re.sub( r"\[/quote]\n\[quote]"     , "\n", self.sourceCode )

        return self.sourceCode

    def singleTagContextParser( self, regexExpression, bbCodeTag, replacementSize = 1 ):
        """
            Valid entrances are:
            _start without space between the _ and the first and last word.
            multilines are allowed. But one empty line get out the italic mode._

            But is does not have effect inside code blocks, neither URL's.
        """

        # self.sourceCode = re.sub( r"_([^_]+?)_"          , "[i]\\1[/i]", self.sourceCode )
        # for index in range( 0, len( self.sourceCode ) ):

        # Used to count how many iterations are necessary on the worst case scenario.
        matchesIterator = re.finditer( regexExpression, self.sourceCode )

        # The exclusion pattern to remove wrong blocks from being parsed.
        nextSearchPosition = 0
        exceptionRegex     = self.createRegexExceptoin()

        # Iterate until all the initial matches list to finish.
        for element in matchesIterator:

            # To perform a new search on the new updated string.
            match = re.search( regexExpression, self.sourceCode[ nextSearchPosition : ] )

            # Exit the parsing iteration when not more matches are found.
            if match is None:

                break

            # print( "( singleTagContextParser ) Match: {0}".format( match.group( 0 ) ) )

            startIndex = match.start( 0 )
            endIndex   = match.end( 0 )

            startIndex = startIndex + nextSearchPosition
            endIndex   = endIndex   + nextSearchPosition

            nextSearchPosition = startIndex + replacementSize
            # print( "nextSearchPosition: {0}".format( nextSearchPosition ) )

            if self.isWhereItMustNotToBe( startIndex, endIndex, exceptionRegex ):

                continue

            if endIndex + replacementSize > len( self.sourceCode ):

                self.sourceCode = self.sourceCode[ 0 : startIndex ] \
                                  + "[{0}]".format( bbCodeTag ) \
                                  + self.sourceCode[ startIndex + replacementSize : endIndex ] \
                                  + "[/{0}]".format( bbCodeTag )

            else:

                self.sourceCode = self.sourceCode[ 0 : startIndex ] \
                                  + "[{0}]".format( bbCodeTag ) \
                                  + self.sourceCode[ startIndex + replacementSize : endIndex ] \
                                  + "[/{0}]".format( bbCodeTag ) \
                                  + self.sourceCode[ endIndex + replacementSize : ]


    def createRegexExceptoin( self ):

        exceptionRegex  =  ""
        exceptionRegex +=  "(" + self.codeBlock[ 0 ]  + ")"
        exceptionRegex += "|(" + self.codeBlock[ 1 ]  + ")"
        exceptionRegex += "|(" + self.codeBlock[ 2 ]  + ")"
        exceptionRegex += "|(" + self.codeBlock[ 3 ]  + ")"
        exceptionRegex += "|(" + self.imageRegex      + ")"
        exceptionRegex += "|(" + self.urlRegex        + ")"
        exceptionRegex += "|(" + self.urlAndLinkRegex + ")"

        return exceptionRegex


    def isWhereItMustNotToBe( self, startIndex, endIndex, exceptionRegex ):
        """
            Valid entrances are:
            _start without space between the _ and the first and last word.
            multilines are allowed. But one empty line get out the italic mode._

            But is does not have effect inside code blocks, neither URL's.
        """

        matchesIterator = re.finditer( r"{0}".format( exceptionRegex ), self.sourceCode )

        for match in matchesIterator:

            # print( "( isWhereItMustNotToBe ) Match: {0}".format( match.group( 0 ) ) )

            matchStart = match.start( 0 )
            matchEnd   = match.end( 0 )

            if ( ( matchStart <= startIndex ) and ( startIndex <= matchEnd ) ) \
               or ( ( matchStart <= endIndex ) and ( endIndex <= matchEnd ) ):

                # print( "( isWhereItMustNotToBe ) Returning true." )
                return True

        return False

Disallow regular expression conversion between some tags

Question

Data

Problem

Example markdown text

Expected conversion

Actual conversion

Expected behavior

Question

1 answers

solution1
0 2016-11-17 00:00:10

Disallow regular expression conversion between some tags

Question

Data

Problem

Example markdown text

Expected conversion

Actual conversion

Expected behavior

Question

1 answers

solution1 0 2016-11-17 00:00:10

solution1
0 2016-11-17 00:00:10