Multi-line regex with overlapping matches

I'm working on a tool that parses files for CSS style declarations. It uses a very complicated regular expression that, besides the expected performance issues and a few minor bugs that aren't affecting me for now, is doing everything I'd like it to do except for one thing.

I have it matching all combinations of element names, classes, sub-classes, pseudo-classes, etc. However, when a line contains more than one declaration, I can only get it to match once. As an example, here is the kind of thing that is tripping me up at the moment:

td.class1, td.class2, td.class3
{
    background-color: #FAFAFA;
    height: 10px;
}

I can write an expression that matches all three selectors. However, because I am also capturing what follows them (the actual style info inside the braces), the entire block of text is consumed by the first match, and the engine resumes at the character after the block it just processed.
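
A minimal Python sketch of the symptom, for illustration only. The pattern `CONSUMING` is a deliberately simplified stand-in (an assumption, not the actual expression discussed here): it matches one selector and then consumes everything up to and including the closing brace, so the remaining selectors are never visited.

```python
import re

SAMPLE = """td.class1, td.class2, td.class3
{
    background-color: #FAFAFA;
    height: 10px;
}
"""

# Hypothetical simplified pattern: one selector, then the rest of the
# selector list and the whole { ... } block, all consumed by the match.
CONSUMING = re.compile(r"([\w.#-]+)[^{}]*\{([^{}]*)\}")

matches = CONSUMING.findall(SAMPLE)
selectors = [selector for selector, _body in matches]

# Only the first selector is reported; the search resumes after the "}".
print(selectors)
```

Because the match itself includes the block, there is nothing left for `td.class2` and `td.class3` to match against.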

Is there a way to accomplish this where each class will be a separate match and all will include the style info that follows as well? I know that I can modify my regex to match the whole line and then parse it for commas after I get my match, but I'd like to keep all my logic inside the expression itself if possible.

I can post the expression and/or the commented code I use to generate it if it's absolutely relevant to the answer, but the expression is huge/ugly (as all non-trivial regexes are) and the code is a bit lengthy.

You need a CSS parser, not a regex. You should probably read "Is there a CSS Parser for C#".

Here's a regex that works with your sample data:

@"([^,{}\s]+(?:\s+[^,{}\s]+)*)(?=[^{}]*(\{[^{}]+\}))"

The first part matches and captures a selector (td.class1) in group #1, then the lookahead skips over any remaining selectors and captures the associated style rules in group #2. The next match attempt starts where the lookahead started the previous time, so it matches the next selector (td.class2) and the lookahead captures the same block of rules again.

This won't handle @-rules or comments, but it works fine on the sample data you provided. I even checked it out on some real-world stylesheets and it did remarkably well.
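
For readers who want to try this outside .NET, here is a small sketch of the same pattern in Python. The names `SELECTOR_WITH_RULES` and `selector_rule_pairs` are made up for the example; the pattern text is the one from this answer, and the sample is the one from the question.

```python
import re

# Same pattern as above: group 1 is a selector, group 2 (captured inside
# the lookahead, so it is not consumed) is the rule block that follows.
SELECTOR_WITH_RULES = re.compile(
    r"([^,{}\s]+(?:\s+[^,{}\s]+)*)(?=[^{}]*(\{[^{}]+\}))"
)

SAMPLE = """td.class1, td.class2, td.class3
{
    background-color: #FAFAFA;
    height: 10px;
}
"""

def selector_rule_pairs(css):
    """Return one (selector, rule_block) pair per selector found in css."""
    return [(m.group(1), m.group(2)) for m in SELECTOR_WITH_RULES.finditer(css)]

pairs = selector_rule_pairs(SAMPLE)
for selector, block in pairs:
    # Collapse whitespace in the block so each pair prints on one line.
    print(selector, "->", " ".join(block.split()))
```

Each of the three selectors comes back as its own match, and all three carry the same rule block, because only group 1 is consumed by the match.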

Depending on the finer details of your regex engine, you may be able to do this by embedding capturing parentheses in a lookahead, i.e. something like:

\.(\w+)(?=.*?{([^}]*)})

I'd expect figuring out the meaning of the match groups to be quite an exercise.
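
To make the groups concrete, here is a hedged Python sketch of that idea. Two things are assumptions on my part rather than part of the expression above: the braces are escaped, and `re.DOTALL` is set, since in the question's sample the `{` sits on a later line than the selectors and `.` would otherwise stop at the newline. Group 1 is the class name without its dot, and group 2 is the text between the braces.

```python
import re

CLASS_WITH_BODY = re.compile(r"\.(\w+)(?=.*?\{([^}]*)\})", re.DOTALL)

SAMPLE = """td.class1, td.class2, td.class3
{
    background-color: #FAFAFA;
    height: 10px;
}
"""

results = []
for m in CLASS_WITH_BODY.finditer(SAMPLE):
    class_name = m.group(1)      # e.g. "class1" (no leading dot)
    body = m.group(2).strip()    # declarations between the braces
    results.append((class_name, body))

for class_name, body in results:
    print(class_name, "->", body.replace("\n", " "))
```

Note that the pattern keys on any `.` followed by word characters, so in a larger stylesheet a value such as `0.5em` inside one block would also match and pick up the body of the next block. That is part of why interpreting the groups gets tricky.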

This is not a good problem for regexes.

On the other hand, you only need a couple of passes to write a basic CSS parser, surely.

CSS syntax is just [some stuff], [open curly bracket], [some other stuff], [close curly bracket] after all.

You find those two chunks of stuff, you split the first one on commas and the second one on semicolons and you're pretty much done.
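
A rough Python sketch of those passes, under loud assumptions: the function name `parse_css` is invented for the example, and it only handles flat rules (no comments, no @-rules or nested blocks, no braces or semicolons inside strings or `url(...)` values).

```python
import re

def parse_css(css):
    """Return a list of (selectors, declarations) pairs for flat CSS rules."""
    rules = []
    # Pass 1: find each "[some stuff] { [some other stuff] }" chunk.
    for head, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        # Pass 2a: split the first chunk on commas to get the selectors.
        selectors = [s.strip() for s in head.split(",") if s.strip()]
        # Pass 2b: split the second chunk on semicolons, then split each
        # declaration at its first colon into a property name and a value.
        declarations = {}
        for decl in body.split(";"):
            name, sep, value = decl.partition(":")
            if sep:
                declarations[name.strip()] = value.strip()
        rules.append((selectors, declarations))
    return rules

SAMPLE = """td.class1, td.class2, td.class3
{
    background-color: #FAFAFA;
    height: 10px;
}
"""

parsed = parse_css(SAMPLE)
for selectors, declarations in parsed:
    for selector in selectors:
        print(selector, declarations)
```

Pairing every selector with the shared declarations afterwards, as the loop at the end does, gives the per-selector result the question asks for without any overlapping matches.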

I took an approach similar to the one AmbroseChapel described, and I needed it in AS3, so I'm sharing it in case it helps someone else. I tried to be thorough, and the comments step you through the process. I've tested it on some popular CSS boilerplate, among other things, and it works quite well. :) (This only lists selector names; it does not parse properties.)

    public function getSelectors( targetCSS:String, includeElements:Boolean = true ):ArrayCollection
    {

        var newSelectorCollection:ArrayCollection = new ArrayCollection();

        if( targetCSS == null || targetCSS == "" ) return newSelectorCollection;

        var newSelectors:Array = new Array();

        var elements:Array = new Array();
        var ids:Array = new Array();
        var classes:Array = new Array();

        // Remove comments
        var cssString:String = "";
        var commentParts:Array = targetCSS.split( "/*" );

        for( var c:int = 0; c < commentParts.length; c++ ){

            var comPart:String = commentParts[ c ] as String;

            var comTestArray:Array = comPart.split( "*/" );

            if( comTestArray.length > 1 ){

                comTestArray.shift();
                comPart = comTestArray.join( "" );

            }

            cssString += comPart;

        }

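        // Replace \r, \n and \t with spaces so that selectors on adjacent lines stay separate
        cssString = cssString.split( "\r" ).join( " " );
        cssString = cssString.split( "\n" ).join( " " );
        cssString = cssString.split( "\t" ).join( " " );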
        // Split at }
        var cssParts:Array = cssString.split( "}" );

        for( var i:int = 0; i < cssParts.length; i++ ){

            var cssPrt:String = cssParts[ i ] as String;

            // Detect nesting such as media queries by finding more than one {
            var nestingTestArray:Array = cssPrt.split( "{" );

            // If there is nesting split at { then drop index 0 and re-join with {
            if( nestingTestArray.length > 2 ){

                nestingTestArray.shift();
                cssPrt = nestingTestArray.join( "{" );

            }

            // Split at each item at {
            var cssPrtArray:Array = cssPrt.split( "{" );

            // Disregard anything after {
            cssPrt = cssPrtArray[ 0 ] as String;

            // Split at ,
            var selectorList:Array = cssPrt.split( "," );

            for( var j:int = 0; j < selectorList.length; j++ ){

                var sel:String = selectorList[ j ] as String;

                // Split at : and only keep index 0
                var pseudoParts:Array = sel.split( ":" );

                sel = pseudoParts[ 0 ] as String;

                // Split at [ and only keep index 0
                var attrQuryParts:Array = sel.split( "[" );

                sel = attrQuryParts[ 0 ] as String;

                // Split at spaces
                var selectorNames:Array = sel.split( " " );

                for( var k:int = 0; k < selectorNames.length; k++ ){

                    var selName:String = selectorNames[ k ] as String;

                    if( selName == null || selName == "" ){

                        continue;

                    }

                    // Check for direct class applications such as p.class-name
                    var selDotIndex:int = selName.indexOf( ".", 1 );
                    if( selDotIndex != -1 ){

                        // Add the extra classes
                        var dotParts:Array = selName.split( "." );

                        for( var d:int = 0; d < dotParts.length; d++ ){

                            var dotPrt:String = dotParts[ d ] as String;

                            if( d > 0 ){

                                dotPrt = "." + dotPrt;

                                if( d == 1 && selName.indexOf( "." ) === 0 ){

                                    selName = dotPrt;

                                }else{

                                    selectorNames.push( dotPrt );

                                }

                            }else{

                                if( dotPrt != "" ){

                                    selName = dotPrt;

                                }

                            }

                        }

                    }

                    // Only add unique items
                    if( newSelectors.indexOf( selName ) == -1 ){

                        // Avoid @ prefix && avoid *
                        if( selName.charAt( 0 ) != "@" && selName != "*" ){

                            newSelectors.push( selName );

                            switch( selName.charAt( 0 ) ){

                                case ".":
                                    classes.push( selName );
                                    break;

                                case "#":
                                    ids.push( selName );
                                    break;

                                default:
                                    elements.push( selName );
                                    break;

                            }

                        }

                    }

                }

            }

        }

        if( includeElements ){

            newSelectorCollection.source = elements.sort().concat( ids.sort().concat( classes.sort() ) );

        }else{

            newSelectorCollection.source = ids.sort().concat( classes.sort() );

        }

        return newSelectorCollection;

    }
