
Regex to match different words and everything after, minus the ending

I have several variations of slug lines and I need to get the middle part of each. Luckily the pattern has only a few variations, but I can't get it to work for all of them.

1 INT. HIGH SCHOOL - DAY 1
EXT. HOUSE - NIGHT
2A INT. HOSPITAL - NIGHT 2A
3. EXT. AIRPORT - DAY 3.
4B. INT. MALL - NIGHT 4B.

What I would like to get is the string starting from INT or EXT right up to the last word, not including the number/letter/dot combination at either end. I would like to have this:

INT. HIGH SCHOOL - DAY
EXT. HOUSE - NIGHT
INT. HOSPITAL - NIGHT
EXT. AIRPORT - DAY
INT. MALL - NIGHT

Is there a clean way of doing this with a regex?

The best I get is using this:

@"(?:INT|EXT:).*$")

Unfortunately it only returns a string starting at INT up till the end, but doesn't work with EXT and doesn't get rid of the ending number/letter or dot.

You don't need to use Regex. Here is a working LINQ solution that drops every whitespace-separated token containing a digit:

var str = "1 INT.HIGH SCHOOL -DAY 1";
var newStr = String.Join(" ",str.Split().Where(s => !s.Any(c => Char.IsDigit(c)))).Trim();
Console.WriteLine(newStr);  // INT.HIGH SCHOOL -DAY

You could try this one:

((?:INT|EXT).*?)\s*\S*$
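  • (?:INT|EXT) : Matches INT or EXT
  • .*? : Lazily matches everything that follows
  • \s*\S*$ : Matches the last token of the line (it sits outside the capture group, so it is not included in the captured part)

Note that the last token is dropped unconditionally, so a line without a trailing number, such as EXT. HOUSE - NIGHT, loses its last word.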
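For reference, here is a sketch of applying a pattern of this shape in C#. It is a variant, not the exact pattern above: the trailing part is narrowed to an optional number-like token such as 1, 2A, 3. or 4B. (an assumption about what the scene numbers look like), so lines without a trailing number keep their last word. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SlugLineRegexSketch
{
    // Group 1: from "INT." or "EXT." up to, but not including,
    // an optional trailing scene number (assumed shape: digits, optional letter, optional dot).
    private static readonly Regex Slug = new Regex(
        @"\b((?:INT|EXT)\..*?)(?:\s+\d+[A-Za-z]?\.?)?\s*$");

    public static string Extract(string line)
    {
        Match m = Slug.Match(line);
        // Fall back to the trimmed input when no INT./EXT. prefix is found.
        return m.Success ? m.Groups[1].Value : line.Trim();
    }

    public static void Main()
    {
        string[] lines =
        {
            "1 INT. HIGH SCHOOL - DAY 1",
            "EXT. HOUSE - NIGHT",
            "2A INT. HOSPITAL - NIGHT 2A",
            "3. EXT. AIRPORT - DAY 3.",
            "4B. INT. MALL - NIGHT 4B."
        };
        foreach (string line in lines)
            Console.WriteLine(Extract(line));
    }
}
```

The lazy .*? inside the group is what lets the optional trailing group claim the scene number; with a greedy .* the number would end up inside the capture.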


This works and gives the result you need:

@".*((?:INT. |EXT. )[A-Za-z\. \-]+).*$"

Here is how to use it:

var vMatch = Regex.Match("1 INT. HIGH SCHOOL - DAY 1", @".*((?:INT. |EXT. )[A-Za-z\. \-]+).*$");
var extracted = vMatch.Groups[1].Value.Trim();

extracted contains INT. HIGH SCHOOL - DAY, as required.

https://regex101.com/r/zC8mG5/9

replace: (\d\w?\.? ?)(.*)\1
     to: \2

Does this fit your needs?
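In C#, the same replacement could be applied with Regex.Replace, where \2 is written as $2. This is a minimal sketch, assuming the scene number is repeated identically at both ends of the line (that is what the \1 backreference requires); the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SlugLineReplaceSketch
{
    // Group 1: leading scene number (digit, optional word char, optional dot, optional space).
    // Group 2: everything between the leading number and its repetition at the end.
    private const string Pattern = @"(\d\w?\.? ?)(.*)\1";

    public static string Strip(string line)
    {
        // Keep only group 2, then trim the spaces left around it.
        return Regex.Replace(line, Pattern, "$2").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(Strip("2A INT. HOSPITAL - NIGHT 2A"));
        Console.WriteLine(Strip("EXT. HOUSE - NIGHT"));
    }
}
```

Lines without a leading number, such as EXT. HOUSE - NIGHT, do not match and pass through unchanged. A line that has a number at only one end is also left as it is, because the backreference cannot match.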

Here's a non-regex approach that works as expected:

// Assumes: using System; using System.Collections.Generic; using System.Linq;
// and that list is a List<string> holding the slug lines.
string[] prefixes = { "INT", "EXT" };
for (int i = 0; i < list.Count; i++)
{
    string oldS = list[i].Trim();
    int indexOfLastSpace = oldS.LastIndexOf(' ');
    int endIndex = oldS.Length;   // exclusive end of the part to keep
    if (indexOfLastSpace >= 0)
    {
        string rest = oldS.Substring(indexOfLastSpace).TrimStart();
        // does the last token start with a digit?
        if (char.IsDigit(rest[0]))
            endIndex = indexOfLastSpace;
    }
    int start = 0;
    // index of the first prefix (in array order) found in the line, or -1
    int indexOfAnyPrefix = prefixes
        .Select(p => oldS.IndexOf(p, StringComparison.InvariantCultureIgnoreCase))
        .Where(index => index >= 0)
        .DefaultIfEmpty(-1)
        .First();
    if (indexOfAnyPrefix > 0)
        start = indexOfAnyPrefix;
    list[i] = oldS.Substring(start, endIndex - start);
}

An alternative with Regex and LINQ:

string s = @"1 INT. HIGH SCHOOL - DAY 1
EXT. HOUSE - NIGHT
2A INT. HOSPITAL - NIGHT 2A
3. EXT. AIRPORT - DAY 3.
4B. INT. MALL - NIGHT 4B.";

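const string startWithNum = @"^\d";
// RemoveEmptyEntries avoids blank rows when the text uses \r\n line endings
var rows = s.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
foreach (var line in rows.Select(item => new List<string>(item.Split(' '))))
{
    // drop a leading scene number, e.g. "2A" or "3."
    if (Regex.IsMatch(line[0], startWithNum))
        line.RemoveAt(0);
    // drop a trailing scene number, if any tokens remain
    if (line.Count > 0 && Regex.IsMatch(line[line.Count - 1], startWithNum))
        line.RemoveAt(line.Count - 1);
    Console.WriteLine(String.Join(" ", line));
}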

output:

INT. HIGH SCHOOL - DAY
EXT. HOUSE - NIGHT
INT. HOSPITAL - NIGHT
EXT. AIRPORT - DAY
INT. MALL - NIGHT

This would be my approach. I like to use the IgnorePatternWhitespace option to improve the readability of the expression.

I'm showing the data in one chunk, but it will also work if you are processing it line-by-line.

var text = "1 INT. HIGH SCHOOL - DAY 1" + Environment.NewLine;
text += "EXT. HOUSE - NIGHT" + Environment.NewLine;
text += "INT. HOSPITAL - NIGHT 2A" + Environment.NewLine;
text += "3. EXT. AIRPORT - DAY 3." + Environment.NewLine;
text += "4B. INT. MALL - NIGHT 4B." + Environment.NewLine;

var options = RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace;
var regex = new Regex("^ .*? (?<slug> (?:INT|EXT)\\. .*?) (?:\\s+?\\d.*?)? $", options );

var matches = regex.Matches( text );

foreach( Match m in matches ){
    Console.WriteLine( "|" + m.Groups["slug"].Value + "|" );
}

Produces:

|INT. HIGH SCHOOL - DAY|
|EXT. HOUSE - NIGHT |
|INT. HOSPITAL - NIGHT|
|EXT. AIRPORT - DAY|
|INT. MALL - NIGHT|
