Regex to extract parts of string

I'm setting up some campaign codes which will appear as a query parameter in a URL. I'd like to automate the reporting of these campaign codes and have set them up in such a way that each parameter within the code has a specific set of values, which are recognised in the system via a lookup. However, the end part of the string is free text. Here's an example:

socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1

As explained above, parameters 1-7 can each take a number of different values that are already known to the system, so I can just use a contains query to extract each of these values and use them in a lookup to get their report-friendly names. However, how can I extract the last part of the string, e.g. mffs201403_sbj1, which is optional, but will always be free text of variable length and will always appear after the 7th colon?

In addition, is there a way to capture only the mffs201403 bit, given that I always use an underscore to separate the two parts at the end? This is because the first part identifies an individual campaign, whereas the second part identifies a variant of that campaign, if it exists. So I'd like to report on all campaign variants, e.g. mffs201403_sbj1, mffs201403_sbj2, etc., as well as mffs201403 as a whole.

I've been trying to get my head around regex for the longest time and I've been unable to master it, so if anyone can help me with this I'd be extremely grateful.

I'm not sure what language you use, but this works fine in C#:

using System;
using System.Text.RegularExpressions;

var input = "socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1";
// Seven "value:" prefixes, then the last variable split at the underscore.
var pattern = "^(?:[^:]+:){7}(?<last>(?<part1>[^_]+)_(?<part2>[^_]+))$";
var match = Regex.Match(input, pattern);

if (match.Success)
{
    Console.WriteLine("Last: {0}", match.Groups["last"].Value);
    Console.WriteLine("Part1: {0}", match.Groups["part1"].Value);
    Console.WriteLine("Part2: {0}", match.Groups["part2"].Value);
}

It outputs:

Last: mffs201403_sbj1
Part1: mffs201403
Part2: sbj1

The regex works by finding "any characters other than :" followed by a :, and repeats this 7 times. Then it looks for "any characters other than _" on either side of a _, and puts the last parts in separate named groups so they are easy to extract in code.
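For readers not using C#, here is a minimal sketch of the same idea in Python. The function name `parse_code` is made up for illustration; the only real difference from the C# version is that Python spells named groups `(?P<name>...)`.

```python
import re

# Seven "value:" prefixes, then the last variable split at the underscore.
PATTERN = re.compile(
    r"^(?:[^:]+:){7}"
    r"(?P<last>(?P<part1>[^_]+)_(?P<part2>[^_]+))$"
)

def parse_code(code):
    """Return (last, part1, part2), or None if the code does not match."""
    match = PATTERN.match(code)
    if match is None:
        return None
    return match.group("last"), match.group("part1"), match.group("part2")

print(parse_code("socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1"))
# ('mffs201403_sbj1', 'mffs201403', 'sbj1')
```

Like the pattern it mirrors, this sketch returns no match when the last variable is missing or has no underscore.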

If you use some kind of third-party tool that just takes a regex, I guess this will work better:

^(?:[^:]+:){7}([^_]*)_?([^_]*)$

Subgroups 1 and 2 will contain the two parts of the last variable, but the pattern also handles cases where the last variable is empty (as long as the seventh colon is still present), where it doesn't contain a _, or where either of the parts before and after the _ is empty.
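To see how this more tolerant pattern behaves on the edge cases, here is a small Python sketch. The helper name `split_last` and the extra sample strings are assumptions added for illustration.

```python
import re

TOLERANT = re.compile(r"^(?:[^:]+:){7}([^_]*)_?([^_]*)$")

def split_last(code):
    """Return (part1, part2) from the last variable, or None if no match."""
    match = TOLERANT.match(code)
    return match.groups() if match else None

samples = (
    "socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1",  # campaign and variant
    "socfb:obb:img:beg:rp:lo:mff:mffs201403",       # no underscore
    "socfb:obb:img:beg:rp:lo:mff:",                 # empty last variable
    "socfb:obb:img:beg:rp:lo:mff",                  # seventh colon missing
)
for code in samples:
    print(repr(code), "->", split_last(code))
```

The first three samples match (with empty strings for the missing parts), while the last one returns None because the pattern still requires seven colons.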

To match just the last variable, and nothing else, this regex can be used:

[^:]*$

$ is the end of the string, and we match everything before it that isn't a :.
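A quick Python sketch of that pattern, using `re.search` so the match can start anywhere in the string (the function name `last_segment` is hypothetical):

```python
import re

def last_segment(code):
    """Return whatever follows the final colon (possibly an empty string)."""
    # [^:]*$ can always match, at worst as an empty string at the very end.
    return re.search(r"[^:]*$", code).group(0)

print(last_segment("socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1"))
# mffs201403_sbj1
```

Note that this pattern does not count colons: if the optional last variable is absent, it returns the seventh parameter (mff in the example) instead.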

However, to match something in the middle of the string without also matching the surrounding characters, it gets a bit tricky, and it may not be possible at all if your tool supports neither capture groups nor lookarounds. If you know that the string will never contain any _, except in the last variable, you could use:

[^:]*_

This works in much the same way, but will always include the _ in the match.
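As a rough Python illustration of that last pattern: the first helper uses it as written, and the second is a variant that assumes the engine supports lookahead, which leaves the _ out of the match. Both function names are made up for this sketch.

```python
import re

def campaign_with_underscore(code):
    """Match up to and including the underscore, e.g. 'mffs201403_'."""
    match = re.search(r"[^:]*_", code)
    return match.group(0) if match else None

def campaign_only(code):
    """Lookahead variant: require a following '_' without consuming it."""
    match = re.search(r"[^:_]*(?=_)", code)
    return match.group(0) if match else None

code = "socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1"
print(campaign_with_underscore(code))  # mffs201403_
print(campaign_only(code))             # mffs201403
```

Both helpers return None when the string contains no underscore, and both rely on the stated assumption that _ only ever appears in the last variable.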

Something like so should work for you: (\w+:){7}([^_]+)_(\w+) (inside a Java string literal each backslash has to be doubled, giving \\w).

This regular expression expects to find 7 repetitions of a group of word characters followed by a colon (word characters are denoted by \w, which means upper case letters, lower case letters, digits and underscores), followed by a string that is split in two by an underscore.

If the last segment does not exist, the regular expression will fail to match. A working example can be found here.

In Java this would translate to:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main
{
    public static void main(String[] args)
    {
        // Group 2 is the campaign, group 3 is the variant.
        Pattern p = Pattern.compile("(\\w+:){7}([^_]+)_(\\w+)");
        String str1 = "socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1";
        String str2 = "socfb:obb:img:beg:rp:lo:mff";

        Matcher m1 = p.matcher(str1);
        if(m1.find())
        {
            System.out.println(m1.group(2));
            System.out.println(m1.group(3));
        }
        else
        {
            System.out.println("No content found for " + str1);
        }

        // str2 has no last segment, so the pattern does not match.
        Matcher m2 = p.matcher(str2);
        if(m2.find())
        {
            System.out.println(m2.group(2));
            System.out.println(m2.group(3));
        }
        else
        {
            System.out.println("No content found for " + str2);
        }
    }
}

Yields:

mffs201403
sbj1
No content found for socfb:obb:img:beg:rp:lo:mff

Not quite a direct answer to your question, but: if this is done within a script then you don't really need to use a regex. Whichever programming language you're using should have a string splitting function, which will be easier to use and much more readable.

For example in Python:

query_parameter = "socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1"  # example value
strings = query_parameter.split(":")
final_string = strings[-1]

Then, to split up that string:

campaign = final_string.split("_")[0]
try:
    variant = final_string.split("_")[1]
except IndexError:
    variant = ""
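The two steps above could be wrapped into one helper, sketched below. The function name `parse_campaign_code` and its `fixed_fields` parameter are hypothetical. It limits the split to seven colons so that free text containing a colon stays in one piece, and uses `str.partition` so a missing variant needs no exception handling.

```python
def parse_campaign_code(code, fixed_fields=7):
    """Return (fixed parameters, campaign, variant) for a campaign code."""
    # Split on at most `fixed_fields` colons; anything after is free text.
    parts = code.split(":", fixed_fields)
    fixed = parts[:fixed_fields]
    free_text = parts[fixed_fields] if len(parts) > fixed_fields else ""
    # partition() returns empty strings when there is no underscore.
    campaign, _, variant = free_text.partition("_")
    return fixed, campaign, variant

print(parse_campaign_code("socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1"))
# (['socfb', 'obb', 'img', 'beg', 'rp', 'lo', 'mff'], 'mffs201403', 'sbj1')
```

When the optional free text is absent, both `campaign` and `variant` come back as empty strings, so reporting code can group on `campaign` for the campaign as a whole and on the pair for individual variants.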
