Regex to return a list of all capitalized phrases in a string

Hi, I've been fooling around with this for a while and figured it was time to ask for help.

I'm trying to return all sequences of capital letters (phrases with no digits or special characters) longer than 5 characters from a wacky string.

So for:

02/02/12-02:45 PM(CKI)-DISC RSPNS SRVD 01/31/12-PRINTED DISCOVERY:spina.bp.doc(DGB)   
01/27/12-ON CAL-FILED NOTICE OF TRIAL(JCX) 01/24/12-SENT OUR DEMANDS(Auto-Gen) 01/23/12-
02:31  PM-File pulled and given to KG for responses.(JLS) 01/20/12(PC)-rcd df jmt af

I would want to return a list of:

DISC RSPNS SRVD

PRINTED DISCOVERY

FILED NOTICE OF TRIAL

SENT OUR DEMANDS

I've been fooling around with variations of the following:

[A-Z][A-Z\d]+
[A-Z][A-Z\d]+ [A-Z][A-Z\d]+

However, this is a little outside my scope of knowledge with regex.

Edit

I'm trying:

string[] capWords = Regex.Split(d.caption, @"[A-Z\s]{5,}");
foreach (var u in capWords) { Console.WriteLine(u); }

Outputting:

02/02/12-02:45 PM(CKI)- 01/31/12-

:spina.bp.doc(DGB) 01/27/12-

(JCX) 01/24/12- (Auto-Gen) 01/23/12-02:31 PM-File pulled and given to KG for responses.(JLS) 01/20/12(PC)-rcd df jmt af

Kendall's suggestion outputs:

02/02/12-02:45 PM(CKI)- 01/31/12-

:spina.bp.doc(DGB) 01/27/12-

(JCX) 01/24/12- (Auto-Gen) 01/23/12-02:31 PM-File pulled and given to KG for responses.(JLS) 01/20/12(PC)-rcd df jmt af

Here you go:

[A-Z\s]{5,}

Tested and returns only the items you listed.

Explanation:

[A-Z\s] - matches only capital letters and whitespace

{5,} - a match must be at least 5 characters long, with no upper limit on the number of characters

Code:

MatchCollection matches = Regex.Matches(d.caption, @"[A-Z\s]{5,}");
foreach (Match match in matches)
{
    Console.WriteLine(match.Value);
}

Try this. I am assuming you want leading/trailing spaces stripped.

[A-Z][A-Z ]{4,}[A-Z]

Also, I don't think you want Regex.Split.

var matches = Regex.Matches(d.caption, @"[A-Z][A-Z ]{4,}[A-Z]");
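foreach (Match match in matches) // typed as Match: on older .NET versions "var" infers object here, so match.Value would not compile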
{
    Console.WriteLine(match.Value);
}
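
To see why Regex.Split produced the output shown in the question, here is a minimal sketch. The shortened sample string and the class and method names are made up for illustration. Split returns the text between the matches, while Matches returns the matched phrases themselves.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitVersusMatches
{
    // Hypothetical shortened sample, not the full caption from the question.
    public const string Sample = "01/31/12-PRINTED DISCOVERY:spina.bp.doc(DGB)";

    // Same pattern as in the answer above.
    public const string Pattern = @"[A-Z][A-Z ]{4,}[A-Z]";

    // Regex.Split throws the matches away and keeps what lies between them.
    public static string[] Pieces(string input)
    {
        return Regex.Split(input, Pattern);
    }

    // Regex.Matches keeps the matched text itself.
    public static string[] Phrases(string input)
    {
        return Regex.Matches(input, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", Pieces(Sample)));  // 01/31/12- | :spina.bp.doc(DGB)
        Console.WriteLine(string.Join(" | ", Phrases(Sample))); // PRINTED DISCOVERY
    }
}
```

So the Split call in the question was returning exactly the parts that should have been discarded.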

You could also do:

var matches = Regex.Matches(d.caption, @"[A-Z][A-Z ]{4,}[A-Z]")
                   .OfType<Match>()
                   .Select(m => m.Value);
foreach (string match in matches)
{
    Console.WriteLine(match);
}

You asked for a single-regex solution, but with the given criteria and examples I could not get a single regex to count the characters in a string while ignoring a certain character type (spaces). It failed on character groups like ON CAL, which should not match but did because of the total character count.

So, to make sure that only character groups with more than 5 uppercase characters were matched, I had to use two regular expressions. This was a little cumbersome, and I was able to do it faster and much more simply with string methods.

This might work with a single regex if you could state some certainties about the formatting of the source text, for example, if we knew that the character groups you are looking for are always preceded by a dash and terminated either by a punctuation mark that is not a dash or by a number.

5 PM( --- FAIL (not preceded by a dash)

(CKI) --- FAIL (not preceded by a dash)

-DISC RSPNS SRVD 0 --- PASS

-PRINTED DISCOVERY: --- PASS

-ON CAL- --- FAIL (terminated by a dash)

-FILED NOTICE OF TRIAL( --- PASS

-SENT OUR DEMANDS( --- PASS
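
As a rough sketch of that idea, assuming the formatting rules above really do hold for every caption: a lookbehind requires the leading dash, and a lookahead requires a terminator (optionally after one space) that is not a letter, a space, or a dash. The class name, the Extract helper and the exact pattern are my own illustration and have only been checked against the sample text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DashDelimitedPhrases
{
    // Sample caption as used in this thread.
    public const string Caption =
        "02/02/12-02:45 PM(CKI)-DISC RSPNS SRVD 01/31/12-PRINTED"
        + " DISCOVERY:spina.bp.doc(DGB) 01/27/12-ON CAL-FILED NOTICE OF TRIAL(JCX) 01/24/12-SENT"
        + " OUR DEMANDS(Auto-Gen) 01/23/12- 02:31 PM-File pulled and given to KG for responses.(JLS) 01/20/12(PC)-rcd df jmt af ";

    // (?<=-)               the phrase must directly follow a dash
    // [A-Z][A-Z ]{4,}[A-Z] at least 6 characters, starting and ending with a capital
    // (?= ?[^A-Za-z \-])   then, after an optional space, a digit or non-dash punctuation
    // Assumption: a phrase at the very end of the input is not matched,
    // because the lookahead needs a terminating character.
    static readonly Regex Phrase =
        new Regex(@"(?<=-)[A-Z][A-Z ]{4,}[A-Z](?= ?[^A-Za-z \-])");

    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        foreach (Match m in Phrase.Matches(input))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        foreach (string phrase in Extract(Caption))
        {
            Console.WriteLine(phrase);
        }
    }
}
```

On the sample caption this skips ON CAL, because the only terminator available to it is a dash.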

Barring that, I have included code that will get you your results in one of two ways. I prefer the second.

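    // Needs: using System.Diagnostics; using System.Linq; using System.Text.RegularExpressions;
    // A regular string literal cannot span several lines, so the pieces are concatenated.
    String source1 = "02/02/12-02:45 PM(CKI)-DISC RSPNS SRVD 01/31/12-PRINTED"
        + " DISCOVERY:spina.bp.doc(DGB) 01/27/12-ON CAL-FILED NOTICE OF TRIAL(JCX) 01/24/12-SENT"
        + " OUR DEMANDS(Auto-Gen) 01/23/12- 02:31 PM-File pulled and given to KG for responses.(JLS) 01/20/12(PC)-rcd df jmt af ";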

    String assembledString;

    public void bumbleBeeTunaTest()
    {
        String strippedString = source1.Replace(" ", "");

        String regString1 = ""; 
        String regString2 = @"([A-Z]{6,})";
        String matchHold1,matchHold1First,matchHold1Last,matchHold1Middle;
        Int32 matchHold1Len;


        Regex regExTwo = new Regex(regString2);

        MatchCollection regMatch2 = regExTwo.Matches(strippedString);


        foreach (Match match2 in regMatch2)
        {
            matchHold1 = match2.Groups[1].Value;
            matchHold1Len = matchHold1.Length;
            matchHold1First = matchHold1.Substring(0,1);
            matchHold1Last = matchHold1.Substring(matchHold1Len - 1,1);
            matchHold1Middle = matchHold1.Substring(1, matchHold1Len - 2);


            Debug.Print("Stripped String Matches - " + matchHold1);


            regString1 = @"(" + matchHold1First + "[" + matchHold1Middle+  " ]{" + (matchHold1Len -1) + ",}" + matchHold1Last + ")";

            Regex regExOne = new Regex(regString1);

            MatchCollection regMatch1 = regExOne.Matches(source1);



            foreach (Match match1 in regMatch1)
            {

                Debug.Print("Re-Assembled Matches :" + match1.Groups[1].Value.ToString());
            }

        }

        // Does the same thing as the above.  Just a little simpler.
        for (int i = 0; i < source1.Length; i++)
        {
            if (char.IsUpper(source1[i]) || char.IsWhiteSpace(source1[i]))
            {
                assembledString += source1[i];
            }
            else
            {
                if (!string.IsNullOrEmpty(assembledString))
                {
                    if (assembledString.Count(char.IsUpper) > 5)
                    {
                        Debug.Print("Non Reg Ex Version "  + assembledString);
                    }
                    assembledString = "";
                }
            }
        }
    }

The output looks like this:

Stripped String Matches - DISCRSPNSSRVD
Re-Assembled Matches :DISC RSPNS SRVD
Stripped String Matches - PRINTEDDISCOVERY
Re-Assembled Matches :PRINTED DISCOVERY
Stripped String Matches - FILEDNOTICEOFTRIAL
Re-Assembled Matches :FILED NOTICE OF TRIAL
Stripped String Matches - SENTOURDEMANDS
Re-Assembled Matches :SENT OUR DEMANDS
Non Reg Ex Version DISC RSPNS SRVD 
Non Reg Ex Version PRINTED DISCOVERY
Non Reg Ex Version FILED NOTICE OF TRIAL
Non Reg Ex Version SENT OUR DEMANDS
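
For comparison, the same rule (more than 5 uppercase letters, spaces not counted) might also be expressible as one pattern if we assume that words inside a phrase are separated by exactly one space. This is only a sketch under that assumption. The pattern and the names LetterCountPhrases and Extract are mine, and it has only been checked against the sample caption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LetterCountPhrases
{
    // Sample caption as used in this thread.
    public const string Caption =
        "02/02/12-02:45 PM(CKI)-DISC RSPNS SRVD 01/31/12-PRINTED"
        + " DISCOVERY:spina.bp.doc(DGB) 01/27/12-ON CAL-FILED NOTICE OF TRIAL(JCX) 01/24/12-SENT"
        + " OUR DEMANDS(Auto-Gen) 01/23/12- 02:31 PM-File pulled and given to KG for responses.(JLS) 01/20/12(PC)-rcd df jmt af ";

    // One capital, then at least 5 more capitals, each optionally preceded
    // by a single space. Only the letters count toward the minimum of 6,
    // so "ON CAL" (5 letters) is rejected.
    // Assumption: never more than one space between words of a phrase.
    static readonly Regex Phrase = new Regex(@"[A-Z](?: ?[A-Z]){5,}");

    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        foreach (Match m in Phrase.Matches(input))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        foreach (string phrase in Extract(Caption))
        {
            Console.WriteLine(phrase);
        }
    }
}
```

Because every space has to be followed by a capital letter, the matches come back without leading or trailing spaces.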
