REGEX finding strings within a string

I seem to write one regular expression a year and always end up asking for help.

Here's a string (it's a search string from Solr) and I want to select every instance of the search word.

Here's the input:

http://server:8080/solr/app/select?q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+title_st_da%3Atheory+OR+title_st_fr%3Atheory+OR+title_st_de%3Atheory+OR+title_st_it%3Atheory+OR+title_st_no%3Atheory+OR+title_st_sv%3Atheory+OR+title_st_ru%3Atheory+OR+title_st_es%3Atheory+OR+title_st_bg%3Atheory+OR+title_st_cs%3Atheory+OR+title_st_tr%3Atheory+OR+title_st_nl%3Atheory+OR+title_st_zh-cn%3Atheory+OR+title_st_zh-tw%3Atheory+OR+title_st_hr%3Atheory+OR+title_st_et%3Atheory+OR+title_st_he%3Atheory+OR+title_st_hu%3Atheory+OR+title_st_ja%3Atheory+OR+title_st_ko%3Atheory+OR+title_st_pl%3Atheory+OR+title_st_ro%3Atheory+OR+title_st_th%3Atheory+OR+title_st_vi%3Atheory+OR+content_stemming_en%3Atheory+OR+content_stemming_no%3Atheory+OR+(backfields%3Atheory))+AND+(((virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND+-(virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSF%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFMAG%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFRA%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_INTERNAL%5C%5CL%22+OR+virtualPath%3A

I need to select the text between every '%3A' and '+OR', as well as the final '%3Atheory))'. In this case the word is 'theory', but it will be a different word every time; the only known thing is that it will be alphabetic text between the '%3A' and the '+OR'. The matching also needs to stop at the '+AND+'.

I've got as far as /%3A(.*?)[+OR]/g - it's a start I guess... It doesn't find '%3Atheory))' and it doesn't stop at '+AND+'.

I'm struggling with 'find this' OR 'find that', as well as with stopping at a given string.

Can anyone offer some guidance?

If you're using C#, it might be better to split this into two operations, using String.Split and then Regex.Matches, like so:

string input = @"http://server:8080/solr/app/select?q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+title_st_da%3Atheory+OR+title_st_fr%3Atheory+OR+title_st_de%3Atheory+OR+title_st_it%3Atheory+OR+title_st_no%3Atheory+OR+title_st_sv%3Atheory+OR+title_st_ru%3Atheory+OR+title_st_es%3Atheory+OR+title_st_bg%3Atheory+OR+title_st_cs%3Atheory+OR+title_st_tr%3Atheory+OR+title_st_nl%3Atheory+OR+title_st_zh-cn%3Atheory+OR+title_st_zh-tw%3Atheory+OR+title_st_hr%3Atheory+OR+title_st_et%3Atheory+OR+title_st_he%3Atheory+OR+title_st_hu%3Atheory+OR+title_st_ja%3Atheory+OR+title_st_ko%3Atheory+OR+title_st_pl%3Atheory+OR+title_st_ro%3Atheory+OR+title_st_th%3Atheory+OR+title_st_vi%3Atheory+OR+content_stemming_en%3Atheory+OR+content_stemming_no%3Atheory+OR+(backfields%3Atheory))+AND+(((virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND+-(virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSF%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFMAG%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFRA%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_INTERNAL%5C%5CL%22+OR+virtualPath%3A";
// Lazily capture the text after %3A, up to the next "+OR" or the closing "))".
Regex regex = new Regex(@"%3A(.*?)(?:\+OR|\)\))");

// Keep only the part of the query before the first "+AND+".
var parts = input.Split(new[] { "+AND+" }, StringSplitOptions.None);
var matches = regex.Matches(parts[0]);

foreach (Match m in matches)
{
    // Or whatever you like to do with your matches
    Console.WriteLine(m.Groups[1].Value);
}
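
To try this without the full URL, the same two steps can be wrapped in a small helper and run on a shortened input. This is only a sketch: the SolrTerms class, the ExtractTerms method and the sample string are made up for illustration, and it assumes the search word itself never contains +AND+. Note that (?:\+OR|\)\)) is an alternation between two literal strings, whereas [+OR] in the question's attempt is a character class that matches a single +, O or R.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SolrTerms
{
    // Lazily capture the text after %3A, up to the next "+OR" or "))".
    private static readonly Regex TermPattern =
        new Regex(@"%3A(.*?)(?:\+OR|\)\))");

    // Hypothetical helper: returns the captured words from the part of the
    // query that precedes the first "+AND+".
    public static List<string> ExtractTerms(string query)
    {
        string head = query.Split(new[] { "+AND+" }, StringSplitOptions.None)[0];
        var terms = new List<string>();
        foreach (Match m in TermPattern.Matches(head))
        {
            terms.Add(m.Groups[1].Value);
        }
        return terms;
    }

    public static void Main()
    {
        // Shortened, made-up sample in the same shape as the question's URL.
        string sample = "q=(title_st_en%3Atheory+OR+(backfields%3Atheory))"
                      + "+AND+((virtualPath%3A%22%5C%5CSERVER%22+OR+x%3Ay))";
        foreach (string term in ExtractTerms(sample))
        {
            Console.WriteLine(term);
        }
    }
}
```

With the sample above this prints theory twice: once for the +OR entry and once for the final entry that ends in )). Nothing after +AND+ is examined.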

Regex.Split keeps the separating strings in its result when the pattern wraps them in a capturing group. So for the text given in the question, code like that below will split it into pieces:

string[] pieces = Regex.Split(theInputText, "(%3A.*?\\+(?:AND|OR))");
foreach (string ss in pieces)
{
    Console.WriteLine(ss);
}

Here is a small section of the output:

+virtualPath
%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR
+virtualPath
%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND
+-(virtualPath
%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR
+virtualPath

Having split the string into pieces, it should be a simple matter to pick out the array elements with the correct starting and ending characters, and also to find the last %3Atheory... entry.
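
One way that screening step might look is sketched below. The PieceFilter class, the WordsBeforeAnd method and the shortened sample string are assumptions made for illustration, not part of the question. The sketch also assumes that the first piece ending in +AND is the last entry of interest, and that any trailing ) characters on it are not part of the search word.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PieceFilter
{
    // Hypothetical helper: split with the pattern shown above, then keep the
    // word from each "%3A...+OR" piece and from the first "%3A...+AND" piece.
    public static List<string> WordsBeforeAnd(string text)
    {
        string[] pieces = Regex.Split(text, "(%3A.*?\\+(?:AND|OR))");
        var words = new List<string>();
        foreach (string piece in pieces)
        {
            if (!piece.StartsWith("%3A", StringComparison.Ordinal))
            {
                continue; // field names and other text between the separators
            }
            string body = piece.Substring(3); // drop the leading "%3A"
            if (body.EndsWith("+OR", StringComparison.Ordinal))
            {
                words.Add(body.Substring(0, body.Length - 3));
            }
            else if (body.EndsWith("+AND", StringComparison.Ordinal))
            {
                // Last entry, e.g. "theory))+AND": drop "+AND" and the brackets.
                words.Add(body.Substring(0, body.Length - 4).TrimEnd(')'));
                break; // ignore everything after the first +AND
            }
        }
        return words;
    }

    public static void Main()
    {
        // Shortened, made-up sample in the same shape as the question's URL.
        string sample = "q=(title_st_en%3Atheory+OR+(backfields%3Atheory))"
                      + "+AND+((virtualPath%3A%22x%22+OR+y%3Az))";
        foreach (string word in WordsBeforeAnd(sample))
        {
            Console.WriteLine(word);
        }
    }
}
```

For the sample this prints theory twice and stops at the first +AND piece, so none of the virtualPath entries are reported.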

Note: The question discusses +OR and +AND+, but every +OR is also followed by a +, so it may be better to include a final + in the expression, as in ...OR)\\+) .

Note: The inner brackets in the regular expression are non-capturing, i.e. (?: ) . If they were capturing brackets, the AND and OR captures would also be included in the output array.

