
Regex to match the first item based on a word

The following is a string that I would like to parse:

a='   //TS_START
    /*TG_HEADER_START
        title="XYX"
        ident=""
    */
    /*
    <TC_HEADER_START>
        title=" Halted after Tester Connect" 
        ident="TC1" 
        variants="A C" 
        name="TC">
        TestcaseDescription= This >
        TestcaseRequirements=36978
        StakeholderRequirements=1236                
        TestcaseParameters:
        TS_Implemented=Yes;
        TS_Automation=Automated;
        TS_Techniques= Testing;
        TS_Priority=1;
        TS_Tested_By=qz9ghv;
        TS_Review_done=Yes;
        TS_Regression=No
        TestcaseTestType=Test  
    </TC_HEADER_END>
    <TC_HEADER_START>
        title=" Halted after Tester Connect" 
        ident="TC1" 
        variants="A C" 
        name="TC">
        TestcaseDescription= This >
        TestcaseRequirements=36978
        StakeholderRequirements=1236                
        TestcaseParameters:
        TS_Implemented=Yes;
        TS_Automation=Automated;
        TS_Techniques= Testing;
        TS_Priority=1;
        TS_Tested_By=qz9ghv;
        TS_Review_done=Yes;
        TS_Regression=No
        TestcaseTestType=Test  
    </TC_HEADER_END>
    */
    testcase TC_GEEA2_VGM_DOIP_01(char strDescription[], char strReq[], char strParams[])
    {
     }
    /*TG_HEADER_END*/




    zd.a.S,D.,AS'
    A/S,D/.A.SD./
    //<TS_END>'

I would like to parse the string and get a list of substrings, each starting with <TC_HEADER_START> and ending with </TC_HEADER_END>. I tried the following regex, but it matches everything from the first <TC_HEADER_START> to the last </TC_HEADER_END> as one match instead of stopping at the first closing tag.

aa=re.findall(r'<TC_HEADER_START>([\s\S]*)</TC_HEADER_END>',a)

Expected output

aa=['<TC_HEADER_START>
        title=" Halted after Tester Connect" 
        ident="TC1" 
        variants="A C" 
        name="TC">
        TestcaseDescription= This >
        TestcaseRequirements=36978
        StakeholderRequirements=1236                
        TestcaseParameters:
        TS_Implemented=Yes;
        TS_Automation=Automated;
        TS_Techniques= Testing;
        TS_Priority=1;
        TS_Tested_By=qz9ghv;
        TS_Review_done=Yes;
        TS_Regression=No
        TestcaseTestType=Test  
    </TC_HEADER_END>','<TC_HEADER_START>
        title=" Halted after Tester Connect" 
        ident="TC1" 
        variants="A C" 
        name="TC">
        TestcaseDescription= This >
        TestcaseRequirements=36978
        StakeholderRequirements=1236                
        TestcaseParameters:
        TS_Implemented=Yes;
        TS_Automation=Automated;
        TS_Techniques= Testing;
        TS_Priority=1;
        TS_Tested_By=qz9ghv;
        TS_Review_done=Yes;
        TS_Regression=No
        TestcaseTestType=Test  
    </TC_HEADER_END>']

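Your regex is almost correct: you want a lazy quantifier ( *? ) instead of a greedy one ( * ). The greedy [\s\S]* consumes as much as possible, so it runs to the last </TC_HEADER_END>; the lazy [\s\S]*? stops at the first one it can reach.

Try this: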

<TC_HEADER_START>([\s\S]*?)</TC_HEADER_END>

You can also try it on regex101.
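To make the difference concrete, here is a minimal sketch comparing the two quantifiers. The `text` variable is a shortened stand-in for the string in the question, and the `ident="TC2"` value is an assumption added only so the two blocks can be told apart.

```python
import re

# Shortened stand-in for the question's string (two header blocks).
# ident="TC2" is made up here so the blocks are distinguishable.
text = """/*
<TC_HEADER_START>
    ident="TC1"
</TC_HEADER_END>
<TC_HEADER_START>
    ident="TC2"
</TC_HEADER_END>
*/"""

# Greedy: [\s\S]* runs to the LAST closing tag, giving one big match.
greedy = re.findall(r'<TC_HEADER_START>([\s\S]*)</TC_HEADER_END>', text)

# Lazy: [\s\S]*? stops at the FIRST closing tag, giving one match per block.
lazy = re.findall(r'<TC_HEADER_START>([\s\S]*?)</TC_HEADER_END>', text)

print(len(greedy))  # 1
print(len(lazy))    # 2
```

With the lazy pattern each list element holds only the text between one pair of tags.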

Edit:

If you want to include the enclosing tags, wrap them in capturing groups too:

(<TC_HEADER_START>)([\s\S]*?)(</TC_HEADER_END>)

Updated on regex101.
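One thing to watch when using this pattern from Python: with more than one capturing group, `re.findall` returns a tuple per match rather than a single string. If the goal is a flat list of strings that include the tags, as in the expected output of the question, dropping the groups entirely is one option. A small sketch, again on a shortened stand-in string (the `ident="TC2"` value is an assumption for illustration):

```python
import re

# Shortened stand-in for the question's string; ident="TC2" is made up.
text = """/*
<TC_HEADER_START>
    ident="TC1"
</TC_HEADER_END>
<TC_HEADER_START>
    ident="TC2"
</TC_HEADER_END>
*/"""

# Three groups: findall yields a 3-tuple (open tag, body, close tag) per match.
grouped = re.findall(r'(<TC_HEADER_START>)([\s\S]*?)(</TC_HEADER_END>)', text)

# No groups: findall yields the whole match, tags included, as one string.
whole = re.findall(r'<TC_HEADER_START>[\s\S]*?</TC_HEADER_END>', text)

print(type(grouped[0]).__name__, len(grouped[0]))  # tuple 3
print(whole[0])
```

Joining each tuple with `''.join(...)` would give the same strings as `whole`.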

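See re.M and re.S in the Python docs: https://docs.python.org/3/library/re.html?highlight=re.S#re.MULTILINE . With re.S (re.DOTALL), . also matches newlines, so a lazy .*? can span several lines: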

import re

aa=re.findall(r'<TC_HEADER_START>(.*?)</TC_HEADER_END>',a,re.S)
print(len(aa))
print(aa[0])

Output:

2

    title=" Halted after Tester Connect" 
    ident="TC1" 
    variants="A C" 
    name="TC">
    TestcaseDescription= This >
    TestcaseRequirements=36978
    StakeholderRequirements=1236                
    TestcaseParameters:
    TS_Implemented=Yes;
    TS_Automation=Automated;
    TS_Techniques= Testing;
    TS_Priority=1;
    TS_Tested_By=qz9ghv;
    TS_Review_done=Yes;
    TS_Regression=No
    TestcaseTestType=Test  
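For comparison, a short sketch of what the re.S flag changes, and of `re.search` for the case where only the first block is wanted. The `text` string is a shortened stand-in for the one in the question, and `ident="TC2"` is an assumption added for illustration.

```python
import re

# Shortened stand-in for the question's string; ident="TC2" is made up.
text = """/*
<TC_HEADER_START>
    ident="TC1"
</TC_HEADER_END>
<TC_HEADER_START>
    ident="TC2"
</TC_HEADER_END>
*/"""

pattern = r'<TC_HEADER_START>(.*?)</TC_HEADER_END>'

# Without re.S, '.' does not match '\n', so no match can cross a line break.
without_flag = re.findall(pattern, text)

# With re.S, '.' matches newlines too, so each block is found.
with_flag = re.findall(pattern, text, re.S)

# re.search returns only the first match (or None if there is none).
first = re.search(pattern, text, re.S)

print(without_flag)         # []
print(len(with_flag))       # 2
print(first.group(1).strip())
```

Here `.*?` with re.S behaves the same as `[\s\S]*?` without the flag; which one to use is mostly a matter of taste.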
