Regular expression for extracting multiple patterns

I have a string like this:

string="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""

tokens=re.findall('(.*)\r\n(.*?:)(.*?])',string)

Output

 ('Claim Status', '[Primary Status:', ' Paidup to Rebilled]')
 ('General Info.', '[PA Number:', ' R180126187]')
 ('Claim Insurance: Modified', '[Ins. Mode:', ' Primary]')

Wanted output:

 ('Claim Status', 'Primary Status:Paidup to Rebilled')
 ('General Info.', 'PA Number:R180126187')
 ('Claim Insurance: Modified', 'Ins. Mode:Primary','ICN: ########', 'Id: #########')

You may achieve what you need with a solution like this:

import re
s="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
res = []
for m in re.finditer(r'^(.+)(?:\r?\n\s*\[(.+)])?\r?$', s, re.M):
    t = []
    t.append(m.group(1).strip())
    if m.group(2):
        t.extend([x.strip() for x in m.group(2).strip().split('], [') if ':' in x])
    res.append(tuple(t))
print(res)

Output:

[('Claim Status', 'Primary Status: Paidup to Rebilled'), ('General Info.', 'PA Number: #######'), ('Claim Insurance: Modified', 'Ins. Mode: Primary', 'ICN: #######', 'Id: ########')]

With the ^(.+)(?:\r?\n\s*\[(.+)])?\r?$ regex, you match two consecutive lines, the second being optional (due to the (?:...)? optional non-capturing group). The first line is captured into Group 1, and the following one (which starts with [ and ends with ]) is captured into Group 2 without its outer brackets. (Note that \r?$ is necessary because in multiline mode $ only matches before a newline, not before a carriage return.) The Group 1 value is added to a temporary list; then the contents of Group 2 are split on ], [ (if you are not sure about the amount of whitespace, you may use re.split(r']\s*,\s*\[', m.group(2))), and only the items that contain a : are added to the temporary list.
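The whitespace-tolerant variant mentioned above can be sketched as follows. The function name parse_sections and the irregular spacing in the sample string are assumptions made for illustration; the regex and the re.split pattern are the ones from this answer.

```python
import re

# Sample input modeled on the question; the uneven spacing around the
# commas in the last line is an assumption added to exercise re.split.
s = ("Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\n"
     "General Info.\r\n[PA Number: #######]\r\n"
     "Claim Insurance: Modified\r\n"
     "[Ins. Mode: Primary],[Corrected Claim Checked] ,  [ICN: #######], [Id: ########]")

def parse_sections(text):
    """Hypothetical helper: pair each heading line with its 'key: value' items."""
    res = []
    for m in re.finditer(r'^(.+)(?:\r?\n\s*\[(.+)])?\r?$', text, re.M):
        row = [m.group(1).strip()]  # strip() also drops a trailing \r
        if m.group(2):
            # Split on "]", optional whitespace, ",", optional whitespace, "["
            parts = re.split(r']\s*,\s*\[', m.group(2))
            # Keep only the items that look like "key: value"
            row.extend(p.strip() for p in parts if ':' in p)
        res.append(tuple(row))
    return res

print(parse_sections(s))
```

Compared with str.split('], ['), this version still separates the bracketed items when the spacing around the commas is inconsistent.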

You are getting three elements per result because your pattern contains three capturing groups, and re.findall returns one element per capturing group. Rewrite your regexp like this to combine the second and third groups into one:

re.findall('(.*)\r\n((?:.*?:)(?:.*?]))',string)

A group delimited by (?:...) (instead of (...)) is "non-capturing", i.e. it doesn't count as a match target for \1 etc., and it does not get "seen" by re.findall. I have made both your groups non-capturing, and added a single capturing (regular) group around them.
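A small comparison may make the difference concrete. This is only a sketch on a shortened version of the question's string; the variable names three_groups and two_groups are made up for illustration.

```python
import re

# First two sections of the question's string (shortened for illustration)
s = ("Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\n"
     "General Info.\r\n[PA Number: #######]")

# Original pattern: three capturing groups, so three elements per tuple
three_groups = re.findall(r'(.*)\r\n(.*?:)(.*?])', s)

# Rewritten pattern: the two inner groups are non-capturing and one
# capturing group wraps them, so two elements per tuple
two_groups = re.findall(r'(.*)\r\n((?:.*?:)(?:.*?]))', s)

print(three_groups)
# [('Claim Status', '[Primary Status:', ' Paidup to Rebilled]'), ('General Info.', '[PA Number:', ' #######]')]
print(two_groups)
# [('Claim Status', '[Primary Status: Paidup to Rebilled]'), ('General Info.', '[PA Number: #######]')]
```

Note that this pattern keeps the square brackets in the result and stops at the first ], so on a line with several bracketed items it returns only the first one.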
