Regex for matching this kind of String in Python 3.x

I have this string,

irn
1b6d13bbbe6e0e4bd8e5d7619bf7997672
bc42d1d2442b531a487f9061df2626

but sometimes, it will be

irn 
1b6d13bbbe6e0e4bd8e5d7619bf7997672bc42d1d2442b531a487f9061df2626

or

irn
no
1b6d13bbbe6e0e4bd8e5d7619bf7997672
bc42d1d2442b531a487f9061df2626

or

irn
no
1b6d13bbbe6e0e4bd8e5d7619bf7997672bc42d1d2442b531a487f9061df2626

I am reading the contents of a PDF with Apache Tika, and on the output I am using

re.findall(r'\w+',payload)

to pick up all the words and no other characters.
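For illustration, here is a minimal sketch of what that call returns. The `payload` value is a hypothetical stand-in for the Tika output, built from one of the samples above:

```python
import re

# Hypothetical stand-in for the text extracted by Apache Tika.
payload = (
    "irn\n"
    "no\n"
    "1b6d13bbbe6e0e4bd8e5d7619bf7997672\n"
    "bc42d1d2442b531a487f9061df2626"
)

# \w+ matches runs of letters, digits and underscores, so whitespace
# and punctuation act as separators and are dropped from the result.
tokens = re.findall(r"\w+", payload)
```

The result is a flat list of words, so the information about where each line ended is no longer available at that point.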

I am using this regex to match the above string,

irn(\s+?)(\w+\s+?)(([a-zA-Z0-9]{64})|([a-zA-Z0-9\s+]{65}))

this is working fine for

irn
no
1b6d13bbbe6e0e4bd8e5d7619bf7997672
bc42d1d2442b531a487f9061df2626

irn
no
1b6d13bbbe6e0e4bd8e5d7619bf7997672bc42d1d2442b531a487f9061df2626

but for this case:

irn
1b6d13bbbe6e0e4bd8e5d7619bf7997672
bc42d1d2442b531a487f9061df2626

the 2nd group captures the 2nd line, and a later group captures the 3rd line plus the lines that follow it, up to 64 characters.
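A small sketch that reproduces this behaviour with the regex from the question. The `TRAILING` text is hypothetical; it stands in for whatever follows the number in the extracted PDF output:

```python
import re

# The regex from the question, unchanged.
ORIGINAL_RE = re.compile(
    r"irn(\s+?)(\w+\s+?)(([a-zA-Z0-9]{64})|([a-zA-Z0-9\s+]{65}))"
)

FIRST = "1b6d13bbbe6e0e4bd8e5d7619bf7997672"
SECOND = "bc42d1d2442b531a487f9061df2626"
# Hypothetical text that follows the number in the document.
TRAILING = "some other words that follow the number in the document"

payload = f"irn\n{FIRST}\n{SECOND}\n{TRAILING}"
m = ORIGINAL_RE.search(payload)

# Group 2 is meant for the optional word, but \w+ also matches the
# first half of the number, so that half ends up here.
swallowed = m.group(2)

# Group 3 then starts at the second half and keeps consuming the
# following text until it has 65 characters.
captured = m.group(3)
```

The cause is that group 2 is mandatory in this pattern: when the optional word is missing, the first half of the number is the only thing left for it to match.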

Since I have no control over the data format in the PDF, can you please help me fix this?

The string always starts with "irn", then there may or may not be some words, and then comes the irn number, which is always 64 characters long.

You may use this regex, which makes the 2nd line an optional match:

^irn[\r\n]+(?:(\w+)[\r\n]+)?([a-zA-Z0-9\r\n]{64,65})$

Explanation:

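  • ^irn[\r\n]+ : Match irn at the start, followed by 1+ line-break characters (\r or \n)
  • (?:(\w+)[\r\n]+)? : Optionally match 1+ word characters followed by 1+ line-break characters, and capture the word in group #1
  • ([a-zA-Z0-9\r\n]{64,65}) : Match an alphanumeric character or a line-break character 64 or 65 times (64 characters of the number plus at most one line break inside it). Capture this in group #2
  • $ : End of the string (or end of a line when the multiline flag is set)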
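A minimal sketch of how this regex could be used from Python, checked against the four layouts from the question. The helper name `extract_irn`, the use of `re.MULTILINE` and the step that strips the line break are assumptions made for the example, not part of the answer itself:

```python
import re

# re.MULTILINE is an assumption: it lets ^ and $ match at line
# boundaries, so the block can sit in the middle of a longer text.
IRN_RE = re.compile(
    r"^irn[\r\n]+(?:(\w+)[\r\n]+)?([a-zA-Z0-9\r\n]{64,65})$",
    re.MULTILINE,
)

FIRST = "1b6d13bbbe6e0e4bd8e5d7619bf7997672"
SECOND = "bc42d1d2442b531a487f9061df2626"

# The four layouts described in the question.
samples = [
    f"irn\n{FIRST}\n{SECOND}",
    f"irn\n{FIRST}{SECOND}",
    f"irn\nno\n{FIRST}\n{SECOND}",
    f"irn\nno\n{FIRST}{SECOND}",
]


def extract_irn(text):
    """Return the 64-character number, or None if nothing matches."""
    m = IRN_RE.search(text)
    if m is None:
        return None
    # Group 2 may contain one line break inside the number; drop it.
    return re.sub(r"[\r\n]", "", m.group(2))


results = [extract_irn(s) for s in samples]
```

All four samples give the same 64-character string. Two limits of this sketch: a trailing space after irn (as in the second sample of the question) is not accepted by the pattern as written, so something like `^irn[ \t]*[\r\n]+` would be needed there, and a `\r\n` line break inside the number takes two characters, which exceeds the `{64,65}` bound.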
