How do I do Python re.search substrings with multi-character wildcard?

Question

I'm trying to extract a substring from a string in Python. The front end to be trimmed is static and easy to implement, but the rear end has a counter that can run from "_0" to "_9999".

With my current code, the counter still gets included in the substring.

import re

text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"

print(text)
substring= re.search('runid_(.*)_*.fas', text).group(0)

print(substring)

Returns

0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fas

Alternatively,

substring= re.search(r"(?<=runid_).*?(?=_*.fastq)", text).group(0)

returns

0dc971f49c42ffb1412caee485f8421a1f9a26ed_0

Works better but the counter "_0" is still added.

How do I make a robust trim that trims the multi-character counter?

Answer 1

Your regex (?<=runid_).*?(?=_*.fastq) has a small problem. _* means zero or more underscores, so the underscore is optional and the lookahead can succeed without matching one. The unescaped . then matches any single character, so the lookahead first succeeds right before .fastq. By that point the lazy .*? has already consumed _0, which is why _0 shows up in your result. I think you meant _.* and you should also escape the . just before fastq. Requiring an underscore followed by one to four digits, the updated regex becomes:

(?<=runid_).+(?=_\d{1,4}\.fas)

Your updated Python code:

import re

text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"

print(text)
substring = re.search(r'(?<=runid_).+(?=_\d{1,4}\.fas)', text).group(0)

print(substring)

Prints,

0dc971f49c42ffb1412caee485f8421a1f9a26ed
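
To see why the lookahead matters, the sketch below runs the question's original pattern and the corrected one over file names whose counters span the stated range of _0 to _9999. Only the first file name comes from the question; the other two are made up for illustration.

```python
import re

# Pattern from the question: `_*` lets the lookahead succeed without an underscore.
ORIGINAL = re.compile(r"(?<=runid_).*?(?=_*.fastq)")
# Corrected pattern from this answer: an underscore and 1 to 4 digits are required.
FIXED = re.compile(r"(?<=runid_).+(?=_\d{1,4}\.fas)")

# The first name is from the question; the other two are hypothetical.
names = [
    "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq",
    "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_42.fastq",
    "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_9999.fastq",
]

original_results = [ORIGINAL.search(n).group(0) for n in names]
fixed_results = [FIXED.search(n).group(0) for n in names]

# Left column keeps the counter, right column has it trimmed.
for before, after in zip(original_results, fixed_results):
    print(before, "->", after)
```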

Alternatively, you can drop the lookbehind and capture the text in the first group with this regex:

runid_([^_]+)(?=_\d{1,4}\.fas)

Your Python code, now reading the text from group(1) instead of group(0):

import re

text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"

print(text)
substring = re.search(r'runid_([^_]+)(?=_\d{1,4}\.fas)', text).group(1)

print(substring)

In this case too it prints,

0dc971f49c42ffb1412caee485f8421a1f9a26ed
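
One design difference between the two patterns is worth noting: [^_]+ cannot cross an underscore, while .+ can. The sketch below assumes a hypothetical run id that itself contains an underscore (nothing in the question says this occurs) to show how each pattern behaves in that case.

```python
import re

GREEDY = re.compile(r"(?<=runid_).+(?=_\d{1,4}\.fas)")
NO_UNDERSCORE = re.compile(r"runid_([^_]+)(?=_\d{1,4}\.fas)")

# Hypothetical file name: the run id "ab_cd" contains an underscore.
name = "fastq_runid_ab_cd_7.fastq"

greedy_match = GREEDY.search(name)
strict_match = NO_UNDERSCORE.search(name)

# re.search returns None when nothing matches, so check before calling group().
greedy_id = greedy_match.group(0) if greedy_match else None
strict_id = strict_match.group(1) if strict_match else None

print(greedy_id)  # ab_cd
print(strict_id)  # None
```

If the run id never contains an underscore, as in the question's example, both patterns return the same result.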

Answer 2

You don't need look behind and look ahead to achieve that.

\d{1,4} means at least 1 and at most 4 digits; with any other number of digits the pattern won't match.

fastq_runid_(.+)_\d{1,4}\.fastq

https://regex101.com/r/VneElM/1
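
For completeness, here is a minimal Python sketch of how this pattern could be used. The helper name extract_run_id and the five-digit sample file name are assumptions for illustration. The function returns None instead of raising when the name does not match.

```python
import re

# Pattern from this answer; group 1 holds the run id without the counter.
PATTERN = re.compile(r"fastq_runid_(.+)_\d{1,4}\.fastq")


def extract_run_id(filename):
    """Return the run id from filename, or None if the pattern does not match."""
    match = PATTERN.search(filename)
    return match.group(1) if match else None


print(extract_run_id("fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"))
# A five-digit counter (hypothetical) falls outside {1,4}, so there is no match.
print(extract_run_id("fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_12345.fastq"))
```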

Answer 3

import re

text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_999.fastq"

print(text)
substring = re.search(r'fastq_runid_(\w+)_(\d+)\.fastq', text)

print(substring.group(1), substring.group(2))

group(1) will give what you want, group(2) will give the counter.
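
The same idea can be written with named groups, which some readers find easier to follow than numeric indices. This is only a sketch: the group names run_id and counter and the helper split_name are illustrative choices, not part of the answer above.

```python
import re

# Same pattern as above, with named groups instead of numbered ones.
PATTERN = re.compile(r"fastq_runid_(?P<run_id>\w+)_(?P<counter>\d+)\.fastq")


def split_name(filename):
    """Return (run_id, counter) or None if filename does not match."""
    match = PATTERN.search(filename)
    if match is None:
        return None
    # The counter is converted to int so it can be compared or sorted numerically.
    return match.group("run_id"), int(match.group("counter"))


print(split_name("fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_999.fastq"))
```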

solution1
1 ACCEPTED 2019-01-23 04:54:05

solution2
1 2019-01-23 04:57:41

solution3
1 2019-01-23 05:07:52
