I'm trying to extract a substring from a string in Python. The front end to be trimmed is static and easy to implement, but the rear end has a counter that can run from "_0" to "_9999".
With my current code, the counter still gets included in the substring.
import re
text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"
print(text)
substring= re.search('runid_(.*)_*.fas', text).group(0)
print(substring)
Returns
0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fas
Alternatively,
substring= re.search(r"(?<=runid_).*?(?=_*.fastq)", text).group(0)
returns
0dc971f49c42ffb1412caee485f8421a1f9a26ed_0
Works better but the counter "_0" is still added.
How do I make a robust trim that trims the multi-character counter?
In your regex (?<=runid_).*?(?=_*.fastq) there is a small problem. You have written _*, which means zero or more underscores. That makes the underscore optional, so the lookahead can succeed without matching it, and your .*? consumes _0 as well, which is why _0 shows up in your result. I think you meant _.* and you should also escape the . just before fastq, because an unescaped . matches any character. Since the counter is always one to four digits, it can be matched explicitly, so your updated regex becomes:
(?<=runid_).+(?=_\d{1,4}\.fas)
Your updated python code,
import re
text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"
print(text)
substring = re.search(r'(?<=runid_).+(?=_\d{1,4}\.fas)', text).group(0)
print(substring)
Prints,
0dc971f49c42ffb1412caee485f8421a1f9a26ed
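To see the difference across the whole counter range, here is a small sketch that runs the original lookahead and the corrected one against a few counters. The sample counters and the names loose, strict and results are only illustrative choices, not part of the question.

```python
import re

RUN_ID = "0dc971f49c42ffb1412caee485f8421a1f9a26ed"

# Original lookahead: `_*` may match zero underscores and the unescaped `.`
# matches any character, so the lazy `.*?` runs on until the real dot.
loose = re.compile(r"(?<=runid_).*?(?=_*.fastq)")
# Corrected lookahead: a literal underscore, 1 to 4 digits, a literal dot.
strict = re.compile(r"(?<=runid_).+(?=_\d{1,4}\.fas)")

results = {}
for counter in (0, 42, 999, 9999):  # illustrative sample of the counter range
    name = f"fastq_runid_{RUN_ID}_{counter}.fastq"
    results[counter] = (loose.search(name).group(0),
                        strict.search(name).group(0))
    print(counter, results[counter])
```

For every counter the first value still carries the _N suffix, while the second is the bare run id.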
Alternatively, you can drop the lookbehind and capture the text in the first group instead, using this regex:
runid_([^_]+)(?=_\d{1,4}\.fas)
Your Python code, taking the text from group(1) instead of group(0):
import re
text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"
print(text)
substring = re.search(r'runid_([^_]+)(?=_\d{1,4}\.fas)', text).group(1)
print(substring)
In this case too it prints,
0dc971f49c42ffb1412caee485f8421a1f9a26ed
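Note that re.search returns None when nothing matches, so calling .group() directly raises AttributeError on an unexpected file name. A minimal sketch of a guarded version follows; the helper name extract_run_id, the $ anchor and the full .fastq suffix are assumptions added for illustration.

```python
import re
from typing import Optional

# Compiled once; the group captures everything between "runid_" and the counter.
# Anchoring at the end with `$` and requiring ".fastq" are assumptions.
_RUN_ID = re.compile(r"runid_([^_]+)_\d{1,4}\.fastq$")


def extract_run_id(filename: str) -> Optional[str]:
    """Return the run id, or None when the name does not fit the pattern."""
    match = _RUN_ID.search(filename)
    return match.group(1) if match else None


# A name with a counter, then one without a counter.
print(extract_run_id("fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"))
print(extract_run_id("fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed.fastq"))
```

The first call prints the run id and the second prints None instead of raising.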
You don't need lookbehind or lookahead to achieve that. \d{1,4} means a minimum of 1 and a maximum of 4 digits; otherwise it won't match.
fastq_runid_(.+)_\d{1,4}\.fastq
import re
text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_999.fastq"
print(text)
substring = re.search(r'fastq_runid_(\w+)_(\d+)\.fastq', text)
print(substring.group(1), substring.group(2))
group(1) will give what you want; group(2) will give the counter.
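If the numbered groups become hard to track, named groups are one option. The following is only a sketch: the function name parse_name and the group names run_id and counter are made up for illustration, and using fullmatch assumes the whole file name follows this layout.

```python
import re

# Named groups make the two captured parts self-describing.
PATTERN = re.compile(r"fastq_runid_(?P<run_id>\w+)_(?P<counter>\d+)\.fastq")


def parse_name(name):
    """Split a file name into (run_id, counter); raise ValueError if it does not fit."""
    match = PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    # The counter is converted to int so it can be compared or sorted numerically.
    return match.group("run_id"), int(match.group("counter"))


run_id, counter = parse_name(
    "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_999.fastq"
)
print(run_id, counter)
```

Because \w also matches underscores, the greedy \w+ first swallows the counter and then backtracks until the last underscore followed by digits, so the counter is always taken from the end of the name.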