
How do I do Python re.search substrings with multi-character wildcard?

I'm trying to extract a substring from a string in Python. The front end to be trimmed is static and easy to implement, but the rear end has a counter that can run from "_0" to "_9999".

With my current code, the counter still gets included in the substring.

import re

text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"

print(text)
substring= re.search('runid_(.*)_*.fas', text).group(0)

print(substring)

Returns

runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fas

Alternatively,

substring= re.search(r"(?<=runid_).*?(?=_*.fastq)", text).group(0)

returns

0dc971f49c42ffb1412caee485f8421a1f9a26ed_0

This works better, but the counter "_0" is still included.

How do I make a robust trim that trims the multi-character counter?

Your regex (?<=runid_).*?(?=_*.fastq) has a small problem. You wrote _*, which means zero or more underscores. That makes the underscore optional, so the lookahead can succeed without matching it, and the lazy .*? ends up consuming _0 as well, which is why _0 appears in your result. I think you meant _.*, and you should also escape the . just before fastq. Matching the counter explicitly as one to four digits, the updated regex becomes:

(?<=runid_).+(?=_\d{1,4}\.fas)


Your updated Python code:

import re

text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"

print(text)
substring = re.search(r'(?<=runid_).+(?=_\d{1,4}\.fas)', text).group(0)

print(substring)

Prints,

0dc971f49c42ffb1412caee485f8421a1f9a26ed
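Since the counter can run from "_0" to "_9999", it may help to check the pattern against several counters and against a name that does not match at all. The following is a minimal sketch, not part of the original answer: the helper name extract_run_id is hypothetical, and returning None on a failed match is an assumption about the desired behaviour (calling .group(0) directly on a failed re.search raises AttributeError).

```python
import re

# Lookaround pattern from the answer, compiled once as a raw string.
RUN_ID = re.compile(r"(?<=runid_).+(?=_\d{1,4}\.fas)")

def extract_run_id(filename):
    """Return the run id without the trailing counter, or None if no match."""
    match = RUN_ID.search(filename)
    return match.group(0) if match else None

run_id = "0dc971f49c42ffb1412caee485f8421a1f9a26ed"
for counter in (0, 42, 9999):
    name = f"fastq_runid_{run_id}_{counter}.fastq"
    print(extract_run_id(name))  # the run id, without the counter

print(extract_run_id("unrelated.txt"))  # None
```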

Alternatively, you can drop the lookbehind and capture the text in the first group with this simpler regex:

runid_([^_]+)(?=_\d{1,4}\.fas)


Your Python code, taking the text from group(1) instead of group(0):

import re

text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"

print(text)
substring = re.search(r'runid_([^_]+)(?=_\d{1,4}\.fas)', text).group(1)

print(substring)

In this case too it prints,

0dc971f49c42ffb1412caee485f8421a1f9a26ed
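One design point worth noting: [^_]+ assumes the run id itself never contains an underscore. The question does not say whether that holds, so the file name below is a hypothetical example. This sketch compares [^_]+ with a greedy .+ on such a name.

```python
import re

# Hypothetical file name whose run id itself contains an underscore.
name = "fastq_runid_abc_def_7.fastq"

# [^_]+ cannot cross the underscore inside the id, so nothing matches.
strict = re.search(r"runid_([^_]+)(?=_\d{1,4}\.fas)", name)

# .+ is greedy and backtracks to the last "_<digits>.fas", so the id survives.
loose = re.search(r"runid_(.+)(?=_\d{1,4}\.fas)", name)

print(strict)          # None
print(loose.group(1))  # abc_def
```

If the ids are always plain hex strings, as in the question, both patterns give the same result.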

You don't need look behind and look ahead to achieve that.

\d{1,4} means a minimum of 1 and a maximum of 4 digits; otherwise it won't match.

fastq_runid_(.+)_\d{1,4}\.fastq

https://regex101.com/r/VneElM/1

import re

text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_999.fastq"

print(text)
substring = re.search(r'fastq_runid_(\w+)_(\d+)\.fastq', text)

print(substring.group(1), substring.group(2))

group(1) will give what you want, group(2) will give the counter.
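If both parts are needed later, named groups can make the code easier to read. This is a sketch under stated assumptions rather than part of the answer: the function name split_name and the group names run_id and counter are hypothetical, the counter is limited to \d{1,4} to mirror the "_0" to "_9999" range from the question, and re.fullmatch is used instead of re.search so that names with extra text around them are rejected.

```python
import re

# Named groups: run_id is the hash, counter is the 1-4 digit suffix.
PATTERN = re.compile(r"fastq_runid_(?P<run_id>\w+)_(?P<counter>\d{1,4})\.fastq")

def split_name(filename):
    """Return (run_id, counter) or None if the name does not fit the pattern."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        return None
    # Convert the counter to int so it can be compared or sorted numerically.
    return match.group("run_id"), int(match.group("counter"))

text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_999.fastq"
print(split_name(text))
# ('0dc971f49c42ffb1412caee485f8421a1f9a26ed', 999)
```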
