Extract word from string using regex word boundaries in Python

Suppose I have a file name like this and I want to extract part of it as a string in Python:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile('\b_[A-Z]{2}\b')
print(re.findall(rgx, fn))

Expected output: ['DE'], but the actual output is [].

You could use

(?<=_)[A-Z]+(?=_)

This makes use of lookarounds on both sides; see a demo on regex101.com. For tighter results, you'd need to specify more sample inputs though.
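As a rough sketch, the lookaround pattern can be run against the file name from the question as shown here. For comparison, the block also runs the question's original pattern, both as written (where '\b' in a non-raw string is a backspace character) and as a raw string (where \b is a word boundary, but never matches before _ here because _ is itself a word character and every _ in this name follows a letter or digit). The variable names are illustrative only.

```python
import re

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"

# Lookarounds assert the surrounding underscores without consuming them.
lookaround_hits = re.findall(r'(?<=_)[A-Z]+(?=_)', fn)
print(lookaround_hits)  # ['DE']

# Non-raw string: Python turns '\b' into a backspace character (\x08).
backspace_hits = re.findall('\b_[A-Z]{2}\b', fn)
print(backspace_hits)  # []

# Raw string: \b is a word boundary, but there is no boundary between
# a letter or digit and "_" because "_" counts as a word character.
raw_boundary_hits = re.findall(r'\b_[A-Z]{2}\b', fn)
print(raw_boundary_hits)  # []
```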

Use _([A-Z]{2})

Ex:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile('_([A-Z]{2})')
print(rgx.findall(fn))  # The compiled pattern has its own findall method.

Output:

['DE']
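One thing to keep in mind: this pattern does not require anything after the two letters, so it may also pick up the first two letters of a longer uppercase segment. The sketch below illustrates this with a hypothetical file name (not from the question) and shows one possible way to tighten the pattern with a lookahead.

```python
import re

# Hypothetical file name, made up for illustration only.
fn_hypothetical = "report_DE_DUPLICATES.xlsx"

# Without a trailing condition, "_DU" from "_DUPLICATES" also matches.
loose = re.findall(r'_([A-Z]{2})', fn_hypothetical)
print(loose)  # ['DE', 'DU']

# Requiring a following underscore keeps only complete two-letter segments.
strict = re.findall(r'_([A-Z]{2})(?=_)', fn_hypothetical)
print(strict)  # ['DE']
```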

Your desired output seems to be DE, which is bounded by an _ on both the left and the right. This expression might also work:

# -*- coding: UTF-8 -*-
import re

string = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
expression = r'_([A-Z]+)_'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else: 
    print('🙀 Sorry! No matches!')

Output

YAAAY! "DE" is a match 💚💚💚

Or you can use a {2} quantifier if you want exactly two letters:

# -*- coding: UTF-8 -*-
import re

string = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
expression = r'_([A-Z]{2})_'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else: 
    print('🙀 Sorry! No matches!')
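To make the difference between the two quantifiers concrete, here is a small sketch that runs both expressions on the original file name and on a hypothetical variant with a three-letter code (the second name and the helper function are assumptions for illustration).

```python
import re

def extract(expression, string):
    """Return the first captured group, or None when nothing matches."""
    match = re.search(expression, string)
    return match.group(1) if match else None

original = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
hypothetical = "DC_QnA_bo_v.15.12.3_DEU_duplicates.xlsx"  # made-up variant

# "+" accepts any run of uppercase letters between underscores.
print(extract(r'_([A-Z]+)_', original))        # DE
print(extract(r'_([A-Z]+)_', hypothetical))    # DEU

# "{2}" accepts exactly two uppercase letters between underscores.
print(extract(r'_([A-Z]{2})_', original))      # DE
print(extract(r'_([A-Z]{2})_', hypothetical))  # None
```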

Try pattern: \\_([^\\_]+)\\_[^\\_\\.]+\\.xlsx

Explanation:

\\_ - match _ literally

[^\\_]+ - negated character class with the + quantifier: match one or more characters other than _

[^\\_\\.]+ - same as above, but this time match characters other than _ and .

\\.xlsx - match .xlsx literally

The idea is to match the last _something_ segment before the .xlsx extension.
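A minimal Python sketch of this idea follows. It assumes the doubled backslashes in the pattern above are string-literal escaping, so the raw-string form drops them; the underscores and the dot inside the character class also do not need escaping in Python's re module.

```python
import re

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"

# _([^_]+)_  captures the segment between two underscores,
# [^_.]+     matches the final segment, which has no "_" or ".",
# \.xlsx     matches the extension literally.
pattern = re.compile(r'_([^_]+)_[^_.]+\.xlsx')

match = pattern.search(fn)
segment = match.group(1) if match else None
print(segment)  # DE
```

Because the final segment may not contain a dot, earlier candidates such as v.15.12.3 cannot serve as the last part, which is what pins the capture to the segment just before it.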

You could use a regular expression (the re module) for that, as already shown; however, this can also be done without any imports, in the following way:

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
out = [i for i in fn.split('_')[1:] if len(i)==2 and i.isalpha() and i.isupper()]
print(out) # ['DE']

Explanation: I split fn at _, discard the first element, and filter the remaining elements so that only strs of length 2 that consist entirely of uppercase letters remain.

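Another re solution ({1,} means "one or more", the same as +):

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile(r'_([A-Z]{1,})_')
print(rgx.findall(fn))  # ['DE']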

