Extract word from string using regex word boundaries in Python

Question

Suppose I have a file name like the one below and I want to extract part of it as a string in Python:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile('\b_[A-Z]{2}\b')
print(re.findall(rgx, fn))

Expected output is ['DE'], but the actual output is [].

Answer 1

You could use

(?<=_)[A-Z]+(?=_)

This makes use of lookarounds on both sides; see a demo on regex101.com. For tighter results, you'd need to specify more sample inputs, though.
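To make the lookaround suggestion concrete, here is a minimal sketch that applies it to the file name from the question. It also shows, as far as I can tell, why the original attempt returns nothing: in a non-raw string '\b' is a backspace character, and even in a raw string \b cannot match next to _ here, because _ counts as a word character.

```python
import re

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"

# In a normal (non-raw) string literal, '\b' is the backspace character,
# so the pattern from the question never reaches the regex engine as \b.
assert '\b' == '\x08'

# Even as a raw string, \b does not match between "3" and "_" or between
# "E" and "_", because digits, letters and "_" are all word characters.
with_boundaries = re.findall(r'\b_[A-Z]{2}\b', fn)
print(with_boundaries)

# Lookarounds assert the surrounding underscores without consuming them.
lookaround = re.compile(r'(?<=_)[A-Z]+(?=_)')
with_lookarounds = lookaround.findall(fn)
print(with_lookarounds)
```

The first print shows an empty list and the second shows the extracted code.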

Answer 2

Use _([A-Z]{2})

Ex:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile('_([A-Z]{2})')
print(rgx.findall(fn))           #You can use the compiled pattern to do findall.

Output:

['DE']
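One thing to keep in mind with a pattern that has no trailing delimiter: it will also pick up the first two letters of a longer uppercase run. The sketch below compares it with a variant that requires a closing underscore. The second file name is hypothetical and only there for illustration.

```python
import re

# The second name is a made-up example with a three-letter code.
names = [
    "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx",
    "DC_QnA_bo_v.15.12.3_DEU_duplicates.xlsx",
]

open_ended = re.compile(r'_([A-Z]{2})')   # no delimiter after the two letters
delimited = re.compile(r'_([A-Z]{2})_')   # requires a closing underscore

results = {
    name: (open_ended.findall(name), delimited.findall(name))
    for name in names
}
for name, (loose, strict) in results.items():
    print(name, loose, strict)
```

For the original name both patterns give the same result; for the hypothetical three-letter code, only the open-ended pattern returns a (truncated) match.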

Answer 3

Your desired output seems to be DE, which is bounded by an _ on both the left and the right. This expression might also work:

# -*- coding: UTF-8 -*-
import re

string = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
expression = r'_([A-Z]+)_'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else: 
    print('🙀 Sorry! No matches!')

Output

YAAAY! "DE" is a match 💚💚💚

Or you can add a {2} quantifier, if you want exactly two letters:

# -*- coding: UTF-8 -*-
import re

string = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
expression = r'_([A-Z]{2})_'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else: 
    print('🙀 Sorry! No matches!')

Answer 4

Try pattern: _([^_]+)_[^_.]+\.xlsx (in Python, pass it as a raw string: r'_([^_]+)_[^_.]+\.xlsx')

Explanation:

_ - matches _ literally (it needs no escaping)

([^_]+) - capturing group around a negated character class with the + quantifier: matches one or more characters other than _

[^_.]+ - same as above, but this time matches characters other than _ and .

\.xlsx - matches .xlsx literally

The idea is to match the last _something_ pattern before the .xlsx extension.
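A minimal sketch of this idea in Python follows. The pattern is written as a raw string with single backslashes and with _ left unescaped, which is my assumption about how it is meant to be used with the re module. The names last_token and code are just illustrative.

```python
import re

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"

# _([^_]+)_   : an underscore-delimited token, captured in group 1
# [^_.]+      : the final name part, containing no "_" and no "."
# \.xlsx      : the literal extension
last_token = re.compile(r'_([^_]+)_[^_.]+\.xlsx')

match = last_token.search(fn)
code = match.group(1) if match else None
print(code)
```

Earlier underscore-delimited parts such as QnA are skipped because what follows them is not a single dot-free, underscore-free chunk ending in .xlsx.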

Answer 5

You could use regular expressions (the re module) for that, as already shown; however, this can also be done without any imports, in the following way:

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
out = [i for i in fn.split('_')[1:] if len(i)==2 and i.isalpha() and i.isupper()]
print(out) # ['DE']

Explanation: I split fn at _, discard the first element, and filter the rest so that only strings of length 2 consisting entirely of uppercase letters remain.

Answer 6

Another re solution:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile(r'_([A-Z]{1,})_')  # {1,} is equivalent to +
print(rgx.findall(fn))  # ['DE']
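A caveat when using findall with a pattern that consumes both underscores: two codes that share a single underscore cannot both be found, because the shared underscore is used up by the first match. The sketch below contrasts this with a lookaround-based pattern. The file name is hypothetical and chosen only to show the difference.

```python
import re

# Made-up name with two adjacent uppercase codes sharing one underscore.
name = "report_AB_CD_final.xlsx"

consuming = re.compile(r'_([A-Z]+)_')          # underscores are part of the match
non_consuming = re.compile(r'(?<=_)[A-Z]+(?=_)')  # underscores are only asserted

consumed_result = consuming.findall(name)
lookaround_result = non_consuming.findall(name)
print(consumed_result)
print(lookaround_result)
```

For file names with a single code, as in the question, both patterns behave the same.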

6 answers

solution1
2 ACCEPTED 2019-05-21 06:29:51

solution2
1 2019-05-21 06:29:13

solution3
1 2019-05-21 06:34:06

solution4
1 2019-05-21 06:41:52

solution5
1 2019-05-21 07:11:41

solution6
0 2019-05-21 06:40:47
