Python Regex is not matching the first line

I have a text file with the following content:

Submitted By,Assigned,Closed
Name1,10,5
Name2,20,10
Name3,30,15

I have written a regex pattern to extract the value between the first and second commas:

^\w+,(\w+),.*$

My Python code is

import re

f=r'sample.txt'
rePat = re.compile(r'^\w+,(\w+),.*$', re.MULTILINE)

text = open(f, 'r').read()
output = re.findall(rePat, text)

print (f)
print (output)

Expected Output:

Assigned
10
20
30

But I am getting

10
20
30

Why is it missing the first line?

The problem is that \w+ matches only one or more word chars (basically letters, digits, underscores and also some diacritics), not spaces. The first field of your header line, Submitted By, contains a space before the first comma, so ^\w+, cannot match that line. I suggest matching any chars between commas with [^,\n]+ instead (the \n here makes sure each match stays within the same line).
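To see the cause in isolation, here is a minimal sketch that applies the original pattern to the header line and to one data line separately. The variable names are only illustrative and are not part of the question's code.

```python
import re

# Original pattern from the question (no MULTILINE needed for single lines).
original = re.compile(r'^\w+,(\w+),.*$')

header = "Submitted By,Assigned,Closed"
row = "Name1,10,5"

# \w+ stops at the space in "Submitted By", so the comma that must follow
# is never found and the header line cannot match.
header_match = original.search(header)
row_match = original.search(row)

print(header_match)        # None
print(row_match.group(1))  # 10
```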

You can use

rePat = re.compile(r'^[^,\n]+,([^,\n]+),.*$', re.MULTILINE)

Or, a bit simplified if you do not need to extract anything else:

rePat = re.compile(r'^[^,\n]+,([^,\n]+)', re.MULTILINE)

Details:

  • ^ - start of a line
  • [^,\n]+ - one or more chars other than , and LF
  • , - a comma
  • ([^,\n]+) - Group 1: one or more chars other than , and LF.

Python demo:

import re
 
text = r"""Submitted By,Assigned,Closed
Name1,10,5
Name2,20,10
Name3,30,15"""
 
rePat = re.compile(r'^[^,\n]+,([^,\n]+),.*$', re.MULTILINE)
output = re.findall(rePat, text)
print(output)
# => ['Assigned', '10', '20', '30']

Alternatively, you could allow optional space-separated words after the first \w+, so that the match extends to the first comma.

^\w+(?: \w+)*,(\w+),.*$
  • ^ Start of a line (because of re.MULTILINE)
  • \w+ Match 1+ word chars
  • (?: \w+)* Optionally repeat matching a space and 1+ word chars
  • ,(\w+), Match a comma and capture 1+ word chars in group 1
  • .*$ Match the rest of the line (you could omit this part)

import re

f = r'sample.txt'
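# Raw string so that \w reaches the regex engine unchanged
rePat = re.compile(r'^\w+(?: \w+)*,(\w+),.*$', re.MULTILINE)

# Close the file automatically after reading
with open(f, 'r') as fh:
    text = fh.read()
output = re.findall(rePat, text)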
print(output)

Output

['Assigned', '10', '20', '30']
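The two suggested patterns give the same result on the sample file, but they are not equivalent in general. The sketch below compares them on hypothetical data (the Mary-Ann row is made up for illustration and does not come from the question): the negated character class accepts any character other than a comma or newline, while the word-based pattern only accepts words separated by single spaces.

```python
import re

negated = re.compile(r'^[^,\n]+,([^,\n]+),.*$', re.MULTILINE)
word_based = re.compile(r'^\w+(?: \w+)*,(\w+),.*$', re.MULTILINE)

# Hypothetical input: the second line has a hyphen in its first field.
sample = "Submitted By,Assigned,Closed\nMary-Ann,40,20\nName1,10,5"

negated_result = negated.findall(sample)
word_result = word_based.findall(sample)

print(negated_result)  # ['Assigned', '40', '10']
print(word_result)     # ['Assigned', '10']
```

Which one fits better depends on the data: the word-based pattern silently skips lines whose first field contains anything other than words and spaces, which may or may not be what you want.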
