I have a file with the following lines:
lines.txt
1. robert
smith
2. harry
3. john
I want to get an array as follows:
["robert\nsmith","harry","john"]
I tried something like this:
with open('lines.txt') as fh:
    m = [re.match(r"^\d+\.(.*)", line) for line in fh.readlines()]
    print(m)
    for i in m:
        print(i.groups())
It outputs the following:
[<_sre.SRE_Match object; span=(0, 9), match='1. robert'>, None, <_sre.SRE_Match object; span=(0, 8), match='2. harry'>, <_sre.SRE_Match object; span=(0, 7), match='3. john'>]
(' robert',)
Traceback (most recent call last):
  File "D:\workspaces\workspace6\PdfGenerator\PdfGenerator.py", line 5, in <module>
    print(i.groups())
AttributeError: 'NoneType' object has no attribute 'groups'
It seems that I am approaching this problem in a very wrong way. How would you solve this?
You can read the whole file into memory and use
r'(?ms)^\d+\.\s*(.*?)(?=^\d+\.|\Z)'
See the regex demo
Details
(?ms) - enable the re.MULTILINE and re.DOTALL modes
^ - start of a line
\d+ - 1+ digits
\. - a dot
\s* - 0+ whitespace characters
(.*?) - Group 1 (this is what re.findall returns here): any 0+ chars, as few as possible
(?=^\d+\.|\Z) - up to (but not including) the first occurrence of either:
  ^\d+\. - start of a line, 1+ digits and a dot
  | - or
  \Z - end of string.
Python:
import re

with open('lines.txt') as fh:
    print(re.findall(r'(?ms)^\d+\.\s*(.*?)(?=^\d+\.|\Z)', fh.read()))
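The lazy group in this pattern runs right up to the next numbered line, so each captured item keeps its trailing newline. A minimal sketch of how that could be cleaned up, assuming an in-memory string (the `text` sample below is hypothetical and stands in for the contents of lines.txt):

```python
import re

# Hypothetical sample standing in for the contents of lines.txt.
text = "1. robert\nsmith\n2. harry\n3. john\n"

pattern = re.compile(r'(?ms)^\d+\.\s*(.*?)(?=^\d+\.|\Z)')

# Raw captures still end with the newline that precedes the next entry.
raw = pattern.findall(text)

# Strip surrounding whitespace to get the array asked for in the question.
names = [item.strip() for item in raw]

print(raw)    # ['robert\nsmith\n', 'harry\n', 'john\n']
print(names)  # ['robert\nsmith', 'harry', 'john']
```

Stripping after the match keeps the pattern itself unchanged, which may be easier to read than adding more whitespace handling to the lookahead.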
Use re.findall to capture everything from a \d+\.\s+ prefix up to the next \n\d, or up to the end of the text:
>>> import re
>>> re.findall(r'\d+\.\s+(.*?(?=\n\d|$))', text, flags=re.DOTALL)
['robert\n smith', 'harry', 'john']
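The session above does not show where `text` comes from, and its output keeps the space before "smith". A small sketch under the assumption that the continuation line in the file is indented by one space (the `text` value here is a hypothetical sample), with one possible way to drop that indentation afterwards:

```python
import re

# Hypothetical file contents; the continuation line is assumed to be indented.
text = "1. robert\n smith\n2. harry\n3. john\n"

found = re.findall(r'\d+\.\s+(.*?(?=\n\d|$))', text, flags=re.DOTALL)

# Strip each physical line of an entry, then rejoin with newlines.
cleaned = ['\n'.join(part.strip() for part in item.splitlines())
           for item in found]

print(found)    # ['robert\n smith', 'harry', 'john']
print(cleaned)  # ['robert\nsmith', 'harry', 'john']
```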
You can use re.split.
Regex: \n?\d+\.\s*
Details:
\n? - an optional newline (matched zero or one time, if a newline is present)
\d+ - a digit, one or more times
\. - a literal dot
\s* - any whitespace character (equal to [\r\n\t\f\v ]), zero or more times
Python code:
re.split(r'\n?\d+\.\s*', lines)[1:]
The [1:] slice removes the first item, because it is an empty string.
Output:
['robert\n smith', 'harry', 'john']
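To make the role of the leading empty string visible, here is a short self-contained sketch. The `lines` value is a hypothetical sample with an indented continuation line, and the `.strip()` call is an added assumption so that a trailing newline in the file does not end up in the last item:

```python
import re

# Hypothetical file contents, as they would be returned by fh.read().
lines = "1. robert\n smith\n2. harry\n3. john\n"

# Strip first so the final entry does not keep the file's trailing newline.
parts = re.split(r'\n?\d+\.\s*', lines.strip())

# The text before the first "1. " separator is empty, hence the [1:] slice.
names = parts[1:]

print(parts)  # ['', 'robert\n smith', 'harry', 'john']
print(names)  # ['robert\n smith', 'harry', 'john']
```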
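I propose a solution that gathers only the names, without the unnecessary spaces in the middle of multi-line names that some other solutions leave in.
The idea is: label each line with the most recent number seen, group consecutive lines that share the same label, and join the name segments of each group with a newline.
List comprehensions make it possible to write the program quite concisely. See below: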
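import re, itertools

def getPair(line):
    # Return (group label, name segment) for one line of the file.
    global grp
    # \s* tolerates continuation lines with or without leading whitespace.
    nr, nameSegm = re.match(r'^(\d+\.)?\s*(\w+)$', line).groups()
    if nr:  # Number present
        grp = nr
    return grp, nameSegm

grp = ''  # Group label (number)
with open('lines.txt') as fh:
    lst = [getPair(line) for line in fh.readlines()]
# Join the name segments of consecutive lines that share a label.
res = ['\n'.join([t[1] for t in g])
       for _, g in itertools.groupby(lst, lambda x: x[0])]
print(f"Result: {res}")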
To sum up, the program is a bit longer than the others, but it gives only the names, without additional spaces.
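The same grouping idea can also be sketched without a global variable, by appending to the most recent entry whenever a line carries no number. This is only an illustrative variant: the `group_names` helper and the `sample` list are hypothetical, and skipping lines that do not match is an assumption rather than something the question specifies.

```python
import re

def group_names(lines):
    """Collect name segments, starting a new entry at each numbered line."""
    result = []
    for line in lines:
        m = re.match(r'^\s*(\d+\.)?\s*(\w+)\s*$', line)
        if m is None:
            continue  # assumption: blank or unexpected lines are skipped
        number, segment = m.groups()
        if number or not result:
            result.append([segment])    # a numbered line opens a new entry
        else:
            result[-1].append(segment)  # continuation of the previous entry
    return ['\n'.join(parts) for parts in result]

# Hypothetical lines, as produced by iterating over the open file.
sample = ["1. robert\n", " smith\n", "2. harry\n", "3. john\n"]
names = group_names(sample)
print(names)  # ['robert\nsmith', 'harry', 'john']
```

Checking the match for None before calling groups() also avoids the AttributeError shown in the question.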