Fetching content from numbered list in file with python regular expressions

Question

I have a file with the following lines:

lines.txt

1. robert
   smith
2. harry
3. john

I want to get an array as follows:

["robert\nsmith","harry","john"]

I tried something like this:

with open('lines.txt') as fh:
    m = [re.match(r"^\d+\.(.*)",line) for line in fh.readlines()]
    print(m)
    for i in m:
        print(i.groups())

It outputs the following:

[<_sre.SRE_Match object; span=(0, 9), match='1. robert'>, None, <_sre.SRE_Match object; span=(0, 8), match='2. harry'>, <_sre.SRE_Match object; span=(0, 7), match='3. john'>]
(' robert',)
Traceback (most recent call last):
  File "D:\workspaces\workspace6\PdfGenerator\PdfGenerator.py", line 5, in <module>
    print(i.groups())
AttributeError: 'NoneType' object has no attribute 'groups'

It seems that I am approaching this problem in a very wrong way. How would you solve this?

Answer 1

You may read the whole file into memory and use

r'(?ms)^\d+\.\s*(.*?)(?=^\d+\.|\Z)'

See the regex demo

Details

(?ms) - enable re.MULTILINE and re.DOTALL modes
^ - start of a line
\d+ - 1+ digits
\. - a dot
\s* - 0+ whitespaces
(.*?) - Group 1 (this is what re.findall returns here): any 0+ chars, as few as possible
(?=^\d+\.|\Z) - up to (but not including) the first occurrence of
- ^\d+\. - start of a line, 1+ digits and .
- | - or
- \Z - end of string.

Python:

import re

with open('lines.txt') as fh:
    print(re.findall(r'(?ms)^\d+\.\s*(.*?)(?=^\d+\.|\Z)', fh.read()))
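
As a rough sketch of how this pattern behaves, the snippet below runs it on an in-memory string instead of lines.txt. The sample text and the helper name numbered_items are assumptions made for illustration. Because the lazy group runs up to the start of the next numbered line, each captured item keeps its trailing newline, so the sketch strips it.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can be commented.
ITEM_RE = re.compile(r"""
    ^\d+\.          # start of a line, 1+ digits and a dot
    \s*             # optional whitespace after the dot
    (.*?)           # Group 1: item text, as few chars as possible
    (?=^\d+\.|\Z)   # stop before the next numbered line or at end of string
""", re.MULTILINE | re.DOTALL | re.VERBOSE)


def numbered_items(text):
    """Return the text of each numbered item, without trailing whitespace."""
    return [item.rstrip() for item in ITEM_RE.findall(text)]


# Hypothetical stand-in for the contents of lines.txt.
sample = "1. robert\n   smith\n2. harry\n3. john\n"
print(numbered_items(sample))
```

The indentation of continuation lines is still part of each item here; only the trailing newline is removed.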

Answer 2

Use re.findall to find everything from the \d+\.\s+ pattern up to the next '\n\d' pattern or up to the end of the string

>>> import re
>>> re.findall(r'\d+\.\s+(.*?(?=\n\d|$))', text, flags=re.DOTALL)
['robert\n   smith', 'harry', 'john']
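
One thing to watch with this pattern: the lookahead \n\d stops at any line that begins with a digit, not only at the next list number. A possible variation, assuming a continuation line may itself start with a digit, is to require the dot in the lookahead as well. The input below is made up for illustration.

```python
import re

# Hypothetical input: the continuation line of item 1 starts with a digit.
text = "1. robert\n2nd smith\n2. harry\n3. john"

# Original lookahead: stops at "\n2nd", cutting item 1 short.
loose = re.findall(r'\d+\.\s+(.*?(?=\n\d|$))', text, flags=re.DOTALL)

# Stricter lookahead: only a newline followed by digits and a dot ends an item.
strict = re.findall(r'\d+\.\s+(.*?(?=\n\d+\.|$))', text, flags=re.DOTALL)

print(loose)
print(strict)
```

For the file in the question both patterns give the same result, since its continuation line is indented.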

Answer 3

You can use re.split.

Regex : \n?\d+\.\s*

Details:

\n - Newline
? - Matches between zero and one times, match if 'new line' exists
\d+ - Matches a digit (+) between one and unlimited times
\. - Dot
\s* - Matches any whitespace character (equal to [\r\n\t\f\v ]) (*) between zero and unlimited times

Python code :

re.split(r'\n?\d+\.\s*', lines)[1:]

[1:] removes the first item because it is an empty string

Output:

['robert\n   smith', 'harry', 'john']
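
The question asks for "robert\nsmith" without the indentation, so a small post-processing step may be wanted. The following is a minimal sketch, assuming the file contents are already in a string called lines (the sample value is a stand-in for lines.txt) and that leading and trailing whitespace on each line can be discarded.

```python
import re

# Sample text standing in for the contents of lines.txt.
lines = "1. robert\n   smith\n2. harry\n3. john\n"

# Split on the list numbers; the first piece is the empty string before "1.".
items = re.split(r'\n?\d+\.\s*', lines)[1:]

# Strip every line of an item, then rejoin the lines with a bare newline.
cleaned = ["\n".join(part.strip() for part in item.strip().splitlines())
           for item in items]
print(cleaned)
```

Note that the split pattern matches digits followed by a dot anywhere in the text, so this sketch only suits input where such sequences appear solely as list numbers.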

Answer 4

I propose a solution that gathers only the names, without the unnecessary spaces in the middle of names that some other solutions leave in.

The idea is:

Save a list of tuples (number, name_segment), "copying" the group number from the previous line if it is absent in the current line. The pair to be saved is prepared by the getPair function.
Group these tuples on the number (first element).
Join the name segments from each group, using \n as the separator.
Save these joined names in a result list.

Using list comprehensions makes the program quite concise. See below:

import re, itertools

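def getPair(line):
  global grp  # last group number seen, shared across calls
  # Optional "N." prefix, then whitespace, then a single word
  nr, nameSegm = re.match(r'^(\d+\.)?\s+(\w+)$', line).groups()
  if nr:  # Number present
    grp = nr
  return grp, nameSegm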

grp = ''    # Group label (number)
with open('lines.txt') as fh:
  lst = [getPair(line) for line in fh.readlines()]
res = ['\n'.join([t[1] for t in g])
  for _, g in itertools.groupby(lst, lambda x: x[0])]
print(f"Result: {res}")

To sum up, the program is a bit longer than the others, but it gives only the names, without additional spaces.
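
The same line-by-line idea can also be written without a global variable. This is only a sketch under a few assumptions: each line holds a single word, continuation lines are indented, and blank or unexpected lines are skipped. The helper name collect_names and the sample list are hypothetical.

```python
import re


def collect_names(lines):
    """Group name segments under the most recent numbered line."""
    names = []
    for line in lines:
        m = re.match(r'^(\d+\.)?\s+(\w+)\s*$', line)
        if m is None:
            continue  # skip blank or unexpected lines
        number, segment = m.groups()
        if number or not names:
            names.append(segment)          # a new numbered item starts
        else:
            names[-1] += '\n' + segment    # continuation of the previous item
    return names


# Stand-in for the lines read from lines.txt.
sample = ["1. robert\n", "   smith\n", "2. harry\n", "3. john\n"]
print(collect_names(sample))
```

A file handle can be passed in place of the sample list, since iterating over it yields the same kind of lines.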

4 answers

solution1
1 2018-07-31 12:33:18

solution2
1 2018-07-31 12:39:26

solution3
1 2018-07-31 12:54:02

solution4
0 2018-07-31 16:07:54
