
Fetching content from a numbered list in a file with Python regular expressions

I have a file with the following lines:

lines.txt

1. robert
   smith
2. harry
3. john

I want to get an array like this:

["robert\nsmith","harry","john"]

I tried something like this:

with open('lines.txt') as fh:
    m = [re.match(r"^\d+\.(.*)",line) for line in fh.readlines()]
    print(m)
    for i in m:
        print(i.groups())

It outputs the following:

[<_sre.SRE_Match object; span=(0, 9), match='1. robert'>, None, <_sre.SRE_Match object; span=(0, 8), match='2. harry'>, <_sre.SRE_Match object; span=(0, 7), match='3. john'>]
(' robert',)
Traceback (most recent call last):
  File "D:\workspaces\workspace6\PdfGenerator\PdfGenerator.py", line 5, in <module>
    print(i.groups())
AttributeError: 'NoneType' object has no attribute 'groups'

It seems that I am approaching this problem in a very wrong way. How would you solve this?

You can read the whole file into memory and use

r'(?ms)^\d+\.\s*(.*?)(?=^\d+\.|\Z)'

See the regex demo

Details

  • (?ms) - enable re.MULTILINE and re.DOTALL modes
  • ^ - start of a line
  • \d+ - 1+ digits
  • \. - a dot
  • \s* - 0+ whitespaces
  • (.*?) - Group 1 (this is what re.findall returns here): any 0+ chars, as few as possible
  • (?=^\d+\.|\Z) - up to (but not including) the first occurrence of
    • ^\d+\. - start of a line, 1+ digits and a dot
    • | - or
    • \Z - end of string.

Python:

import re

with open('lines.txt') as fh:
    print(re.findall(r'(?ms)^\d+\.\s*(.*?)(?=^\d+\.|\Z)', fh.read()))
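Note that the raw matches keep the indentation of continuation lines and any trailing newline. As a self-contained check, the sketch below runs the same pattern on an in-memory copy of the sample file and then strips each physical line to get the exact list asked for in the question. The SAMPLE string and the parse_numbered helper are illustrative names, not part of the answer above.

```python
import re

# Same pattern as above: from each "N." at the start of a line
# up to the next such marker or the end of the string.
ITEM_RE = re.compile(r'(?ms)^\d+\.\s*(.*?)(?=^\d+\.|\Z)')

# In-memory stand-in for lines.txt (assumed content, copied from the question).
SAMPLE = "1. robert\n   smith\n2. harry\n3. john\n"

def parse_numbered(text):
    """Return one string per numbered item, continuation lines joined by a newline."""
    items = []
    for body in ITEM_RE.findall(text):
        # Strip each physical line so continuation-line indentation is dropped.
        parts = [part.strip() for part in body.splitlines()]
        items.append("\n".join(part for part in parts if part))
    return items

print(parse_numbered(SAMPLE))
```

For the sample above this prints ['robert\nsmith', 'harry', 'john'].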

Use re.findall to match everything from the \d+\.\s+ pattern up to the next '\n\d' pattern or the end of the string:

>>> import re
>>> re.findall(r'\d+\.\s+(.*?(?=\n\d|$))', text, flags=re.DOTALL)
['robert\n   smith', 'harry', 'john']

You can use re.split.

Regex: \n?\d+\.\s*

Details:

  • \n - Newline
  • ? - Zero or one time, so the newline is matched only if it is present
  • \d+ - One or more digits
  • \. - A literal dot
  • \s* - Zero or more whitespace characters (for ASCII text, [ \t\n\r\f\v])

Python code :

re.split(r'\n?\d+\.\s*', lines)[1:]

[1:] removes the first item because it is an empty string.

Output:

['robert\n   smith', 'harry', 'john']
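A minimal, self-contained sketch of the split approach, assuming the file content has already been read into a string (the lines value below is a stand-in copied from the question). It also strips the text first, since otherwise the last item would keep the file's trailing newline.

```python
import re

# Stand-in for the content of lines.txt (assumed; copied from the question).
lines = "1. robert\n   smith\n2. harry\n3. john\n"

# Split on each "N." marker together with the newline that precedes it.
# strip() removes the final newline so it does not end up in the last item.
parts = re.split(r'\n?\d+\.\s*', lines.strip())

# The text before the first marker is an empty string, so drop it.
names = parts[1:]
print(names)
```

Because the pattern is not anchored to the start of a line, any digits followed by a dot inside a name would also act as a separator, so this works best when the names contain no such sequence.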

I propose a solution that collects only the names, without the extra spaces inside multi-line names that some of the other solutions leave in.

The idea is:

  • Save a list of tuples (number, name_segment), "copying" the group number from the previous line if it is absent in the current line. Each pair is prepared by the getPair function.
  • Group these tuples by the number (first element).
  • Join the name segments from each group, using \n as the separator.
  • Save these joined names in a result list.

Using list comprehensions keeps the program fairly concise. See below:

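import re
import itertools

def getPair(line):
  # Return (group number, name segment); reuse the last seen number
  # when the current line has none (a continuation line).
  global grp
  nr, nameSegm = re.match(r'^(\d+\.)?\s+(\w+)$', line).groups()
  if nr:  # Number present
    grp = nr
  return grp, nameSegm

grp = ''    # Group label (number)
with open('lines.txt') as fh:
  lst = [getPair(line) for line in fh.readlines()]
# Group consecutive pairs by number and join each group's segments.
res = ['\n'.join([t[1] for t in g])
  for _, g in itertools.groupby(lst, lambda x: x[0])]
print(f"Result: {res}")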

To sum up, the program is a bit longer than the others, but it gives only the names, without additional spaces.
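The same grouping idea can also be sketched without a global variable or itertools.groupby, by appending each continuation line to the most recent item. This is an alternative sketch rather than the code above: collect_names and the sample list are illustrative names, and skipping blank lines is an added assumption.

```python
import re

# Optional "N." marker, then the rest of the line with surrounding whitespace trimmed.
LINE_RE = re.compile(r'^\s*(\d+\.)?\s*(.*?)\s*$')

def collect_names(lines):
    """Group name segments under the most recent numbered line."""
    names = []
    for line in lines:
        number, segment = LINE_RE.match(line).groups()
        if not segment:
            continue  # blank line (or a bare marker): nothing to add
        if number or not names:
            names.append(segment)          # a new numbered item starts here
        else:
            names[-1] += "\n" + segment    # continuation of the previous item
    return names

# Stand-in for iterating over lines.txt (assumed content, copied from the question).
sample = ["1. robert\n", "   smith\n", "2. harry\n", "3. john\n"]
print(collect_names(sample))
```

With a real file, passing the open file object (for example collect_names(fh)) should behave the same way, since iterating a text file yields one line at a time.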
