
Python - Extract strings from a text file up to the first two blank lines

I have an input file from which I have to extract a few sections that are separated by blank lines.

For example, the text file looks like this:

1. Sometext

2. Sometext

3. Sometext

Sometext which is not needed
Sometext which is not needed
Sometext which is not needed

I have to extract a substring starting from "1." up to just before "2.", a second substring from "2." up to just before "3.", and so on, based on the numbers. The script below produces the output, but it also picks up all of the "Sometext which is not needed" lines, which I do not want. Please see the code below:

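with open("filename", "r") as f:
    content = f.read()

size = len(content)
start = 0
end = 0
a = 1  # number that starts the current section
b = 2  # number that starts the next section

while start < size:
    if end == -1:
        break  # no further numbered section was found
    # When "\n<b>." is missing, find() returns -1, so the slice runs almost
    # to the end of the file and the unwanted trailing text is printed too.
    subString = content[content.find(str(a) + "."):content.find("\n" + str(b) + ".")]
    print(subString)
    print("\n")
    end = content.find(str(b) + ".", start)
    a += 1  # increment to find the next start number
    b += 1  # increment to find the next end number
    start = end + 1  # continue searching after the last match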

So I decided to look for two consecutive blank lines as the end position and used the line below, but that didn't work.

subString = content[content.find (str(a)+".")+3:content.find("\n\n")]

Please help and let me know if you have any questions. Thank you in advance.

I'm not sure if I understood your question correctly, but here is the code that will output:

['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']

based on the text in your question. If instead you want 1 to 2 to be a whole substring like this:

['1. Sometext\nSometext\nSometext']
['2. Sometext\nSometext\nSometext']
['3. Sometext\nSometext\nSometext']

You should change the if statement to:

if is_number(i[0]):
    substring = []

Otherwise, you can use the code below:

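def is_number(string):
    # True if the string can be parsed as a number, e.g. a leading digit
    try:
        float(string)
        return True
    except ValueError:
        return False

with open('testing.txt', 'r') as f:
    # Sections are separated by one blank line, i.e. two newline characters
    content = f.read().split('\n\n')

for i in content:
    if is_number(i[0]):
        c = i.split('\n')
        # Drop the leading "N. " prefix from numbered lines
        substring = [line[3:] if is_number(line[0]) else line for line in c]
        print(substring)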
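
As a self-contained variant of the split-on-blank-lines idea above, here is a minimal sketch that works on an in-memory string so it can be run without testing.txt. The sample text and the helper name numbered_sections are assumptions made for illustration; it uses str.isdigit instead of is_number and stops at the first chunk that does not start with a digit.

```python
# Hypothetical sample that mirrors the layout described in the question.
SAMPLE = (
    "1. Sometext\nSometext\nSometext\n\n"
    "2. Sometext\nSometext\nSometext\n\n"
    "3. Sometext\nSometext\nSometext\n\n"
    "Sometext which is not needed\n"
    "Sometext which is not needed\n"
)


def numbered_sections(text):
    """Return the blank-line-separated chunks that start with a digit."""
    sections = []
    for chunk in text.split("\n\n"):
        chunk = chunk.strip("\n")
        if not chunk:
            continue  # extra blank lines produce empty chunks
        if not chunk[0].isdigit():
            break  # first unnumbered chunk: everything after is unwanted
        sections.append(chunk)
    return sections


for section in numbered_sections(SAMPLE):
    print(repr(section))
```

Each returned item is one whole section as a single string, which is closer to the substrings asked for in the question than a list of lines.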

You will have to filter out the unwanted lines at the end, but this will get you what you want:

from itertools import groupby
with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print([list(v) for k,v in grps if k])


[['1. Sometext\n', 'Sometext\n', 'Sometext\n'], ['2. Sometext\n', 'Sometext\n', 'Sometext\n'], ['3. Sometext\n', 'Sometext\n', 'Sometext\n'], ['Sometext which is not needed\n', 'Sometext which is not needed\n', 'Sometext which is not needed']]
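
To see why this works, it may help to look at the keys that groupby produces. The sketch below uses a small made-up list of lines (an assumption, not the real in.txt) and keeps the key next to each group: runs of non-blank lines get the key True, and runs of blank lines, however many, collapse into a single False group.

```python
from itertools import groupby

# Hypothetical lines standing in for the file contents.
lines = ["1. a\n", "b\n", "\n", "2. c\n", "\n", "\n", "tail\n"]

# bool(x.strip()) is False for blank lines and True for everything else.
pairs = [(k, list(v)) for k, v in groupby(lines, key=lambda x: bool(x.strip()))]

for key, group in pairs:
    print(key, group)
```

Filtering with "if k" then simply discards the blank runs and keeps the sections.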

Since all the sections you want to keep start with a digit:

from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print (list(takewhile(lambda x: x[0][0].isdigit(),(list(v) for k,v in grps if k))))


[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
['3. Sometext\n', 'Sometext\n', 'Sometext\n']]
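
If single strings are wanted rather than lists of lines, each group can be joined. This is a minimal sketch under the same assumption that every wanted section starts with a digit; io.StringIO stands in for the open file, and extract_sections and the sample text are hypothetical names and data used only for illustration.

```python
import io
from itertools import groupby, takewhile


def extract_sections(lines):
    """Join each leading digit-prefixed group of lines into one string."""
    groups = (
        list(v)
        for k, v in groupby(lines, key=lambda x: bool(x.strip()))
        if k
    )
    # Stop at the first group whose first line does not start with a digit.
    wanted = takewhile(lambda g: g[0][0].isdigit(), groups)
    return ["".join(g).rstrip("\n") for g in wanted]


sample = io.StringIO(
    "1. Sometext\nSometext\n\n"
    "2. Sometext\nSometext\n\n"
    "Sometext which is not needed\n"
)
sections = extract_sections(sample)
print(sections)
```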

If you know there are n groups you could slice:

from itertools import groupby, islice
with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print (list(islice((list(v) for k,v in grps if k),3)))


[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['2. Sometext\n', 'Sometext\n', 'Sometext\n'], 
['3. Sometext\n', 'Sometext\n', 'Sometext\n']]
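
The slicing idea can also be wrapped in a small function that takes the number of groups as a parameter. This is only a sketch: first_n_groups and the sample lines are hypothetical, and any iterable of lines (including an open file) should work in place of the list.

```python
from itertools import groupby, islice


def first_n_groups(lines, n):
    """Return the first n blank-line-separated groups of lines."""
    groups = (
        list(v)
        for k, v in groupby(lines, key=lambda x: bool(x.strip()))
        if k
    )
    return list(islice(groups, n))


lines = ["1. a\n", "\n", "2. b\n", "\n", "3. c\n", "\n", "not needed\n"]
result = first_n_groups(lines, 3)
print(result)
```

Because islice stops early, lines after the n-th group are never grouped; if fewer than n groups exist, all of them are returned.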
