简体   繁体   中英

Python - Extract string from a text file until the first 2 new line space

I have an input file from while I have to extract few lines based on 2 blank new lines.

For eg: The text file is like below.

1. Sometext
Sometext 
Sometext

2. Sometext
Sometext
Sometext

3. Sometext
Sometext
Sometext

Sometext which is not needed
Sometext which is not needed
Sometext which is not needed

I have to extract a substring stating from "1." to all before "2." and the 2nd substring from "2." to all before "3." and so on based on the numbers. I have the below script which gets the output but it also gets all of "Sometext which is not needed" which i do not want. Please see the code below:

file_path = open("filename", "r")
content = file_path.read()
size1 = len(content)
start =0
a=1
b=2
end =0
ext =0   

while (start<size):
   if (end !=-1):
   subString = content[content.find(str(a)+".")+0:content.find("\n"+str(b)+".")] 
   print (subString)
   end = content.find(str(b)+".",start)
                print ("\n")
                a = int(a)+1 # increment to find the next start number
                b = int(b)+1 # increment to find the next end number
                start = end+1 # continuing to search the next
            else:
                break

So, I decided to find for 2 continuous blank lines for the end position and used the below one but that didnt work.

subString = content[content.find (str(a)+".")+3:content.find("\n\n")]

Please help and let me know if you have any questions. Thank you in advance.

I'm not sure if I understood your question correctly, but here is the code that will output:

['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']

based on the text in your question. If instead you want 1 to 2 to be a whole substring like this:

['1. Sometext\nSometext\nSometext']
['2. Sometext\nSometext\nSometext']
['3. Sometext\nSometext\nSometext']

You should change the if statement to:

if is_number(i[0]):
            substring = []
            substring.append(i)
            print(substring)

else you can use the code below

def is_number(string):
    try:
        float(string)
        return True
    except ValueError:
        return False

with open('testing.txt', 'r') as f:
content = f.read().split('\n\n')
for i in content:
    if is_number(i[0]):
        c = i.split('\n')
        substring = [line[3:] if is_number(line[0]) else line for line in c]
        print(substring)

You will have to filter the unwanted lines at the end but this will get you want you want:

from itertools import groupby
with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print([list(v) for k,v in grps if k])

Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'], ['2. Sometext\n', 'Sometext\n', 'Sometext\n'], ['3. Sometext\n', 'Sometext\n', 'Sometext\n'], ['Sometext which is not needed\n', 'Sometext which is not needed\n', 'Sometext which is not needed']]

Since all the sections you want to keep start with a digit:

from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print (list(takewhile(lambda x: x[0][0].isdigit(),(list(v) for k,v in grps if k))))

Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
['3. Sometext\n', 'Sometext\n', 'Sometext\n']]

If you know there are n groups you could slice:

from itertools import groupby, islice
with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print (list(islice((list(v) for k,v in grps if k),3)))

Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['2. Sometext\n', 'Sometext\n', 'Sometext\n'], 
['3. Sometext\n', 'Sometext\n', 'Sometext\n']]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM