
Extract multiline text between two strings using Python

I have a text file that looks like the dummy file below:

Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and
some random characters and then start of my data
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
end of my data
scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised
in the 1960s with the release of Letraset 
when an unknown printer took a galley of type and
some random characters and then start of my data
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
end of my data
sheets containing Lorem Ipsum passages,
and more recently with desktop publishing
when an unknown printer took a galley of type and
some random characters and then start of my data
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
end of my data
software like Aldus PageMaker including
versions of Lorem Ipsum.

I want to extract the data between "start of my data" and "end of my data" and save it in a list variable. This data occurs multiple times in the text file. I tried the code below:

import re
import sys
s=[]
with open('mytextfile.txt','r') as file:
    mystring = file.read()
    myre = re.compile(r"start of my data(.*?)end of my data", re.DOTALL)
    parts = myre.findall(mystring)
    s.append(parts)

This code saves all the matched strings together at the first index of the list, but I need each block of data at its own index. How can I achieve this?

With s.append(parts) you append the whole list parts as a single element of the list s, which is why s ends up with only one element (itself a list of 3 strings). Use s.extend(parts) instead if you want to add the 3 elements of parts to s individually.
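The difference between the two calls can be seen in a minimal sketch. The sample string here is a made-up stand-in for the contents of mytextfile.txt, and the names sample, appended and extended are only illustrative:

```python
import re

# Hypothetical sample text standing in for the file contents.
sample = (
    "intro start of my data\n"
    "a\n"
    "b\n"
    "end of my data\n"
    "middle start of my data\n"
    "c\n"
    "end of my data\n"
)

pattern = re.compile(r"start of my data(.*?)end of my data", re.DOTALL)
parts = pattern.findall(sample)  # list with one string per match

appended = []
appended.append(parts)  # adds the list itself as a single element

extended = []
extended.extend(parts)  # adds each matched string as its own element

print(len(appended))  # 1
print(len(extended))  # 2
```

Since findall already returns a list with one entry per match, assigning s = parts directly would work just as well when s starts out empty.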

If you want every data line as its own list element, split each captured group on \n:

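import re

s = []
mystring = """
paste your string here
"""
# Non-greedy match; re.DOTALL lets "." span newlines
myre = re.compile(r"start of my data(.*?)end of my data", re.DOTALL)
parts = myre.findall(mystring)  # one string per matched block
for part in parts:
    # Add every line of the block to s as a separate element
    s.extend(part.split("\n"))
print(len(s))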

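The result for the example data provided is 24: each of the three blocks yields six data lines plus an empty string at each end, because the captured text starts and ends with a newline.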
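If the empty strings are unwanted, or the blocks should stay separate, one possible variation is to strip each captured group and split it with splitlines. This is only a sketch: extract_blocks is a hypothetical helper name, and the sample text is made up for illustration.

```python
import re

def extract_blocks(text, start="start of my data", end="end of my data"):
    """Return one list of lines per block found between start and end."""
    # re.escape keeps the markers literal even if they contain regex characters
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    # strip() drops the newline at each end of the capture before splitting
    return [match.strip().splitlines() for match in pattern.findall(text)]

# Hypothetical sample text standing in for the file contents.
sample = (
    "intro start of my data\n"
    "a\n"
    "b\n"
    "end of my data\n"
    "middle start of my data\n"
    "c\n"
    "end of my data\n"
)

blocks = extract_blocks(sample)
print(blocks)  # [['a', 'b'], ['c']]

# Flatten into a single list of lines if block boundaries do not matter
lines = [line for block in blocks for line in block]
print(lines)  # ['a', 'b', 'c']
```

Keeping a list of lists preserves which lines belong to which block; the flattened form matches the shape produced by the answer above, minus the empty strings.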
