Extracting text between two delimiters when the delimiters are in different formats using Python

Question

I'm a new Python programmer (more experience in R) using PyCharm Community Edition 2019.2.4 on a laptop running Windows 10. I'm trying to extract a block of text between two delimiters, which is usually in the following format (the text sits between the delimiters, but on separate lines):

Item 7.
text, text, text, text
text, text, text, text
Item 7A.

The problem I'm experiencing is that Item 7 and Item 7A can come in many different formats due to the initial pre-processing of the text files, for example:

Item 7.  
text 
Item 7A.

or

ITEM 7  
text
ITEM 7A.

or

ITEM 7 
text  
ITEM 7A:

or

Item 
7
text
Item 
7A.

Item 7 and Item 7A can also appear inside larger blocks of text. This is an issue beyond my control.

I've examined 100 text files so far and have written the following code.

import glob
import os
import re

path = filepath  # placeholder: folder containing the .txt files
for filename in glob.glob(os.path.join(path, '*.txt')):
    with open(filename) as f:
        data = f.read()

    # Literal single spaces and exact case only
    x = re.findall(r'Item 7(.*?)Item 7A', data, re.DOTALL)
    x = "".join(x).replace('\n', ' ')
    print(x)

    file = open('C:/R_Practice/dale1.txt', 'w')
    file.write(str(x))

    file.close()

This handles some, but not all, of the cases, and even then it doesn't detect everything. It won't be possible to check the full set of text files by hand, as there will be close to 250,000 in the full study. My questions are as follows.

Is there a "catch-all" pattern which will search for all occurrences of the delimiters, even if parts of the string are on separate lines?
Can each individual text block be written to a separate text file on hard drive, once identified?
Can a logfile be written showing which text files were not processed because the algorithm "missed" the delimiters owing to formatting issues?

Any help would be appreciated.

Answer 1

Instead of a literal space, use \s+ (which matches any run of whitespace, including line breaks) between Item and 7, and add re.IGNORECASE so that ITEM matches as well.
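As a quick check of this idea against the formats listed in the question, here is a minimal sketch. The names ITEM7_RE and extract_item7 are hypothetical. The \b word boundaries and the optional [.:] after the 7 are my own additions, not part of the pattern in this answer: they are assumptions meant to stop "Item 7A" from being taken as the opening delimiter and to drop the punctuation that follows "Item 7".

```python
import re

# \s+ matches any run of whitespace, including newlines, so "Item \n7" is found.
# \b after the 7 keeps "Item 7A" from matching as the opening delimiter.
ITEM7_RE = re.compile(
    r"\bItem\s+7\b[.:]?(.*?)\bItem\s+7A\b",
    re.DOTALL | re.IGNORECASE,
)


def extract_item7(text):
    """Return every block found between Item 7 and Item 7A, stripped."""
    return [block.strip() for block in ITEM7_RE.findall(text)]


# The delimiter formats listed in the question
samples = [
    "Item 7.\ntext, text, text, text\nItem 7A.",
    "ITEM 7  \ntext\nITEM 7A.",
    "ITEM 7 \ntext  \nITEM 7A:",
    "Item \n7\ntext\nItem \n7A.",
]

for sample in samples:
    print(extract_item7(sample))
```

An empty list means the delimiters were not found, which is useful later for deciding which files to log as missed. The question's loop with the \s+ change applied: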

import glob
import os
import re

path = filepath  # placeholder: folder containing the .txt files
for filename in glob.glob(os.path.join(path, '*.txt')):
    with open(filename) as f:
        data = f.read()

    x = re.findall(r'Item\s+7(.*?)Item\s+7A', data, re.DOTALL | re.IGNORECASE)
    #            here ___^^^   and ___^^^
    x = "".join(x).replace('\n', ' ')
    print(x)

    # Note: the same output file is overwritten on every iteration
    file = open('C:/R_Practice/dale1.txt', 'w')
    file.write(str(x))

    file.close()
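
The code above still writes every result to the same file, so it does not yet cover the second and third questions (one output file per block, plus a log of files where nothing matched). A minimal sketch of one way to do both follows. The function name process_folder, the output naming scheme and the log format are all hypothetical, and the pattern includes \b and [.:]? as assumptions of mine rather than something taken from the answer.

```python
import re
from pathlib import Path

# Assumed pattern: whitespace-tolerant, case-insensitive, "Item 7A" cannot open a block
PATTERN = re.compile(
    r"\bItem\s+7\b[.:]?(.*?)\bItem\s+7A\b",
    re.DOTALL | re.IGNORECASE,
)


def process_folder(src_dir, out_dir, log_path):
    """Write each Item 7 block to its own file; log files with no match."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    missed = []

    # sorted() collects the file list up front, before any output is written
    for txt in sorted(src_dir.glob("*.txt")):
        data = txt.read_text(encoding="utf-8", errors="replace")
        blocks = [b.strip() for b in PATTERN.findall(data)]
        if not blocks:
            missed.append(txt.name)  # delimiters not found in this file
            continue
        for i, block in enumerate(blocks, start=1):
            # e.g. report.txt -> report_item7_1.txt, report_item7_2.txt, ...
            target = out_dir / f"{txt.stem}_item7_{i}.txt"
            target.write_text(block, encoding="utf-8")

    # One missed file name per line
    Path(log_path).write_text(
        "".join(name + "\n" for name in missed), encoding="utf-8"
    )
    return missed


# Example call (paths are placeholders):
# missed = process_folder("C:/R_Practice/input", "C:/R_Practice/output",
#                         "C:/R_Practice/missed.log")
```

Keeping the output folder separate from the input folder avoids re-reading extracted blocks on a later run. The log will only list files where no pair of delimiters was found at all; a file where the pattern matched the wrong span would still need spot checks.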

solution1
0 ACCEPTED 2019-11-29 12:44:26
