Split string text following a specific pattern in Python

Question

I'm working with some text that follows a specific pattern (it's a Table of Contents) that I'm trying to extract. For example,

rawtext = 'TABLE OF CONTENTS 1 TRANSACTION OVERVIEW 10 1.1 Structure diagram 10 1.2 Risk factors 10 1.3 Principal parties 11 1.4 Notes 12 1.5 Credit structure 18 1.6 Portfolio information 19 1.7 Portfolio documentation 23 1.8 General 29 2 RISK FACTORS 31 '

The text follows a specific pattern, namely: (Section Number) then (Section Name) and finally (Page Number).

I'm not very good with regular expressions but have cobbled together some checks to extract and put these variables in a dataframe.

This works fine for extracting the Section Name and Section Page (though I'm sure it could be improved), but I can't identify the Section Number using this method, since it can be an integer (e.g. '2' for the 'RISK FACTORS' section), a decimal (e.g. '1.1' for the 'Structure diagram' section), or absent altogether (e.g. the 'TABLE OF CONTENTS' text has no section number preceding it).

I think a more efficient way would be to pass everything into a Python function (re.match? re.findall?) and extract everything according to the pattern itself, i.e. NUMBERS OR DECIMALS (IF PRESENT) ; (letters and spaces in between the letters) ; NUMBERS

So this would mean having an output like:

import pandas as pd
import re
import numpy as np
toc = pd.DataFrame()
toc['SectionName'] = re.findall(r'[A-Za-z-]+[ ]+[A-Za-z]*[ ]*[A-Za-z]*[ ]*[A-Za-z]*[ ]*[A-Za-z]*[ ]*[A-Za-z]*[ ]*', rawtext) # get the section names
toc['SectionPage'] = re.findall(r'[ ]+[0-9]*[ ]+', rawtext) # get the page numbers
toc.loc[0,'SectionNum'] = np.nan # 'TABLE OF CONTENTS' has no section number
toc.loc[1,'SectionNum'] = 1
toc.loc[2,'SectionNum'] = 1.1
toc.loc[3,'SectionNum'] = 1.2
toc.loc[4,'SectionNum'] = 1.3
toc.loc[5,'SectionNum'] = 1.4
toc.loc[6,'SectionNum'] = 1.5
toc.loc[7,'SectionNum'] = 1.6
toc.loc[8,'SectionNum'] = 1.7
toc.loc[9,'SectionNum'] = 1.8
toc.loc[10,'SectionNum'] = 2

toc = toc[['SectionNum', 'SectionName', 'SectionPage']]
print(toc)

I really can't manage this though; I've been trying for a few days now and have tried searching all over Stack Overflow but no luck (apologies if I've missed an obvious answer to this posted elsewhere). Would anyone have any thoughts or even advice to get further on the road to a solution?

Thank you so much in advance!

Answer 1

Here's what I have so far:

import re
rawtext = 'TABLE OF CONTENTS 1 TRANSACTION OVERVIEW 10 1.1 Structure diagram 10 1.2 Risk factors 10 1.3 Principal parties 11 1.4 Notes 12 1.5 Credit structure 18 1.6 Portfolio information 19 1.7 Portfolio documentation 23 1.8 General 29 2 RISK FACTORS 31 '
print(rawtext)
matches = re.finditer(r'(\d+(?:\.\d+)?)\s+(\D*?)\s+(\d+)', rawtext)
for m in matches:
    print((m[1], m[2], m[3]))

# output
# TABLE OF CONTENTS 1 TRANSACTION OVERVIEW 10 1.1 Structure diagram 10 1.2 Risk factors 10 1.3 Principal parties 11 1.4 Notes 12 1.5 Credit structure 18 1.6 Portfolio information 19 1.7 Portfolio documentation 23 1.8 General 29 2 RISK FACTORS 31
# ('1', 'TRANSACTION OVERVIEW', '10')
# ('1.1', 'Structure diagram', '10')
# ('1.2', 'Risk factors', '10')
# ('1.3', 'Principal parties', '11')
# ('1.4', 'Notes', '12')
# ('1.5', 'Credit structure', '18')
# ('1.6', 'Portfolio information', '19')
# ('1.7', 'Portfolio documentation', '23')
# ('1.8', 'General', '29')
# ('2', 'RISK FACTORS', '31')

I just noticed your edits. Let me see if this even answers your question, and I'll append any edits to this answer.

EDIT : Ok, I think this answers most of the question, at least as I've interpreted it. Now it's just a matter of organizing the data however you see fit. m[1] is the section number, m[2] is the section name, and m[3] is the page number.
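
One way to organize the matches is sketched below, assuming you want one record per entry with the page number as an integer. The names TOC_ENTRY and parse_toc are made up for illustration, and the sample text is a shortened copy of rawtext.

```python
import re

# Same pattern as above: section number, section name, page number.
TOC_ENTRY = re.compile(r'(\d+(?:\.\d+)?)\s+(\D*?)\s+(\d+)')

def parse_toc(text):
    """Return a list of (section number, section name, page) tuples."""
    rows = []
    for m in TOC_ENTRY.finditer(text):
        # Keep the section number as a string so that '1.10' is not
        # silently read as the float 1.1; the page becomes an int.
        rows.append((m[1], m[2], int(m[3])))
    return rows

rawtext = 'TABLE OF CONTENTS 1 TRANSACTION OVERVIEW 10 1.1 Structure diagram 10 1.2 Risk factors 10 '
rows = parse_toc(rawtext)
for row in rows:
    print(row)
```

If pandas is available, pd.DataFrame(rows, columns=['SectionNum', 'SectionName', 'SectionPage']) should give the three-column layout from the question. Note that the 'TABLE OF CONTENTS' heading is not captured, because it has neither a section number nor a page number.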

EDIT : Also, to explain the regex pattern, it's basically in 3 parts:

1. (\d+(?:\.\d+)?) captures the section number, which may be an integer or a decimal number
2. (\D*?) captures 0 or more non-digits, non-greedily (the section name)
3. (\d+) captures the page number

EDIT : I had a typo in my explanation of parts 1-3 above. Note the ? at the end of part 1, (?:\.\d+)? . It means match 0 or 1 times, which makes the decimal part of the section number optional.
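
The three parts can also be written out in the pattern itself. The sketch below uses re.VERBOSE and named groups; the group names (num, name, page) and the short sample string are illustrative choices, not part of the original answer.

```python
import re

# The same three-part pattern, spelled out with re.VERBOSE and named groups.
toc_pattern = re.compile(r'''
    (?P<num>\d+(?:\.\d+)?)   # 1: section number, integer with an optional decimal part
    \s+
    (?P<name>\D*?)           # 2: section name, as few non-digits as possible
    \s+
    (?P<page>\d+)            # 3: page number
''', re.VERBOSE)

sample = '1.1 Structure diagram 10 2 RISK FACTORS 31'
m = toc_pattern.search(sample)
print(m['num'], '|', m['name'], '|', m['page'])

# findall returns one tuple per entry, in group order.
entries = toc_pattern.findall(sample)
print(entries)
```

Because whitespace inside a verbose pattern is ignored, the separators between the parts have to be written explicitly as \s+.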

Answer 2

rawtext = 'TABLE OF CONTENTS 1 TRANSACTION OVERVIEW 10 1.1 Structure diagram 10 1.2 Risk factors 10 1.3 Principal parties 11 1.4 Notes 12 1.5 Credit structure 18 1.6 Portfolio information 19 1.7 Portfolio documentation 23 1.8 General 29 2 RISK FACTORS 31 '

title = "TABLE OF CONTENTS"

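# Drop the heading, which has no section number or page number
text = rawtext[len(title):]
wordList = text.split()

indexList = []    # section numbers
lessonList = []   # section names
pageList = []     # page numbers
lessonBlank = []  # tokens of the entry currently being read
for element in wordList:

    if lessonBlank == []:
        # The first token of an entry is its section number
        lessonBlank.append(element)
        indexList.append(element)

    else:

        try:
            # An integer token after the name is the page number and closes the entry
            pageList.append(int(element))
            lessonBlank = []

        except ValueError:

            # Any other token belongs to the section name
            if len(lessonBlank) == 1:
                lessonList.append(element)  # first word of the name
            else:
                lessonList[-1] = lessonList[-1] + " " + element
            lessonBlank.append(element)

print(list(zip(indexList, lessonList, pageList)))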

2 answers

solution1
0 ACCEPTED 2018-03-01 15:59:06

solution2
0 2018-03-01 16:00:47
