
Split string text following a specific pattern in Python

I'm working with some text that follows a specific pattern (it's a Table of Contents) that I'm trying to extract. For example,

rawtext = 'TABLE OF CONTENTS 1 TRANSACTION OVERVIEW 10 1.1 Structure diagram 10 1.2 Risk factors 10 1.3 Principal parties 11 1.4 Notes 12 1.5 Credit structure 18 1.6 Portfolio information 19 1.7 Portfolio documentation 23 1.8 General 29 2 RISK FACTORS 31 '

The text follows a specific pattern, namely: (Section Number) then (Section Name) and finally (Page Number).

I'm not very good with regular expressions but have cobbled together some checks to extract and put these variables in a dataframe.

My attempt (shown below) works fine for extracting the Section Name and Section Page, though I'm sure it could be improved. However, I can't identify the Section Number using this method, since it can be an integer (e.g. '2' for the 'RISK FACTORS' section), a decimal (e.g. '1.1' for the 'Structure diagram' section), or absent altogether (e.g. the 'TABLE OF CONTENTS' text has no section number preceding it).

I think a more efficient way would be to pass everything into a Python function (re.match? re.findall?) and extract everything according to the pattern itself, i.e. NUMBERS OR DECIMALS (IF PRESENT) ; (letters, with spaces in between them) ; NUMBERS

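So this would mean having an output like the one produced below, where the SectionNum values are filled in by hand to show the desired result: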

import pandas as pd
import re
import numpy as np
toc = pd.DataFrame()
toc['SectionName'] = re.findall(r'[A-Za-z-]+[ ]+[A-Za-z]*[ ]*[A-Za-z]*[ ]*[A-Za-z]*[ ]*[A-Za-z]*[ ]*[A-Za-z]*[ ]*', rawtext) # get the section names
toc['SectionPage'] = re.findall(r'[ ]+[0-9]*[ ]+', rawtext) # get the page numbers
toc.loc[0,'SectionNum'] = np.nan  # 'TABLE OF CONTENTS' has no section number
toc.loc[1,'SectionNum'] = 1
toc.loc[2,'SectionNum'] = 1.1
toc.loc[3,'SectionNum'] = 1.2
toc.loc[4,'SectionNum'] = 1.3
toc.loc[5,'SectionNum'] = 1.4
toc.loc[6,'SectionNum'] = 1.5
toc.loc[7,'SectionNum'] = 1.6
toc.loc[8,'SectionNum'] = 1.7
toc.loc[9,'SectionNum'] = 1.8
toc.loc[10,'SectionNum'] = 2

toc = toc[['SectionNum', 'SectionName', 'SectionPage']]
print(toc)

I really can't manage this though; I've been trying for a few days now and have tried searching all over Stack Overflow but no luck (apologies if I've missed an obvious answer to this posted elsewhere). Would anyone have any thoughts or even advice to get further on the road to a solution?

Thank you so much in advance!

Here's what I have so far:

import re
rawtext = 'TABLE OF CONTENTS 1 TRANSACTION OVERVIEW 10 1.1 Structure diagram 10 1.2 Risk factors 10 1.3 Principal parties 11 1.4 Notes 12 1.5 Credit structure 18 1.6 Portfolio information 19 1.7 Portfolio documentation 23 1.8 General 29 2 RISK FACTORS 31 '
print(rawtext)
matches = re.finditer(r'(\d+(?:\.\d+)?)\s+(\D*?)\s+(\d+)', rawtext)
for m in matches:
    print((m[1], m[2], m[3]))

# output
# TABLE OF CONTENTS 1 TRANSACTION OVERVIEW 10 1.1 Structure diagram 10 1.2 Risk factors 10 1.3 Principal parties 11 1.4 Notes 12 1.5 Credit structure 18 1.6 Portfolio information 19 1.7 Portfolio documentation 23 1.8 General 29 2 RISK FACTORS 31
# ('1', 'TRANSACTION OVERVIEW', '10')
# ('1.1', 'Structure diagram', '10')
# ('1.2', 'Risk factors', '10')
# ('1.3', 'Principal parties', '11')
# ('1.4', 'Notes', '12')
# ('1.5', 'Credit structure', '18')
# ('1.6', 'Portfolio information', '19')
# ('1.7', 'Portfolio documentation', '23')
# ('1.8', 'General', '29')
# ('2', 'RISK FACTORS', '31')

I just noticed your edits. Let me see if this even answers your question, and I'll append any edits to this answer.

EDIT : OK, I think this answers most of the question, at least as I've interpreted it. Now it's just a matter of organizing the data however you see fit. m[1] is the section number, m[2] is the section name, and m[3] is the page number.
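As a minimal sketch of one way to organize the matches, the loop below collects each match into a dictionary keyed by the column names from the question. The name `rows` is just illustrative, and keeping the section number as a string while converting the page to an integer is my own assumption about what is most useful.

```python
import re

rawtext = 'TABLE OF CONTENTS 1 TRANSACTION OVERVIEW 10 1.1 Structure diagram 10 1.2 Risk factors 10 1.3 Principal parties 11 1.4 Notes 12 1.5 Credit structure 18 1.6 Portfolio information 19 1.7 Portfolio documentation 23 1.8 General 29 2 RISK FACTORS 31 '

# Same pattern as above: section number, section name, page number
pattern = re.compile(r'(\d+(?:\.\d+)?)\s+(\D*?)\s+(\d+)')

rows = []
for m in pattern.finditer(rawtext):
    rows.append({
        'SectionNum': m[1],        # kept as a string so '1.1' stays exactly as written
        'SectionName': m[2],
        'SectionPage': int(m[3]),  # page numbers are plain integers
    })

for row in rows:
    print(row)
```

A list of dictionaries like this can be passed straight to `pd.DataFrame(rows)` to get the three columns. Note that the 'TABLE OF CONTENTS' title is skipped, since it has neither a section number nor a page number of its own.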

EDIT : Also, to explain the regex pattern, it's basically in 3 parts:

  1. (\d+(?:\.\d+)?) captures the section number, which may be an integer or a decimal number
  2. (\D*?) captures the section name: 0 or more non-digit characters, matched non-greedily
  3. (\d+) captures the page number

EDIT : I had a typo in my 1-3 explanation above. Note the ? at the end of (?:\.\d+)? in part (1). It means "match 0 or 1 times"; in other words, the fractional part of the section number is optional.
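To make the three parts easier to see, here is a sketch of the same pattern written with `re.VERBOSE` and named groups. The group names (`num`, `name`, `page`) and the shortened `sample` string are only illustrative; the pattern itself is unchanged.

```python
import re

# The same three-part pattern, spread over several lines with comments.
TOC_ENTRY = re.compile(
    r"""
    (?P<num>\d+(?:\.\d+)?)   # part 1: section number, with an optional fractional part
    \s+
    (?P<name>\D*?)           # part 2: section name, shortest run of non-digits
    \s+
    (?P<page>\d+)            # part 3: page number
    """,
    re.VERBOSE,
)

# Shortened sample in the same format as rawtext
sample = '1 TRANSACTION OVERVIEW 10 1.1 Structure diagram 10 2 RISK FACTORS 31 '

entries = [m.groupdict() for m in TOC_ENTRY.finditer(sample)]
for entry in entries:
    print(entry)
```

With named groups, each match can be read as `m['num']`, `m['name']`, and `m['page']` instead of by position.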

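rawtext = 'TABLE OF CONTENTS 1 TRANSACTION OVERVIEW 10 1.1 Structure diagram 10 1.2 Risk factors 10 1.3 Principal parties 11 1.4 Notes 12 1.5 Credit structure 18 1.6 Portfolio information 19 1.7 Portfolio documentation 23 1.8 General 29 2 RISK FACTORS 31 '

title = "TABLE OF CONTENTS"

# Drop the title so the remaining text starts at the first section number
text = rawtext[len(title):]
wordList = text.split()

indexList = []    # section numbers, e.g. '1', '1.1'
lessonList = []   # section names
pageList = []     # page numbers
lessonBlank = []  # tokens of the entry currently being read
for element in wordList:

    if lessonBlank == []:
        # The first token of an entry is its section number
        lessonBlank.append(element)
        indexList.append(element)

    else:

        try:
            # A numeric token ends the entry: it is the page number
            float(element)

            pageList.append(int(element))
            lessonBlank = []

        except ValueError:

            # Any other token is part of the section name
            if len(lessonBlank) == 1:
                lessonList.append(element)  # first word of a new name
            else:
                lessonList[-1] = lessonList[-1] + " " + element
            lessonBlank.append(element)

print(list(zip(indexList, lessonList, pageList)))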
