
How to divide a text file sectionwise using Python

I have a text file as below:

Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.

And so on.

I want to extract only the bottom sections and their contents (2.1., 2.2., etc.). I don't want to match the 'Table of Contents'. The catch is that section names and subsection names can be the same, so I'm trying to match on the section numbers, i.e. 2.1., 2.2., etc. I tried the code below, but with no luck.

with open(output_file, 'r') as f:
    for index in range(1,11):
        section = "2." + str(index) + "\."
        self.log.info("Section : " + section)
        for key, group in it.groupby(f, lambda line: line.startswith(section)):
            if not key:
                group = list(group)
                print("Group:" + str(group))

Your original code has a few mistakes:

  • You are matching with a plain string (startswith), not a regex, so the section prefix should end with a plain . instead of \.
  • You iterate over the file inside the loop, so the first pass reads all lines to the end of the file and the next pass starts at the end of the file and has nothing to read. Either read all lines into a list in memory and loop over that list, or move back to the beginning of the file on every pass with f.seek(0).
  • The lines start with spaces, so line.startswith(section) will not match. You have to remove the leading spaces first: line.strip().startswith(section)
  • You have to keep the groups whose key is True (the lines that match), so test if key: rather than if not key:
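The second point above is easy to miss, so here is a small sketch of it. It assumes io.StringIO as a stand-in for the opened file, and the three sample lines are made up for illustration.

```python
import io

# io.StringIO stands in for the opened file; these sample lines are made up.
f = io.StringIO("  2.1. Section 1\n    blah\n  2.2. Section 2\n")

first_pass = list(f)    # iterating consumes every line
second_pass = list(f)   # the position is now at the end, so nothing is left
print(len(first_pass), len(second_pass))

f.seek(0)               # option 1: rewind before each new pass
rewound = list(f)

lines = rewound         # option 2: read once, then loop over the list
for prefix in ("2.1.", "2.2."):
    hits = [line for line in lines if line.strip().startswith(prefix)]
    print(prefix, hits)
```

Reading the lines into a list once is usually the simpler choice when the file fits in memory, because every later pass works on the same data without touching the file again.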

Minimal working code.

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

import itertools as it
import io

#with open(output_file) as f:
with io.StringIO(text) as f:   
    for index in range(1,11):
        section = f"2.{index}."
        print("Section:", section)
        f.seek(0)
        for key, group in it.groupby(f, lambda line: line.strip().startswith(section)):            
            if key:
                print('key:', key)
                group = list(group)
                print("Group:", group)

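You didn't show the expected result, but I think this approach still gives the wrong result. As the output below shows, it also matches the Table of Contents, and it drops the content lines, because groupby only groups consecutive lines that start with the section number.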

Section: 2.1.
key: True
Group: ['  2.1. Section 1\n', '     2.1.1. Subsection 1\n', '     2.1.2. Subsection 2\n']
key: True
Group: ['  2.1. Section 1\n', '    2.1.1. Subsection 1\n']
key: True
Group: ['    2.1.2. Subsection 2\n']
Section: 2.2.
key: True
Group: ['  2.2. Section 2\n', '     2.2.1. Subsection 1\n', '     2.2.2. Subsection 2\n', '     2.2.3. Subsection 3\n', '     2.2.4. Subsection 4\n']
key: True
Group: ['  2.2. Section 1\n', '    2.2.1. Subsection 1\n']
key: True
Group: ['    2.2.2. Subsection 2\n']
Section: 2.3.
Section: 2.4.
Section: 2.5.
Section: 2.6.
Section: 2.7.
Section: 2.8.
Section: 2.9.
Section: 2.10.
---

If you need everything in "2. All Data", it is simpler to split the text on "\n2." and take the last part:

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('---')
parts = text.split('\n2.')
print('2.' + parts[-1])

Result:

---
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
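If only the last piece is needed, str.rpartition may be an alternative to split, since it splits once at the last occurrence of the separator. This is only a sketch: it assumes the body heading is the last line that starts with "2. " in column 0, and the shortened sample text is made up for illustration.

```python
text = """Table of Contents
1. Intro
2. All Data
  2.1. Section 1

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
"""

# rpartition returns (before, separator, after) for the LAST occurrence.
# If the separator is not found, sep is an empty string.
before, sep, after = text.rpartition("\n2. ")
part = ("2. " + after.rstrip()) if sep else ""
print(part)
```

Checking sep guards against a file that has no such heading at all, which text.split('\n2.')[-1] would silently return as the whole text.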

If you want to split into sections, you can split the "2. All Data" part on "\n  2." (a newline plus two spaces). Each section can then be split into subsections on "\n    2." (a newline plus four spaces).

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('- part -')
parts = text.split('\n2.')
part = '2.' + parts[-1].rstrip() 
print(part)

result = []
all_sections = part.split('\n  2.')
for section in all_sections[1:]:
    print('- section -')
    print('  2.' + section.rstrip())
    all_subsections = section.rstrip().split('\n    2.')
    result += ['    2.'+x for x in all_subsections]
    for subsection in all_subsections[1:]:
        print('- subsection -')
        print('    2.' + subsection.rstrip())
    
print('--- result ---')        
for item in result:
    print(item)
    print('---')

Result:

- part -
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- section -
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.
- subsection -
    2.1.1. Subsection 1
    blah. blah
- subsection -
    2.1.2. Subsection 2
    Blah. Blah.
- section -
  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- subsection -
    2.2.1. Subsection 1
    Blah. Blah.
- subsection -
    2.2.2. Subsection 2
    Blah. Blah.

--- result ---
    2.1. Section 1
---
    2.1.1. Subsection 1
    blah. blah
---
    2.1.2. Subsection 2
    Blah. Blah.
---
    2.2. Section 1
---
    2.2.1. Subsection 1
    Blah. Blah.
---
    2.2.2. Subsection 2
    Blah. Blah.
---

Alternatively, use a regex and split the part on \n\s*2\. which matches any amount of indentation, so sections and subsections are split in one step.

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('- part -')
parts = text.split('\n2.')
part = '2.' + parts[-1].rstrip() 
print(part)
print('- end -')

import re

result = re.split(r'\n\s*2\.', part)  # raw string, so \s is not treated as an invalid string escape
result = ['2.'+x for x in result]
for item in result[1:]:
    print(item)
    print('---')

Result:

- part -
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- end -

2.1. Section 1
---
2.1.1. Subsection 1
    blah. blah
---
2.1.2. Subsection 2
    Blah. Blah.
---
2.2. Section 1
---
2.2.1. Subsection 1
    Blah. Blah.
---
2.2.2. Subsection 2
    Blah. Blah.
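A possible variation on the regex split: a zero-width lookahead keeps the section number inside each chunk, so "2." does not have to be added back afterwards, and the chunks can be stored by number. This is a sketch under assumptions: the function name split_numbered and the shortened sample are hypothetical, it needs Python 3.7+ (re.split on a pattern that can match an empty string), and it assumes no content line begins with "2." followed by a digit.

```python
import re

def split_numbered(part, top="2"):
    """Map each section number (e.g. '2.1.1') to its heading plus content."""
    # Split before every line whose first non-blank text is "<top>.<digit>".
    # The lookahead (?=...) is zero-width, so the number stays in the chunk;
    # only the leading indentation is consumed by the split.
    heading = re.compile(rf"^[ \t]*(?={re.escape(top)}\.\d)", re.MULTILINE)
    chunks = [chunk.rstrip() for chunk in heading.split(part)]
    sections = {}
    for chunk in chunks[1:]:  # chunks[0] is the text before the first heading
        number = re.match(r"\d+(?:\.\d+)*", chunk).group()
        sections[number] = chunk
    return sections

# Shortened, made-up sample in the same layout as the question's file.
part = """2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
"""

sections = split_numbered(part)
for number, body in sections.items():
    print(number, "->", repr(body))
```

Keying by number rather than by name sidesteps the problem from the question that section and subsection names can repeat.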
