I have a text file as below:
Table of Contents
1. Intro
2. All Data
2.1. Section 1
2.1.1. Subsection 1
2.1.2. Subsection 2
2.2. Section 2
2.2.1. Subsection 1
2.2.2. Subsection 2
2.2.3. Subsection 3
2.2.4. Subsection 4
1. Intro
blah. blah. blah
2. All Data
2.1. Section 1
2.1.1. Subsection 1
blah. blah
2.1.2. Subsection 2
Blah. Blah.
2.2. Section 1
2.2.1. Subsection 1
Blah. Blah.
2.2.2. Subsection 2
Blah. Blah.
And so on.
I want to extract only the bottom sections and their contents (2.1., 2.2., etc.). I don't want to match the 'Table of Contents'. The catch is that section names and subsection names can be the same, so I'm trying to match on the section numbers, i.e. 2.1., 2.2., etc. I'm trying the code below, but with no luck.
with open(output_file, 'r') as f:
    for index in range(1,11):
        section = "2." + str(index) + "\."
        self.log.info("Section : " + section)
        for key, group in it.groupby(f, lambda line: line.startswith(section)):
            if not key:
                group = list(group)
                print("Group:" + str(group))
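Your original code has a few mistakes:
- use . instead of \. in the section string, because startswith() compares plain text, not a regex
- call f.seek(0) before every pass, because the file is exhausted after the first groupby
- line.startswith(section) will not match indented lines; you have to remove the spaces first: line.strip().startswith(section)
- the matching lines are in the groups where key is true, so use if key: instead of if not key: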
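A minimal sketch of the first three points, assuming an in-memory file (io.StringIO) in place of the real one. The sample line and the variable names are made up for illustration.

```python
import io

# str.startswith() compares literal characters; it is not a regex.
line = "  2.1. Section 1\n"
print(line.startswith("2.1\\."))        # looks for a literal backslash
print(line.startswith("2.1."))          # fails because the line begins with spaces
print(line.strip().startswith("2.1."))  # matches once the spaces are removed

# A file object is an iterator: once it has been read to the end,
# it yields nothing more until it is rewound with seek(0).
f = io.StringIO("a\nb\n")
first_pass = list(f)
second_pass = list(f)   # empty, the iterator is exhausted
f.seek(0)
third_pass = list(f)    # full content again
print(first_pass, second_pass, third_pass)
```

This is why only the first section number can ever produce groups when seek(0) is missing.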
Minimal working code.
text = '''Table of Contents
1. Intro
2. All Data
2.1. Section 1
2.1.1. Subsection 1
2.1.2. Subsection 2
2.2. Section 2
2.2.1. Subsection 1
2.2.2. Subsection 2
2.2.3. Subsection 3
2.2.4. Subsection 4
1. Intro
blah. blah. blah
2. All Data
2.1. Section 1
2.1.1. Subsection 1
blah. blah
2.1.2. Subsection 2
Blah. Blah.
2.2. Section 1
2.2.1. Subsection 1
Blah. Blah.
2.2.2. Subsection 2
Blah. Blah.
'''
import itertools as it
import io

#with open(output_file) as f:
with io.StringIO(text) as f:
    for index in range(1,11):
        section = f"2.{index}."
        print("Section:", section)
        f.seek(0)
        for key, group in it.groupby(f, lambda line: line.strip().startswith(section)):
            if key:
                print('key:', key)
                group = list(group)
                print("Group:", group)
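You didn't show the expected result, but I think this approach gives the wrong result: it also matches the lines in the 'Table of Contents', and it breaks each section at every line that doesn't start with the section number.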
Section: 2.1.
key: True
Group: [' 2.1. Section 1\n', ' 2.1.1. Subsection 1\n', ' 2.1.2. Subsection 2\n']
key: True
Group: [' 2.1. Section 1\n', ' 2.1.1. Subsection 1\n']
key: True
Group: [' 2.1.2. Subsection 2\n']
Section: 2.2.
key: True
Group: [' 2.2. Section 2\n', ' 2.2.1. Subsection 1\n', ' 2.2.2. Subsection 2\n', ' 2.2.3. Subsection 3\n', ' 2.2.4. Subsection 4\n']
key: True
Group: [' 2.2. Section 1\n', ' 2.2.1. Subsection 1\n']
key: True
Group: [' 2.2.2. Subsection 2\n']
Section: 2.3.
Section: 2.4.
Section: 2.5.
Section: 2.6.
Section: 2.7.
Section: 2.8.
Section: 2.9.
Section: 2.10.
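The broken-up groups above follow from how itertools.groupby works: it starts a new group every time the key changes, so it only groups consecutive lines. A small sketch with a made-up list of lines (the names lines and groups are only for illustration):

```python
import itertools as it

lines = [
    "2.1. Section 1",
    "2.1.1. Subsection 1",
    "blah. blah",
    "2.1.2. Subsection 2",
    "Blah. Blah.",
]

# Each change of the key (True/False) closes the current group.
groups = [
    (key, list(group))
    for key, group in it.groupby(lines, lambda line: line.startswith("2.1."))
]
for key, group in groups:
    print(key, group)
```

The content lines end up in their own False groups, so one section is spread over several groups and its text is separated from its headings.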
---
If you need everything in "2. All Data", then you should rather split on "\n2. " and take the last part.
text = '''Table of Contents
1. Intro
2. All Data
2.1. Section 1
2.1.1. Subsection 1
2.1.2. Subsection 2
2.2. Section 2
2.2.1. Subsection 1
2.2.2. Subsection 2
2.2.3. Subsection 3
2.2.4. Subsection 4
1. Intro
blah. blah. blah
2. All Data
2.1. Section 1
2.1.1. Subsection 1
blah. blah
2.1.2. Subsection 2
Blah. Blah.
2.2. Section 1
2.2.1. Subsection 1
Blah. Blah.
2.2.2. Subsection 2
Blah. Blah.
'''
print('---')
# the trailing space keeps lines such as "2.1." and "2.2." from matching
parts = text.split('\n2. ')
print('2. ' + parts[-1])
Result:
---
2. All Data
2.1. Section 1
2.1.1. Subsection 1
blah. blah
2.1.2. Subsection 2
Blah. Blah.
2.2. Section 1
2.2.1. Subsection 1
Blah. Blah.
2.2.2. Subsection 2
Blah. Blah.
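Since only the last occurrence of the heading matters, str.rpartition may be a possible alternative to split(). This is only a sketch on a shortened, made-up sample; the names sample and body are illustrative.

```python
sample = (
    "Table of Contents\n"
    "1. Intro\n"
    "2. All Data\n"
    "2.1. Section 1\n"
    "1. Intro\n"
    "blah. blah. blah\n"
    "2. All Data\n"
    "2.1. Section 1\n"
    "blah. blah\n"
)

# rpartition() splits at the LAST occurrence of the separator,
# so the copy of the heading in the table of contents is ignored.
before, sep, after = sample.rpartition("\n2. ")

# sep is empty when the heading was not found at all.
body = (sep + after).lstrip("\n") if sep else ""
print(body)
```

Unlike split(), this does not build a list of all the parts that are thrown away anyway.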
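If you want to split on sections, then you could split the previous result on a newline followed by the section indentation and "2.", for example split("\n  2.") when sections are indented by two spaces. Later you can split every section on subsections the same way with the deeper indentation, for example split("\n    2.").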
text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    2.1.2. Subsection 2
  2.2. Section 2
    2.2.1. Subsection 1
    2.2.2. Subsection 2
    2.2.3. Subsection 3
    2.2.4. Subsection 4
1. Intro
blah. blah. blah
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.
  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''
# sections are indented by 2 spaces and subsections by 4 spaces

print('- part -')
parts = text.split('\n2. ')
part = '2. ' + parts[-1].rstrip()
print(part)

result = []
all_sections = part.split('\n  2.')
for section in all_sections[1:]:
    print('- section -')
    print('  2.' + section.rstrip())
    all_subsections = section.rstrip().split('\n    2.')
    result += ['  2.'+x for x in all_subsections]
    for subsection in all_subsections[1:]:
        print('- subsection -')
        print('    2.' + subsection.rstrip())

print('--- result ---')
for item in result:
    print(item)
    print('---')
Result:
- part -
2. All Data
2.1. Section 1
2.1.1. Subsection 1
blah. blah
2.1.2. Subsection 2
Blah. Blah.
2.2. Section 1
2.2.1. Subsection 1
Blah. Blah.
2.2.2. Subsection 2
Blah. Blah.
- section -
2.1. Section 1
2.1.1. Subsection 1
blah. blah
2.1.2. Subsection 2
Blah. Blah.
- subsection -
2.1.1. Subsection 1
blah. blah
- subsection -
2.1.2. Subsection 2
Blah. Blah.
- section -
2.2. Section 1
2.2.1. Subsection 1
Blah. Blah.
2.2.2. Subsection 2
Blah. Blah.
- subsection -
2.2.1. Subsection 1
Blah. Blah.
- subsection -
2.2.2. Subsection 2
Blah. Blah.
--- result ---
2.1. Section 1
---
2.1.1. Subsection 1
blah. blah
---
2.1.2. Subsection 2
Blah. Blah.
---
2.2. Section 1
---
2.2.1. Subsection 1
Blah. Blah.
---
2.2.2. Subsection 2
Blah. Blah.
---
Or you could use a regex to split part on \n\s*2\.
text = '''Table of Contents
1. Intro
2. All Data
2.1. Section 1
2.1.1. Subsection 1
2.1.2. Subsection 2
2.2. Section 2
2.2.1. Subsection 1
2.2.2. Subsection 2
2.2.3. Subsection 3
2.2.4. Subsection 4
1. Intro
blah. blah. blah
2. All Data
2.1. Section 1
2.1.1. Subsection 1
blah. blah
2.1.2. Subsection 2
Blah. Blah.
2.2. Section 1
2.2.1. Subsection 1
Blah. Blah.
2.2.2. Subsection 2
Blah. Blah.
'''
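print('- part -')
# the trailing space keeps lines such as "2.1." and "2.2." from matching
parts = text.split('\n2. ')
part = '2. ' + parts[-1].rstrip()
print(part)
print('- end -')

import re

# raw string, so \s and \. reach the regex engine unchanged
result = re.split(r'\n\s*2\.', part)
result = ['2.'+x for x in result]
for item in result[1:]:
    print(item)
    print('---')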
Result:
- part -
2. All Data
2.1. Section 1
2.1.1. Subsection 1
blah. blah
2.1.2. Subsection 2
Blah. Blah.
2.2. Section 1
2.2.1. Subsection 1
Blah. Blah.
2.2.2. Subsection 2
Blah. Blah.
- end -
2.1. Section 1
---
2.1.1. Subsection 1
blah. blah
---
2.1.2. Subsection 2
Blah. Blah.
---
2.2. Section 1
---
2.2.1. Subsection 1
Blah. Blah.
---
2.2.2. Subsection 2
Blah. Blah.
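If you need to look sections up by number afterwards, the same idea could be wrapped in a function that returns a dictionary. This is only a sketch: the name extract_sections, the shortened sample and the assumption that every heading looks like "2.x. Title" or "2.x.y. Title" on its own line are mine, not something from the question.

```python
import re

def extract_sections(text, top="2"):
    """Map each heading number under `top` to a (title, content) tuple."""
    # Keep only the body: everything after the LAST top-level heading,
    # which skips the table of contents.
    _, sep, body = text.rpartition(f"\n{top}. ")
    if not sep:
        return {}
    # A heading line: optional indentation, a number such as 2.1. or 2.1.1.,
    # then the title up to the end of the line.
    heading = re.compile(
        rf"^[ \t]*({re.escape(top)}(?:\.\d+)+\.)[ \t]+(.*)$", re.MULTILINE
    )
    matches = list(heading.finditer(body))
    sections = {}
    for current, following in zip(matches, matches[1:] + [None]):
        # The content runs up to the next heading or the end of the body.
        end = following.start() if following else len(body)
        content = body[current.end():end].strip()
        sections[current.group(1)] = (current.group(2).strip(), content)
    return sections

sample = """Table of Contents
1. Intro
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
1. Intro
blah. blah. blah
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
"""

result = extract_sections(sample)
for number, (title, content) in result.items():
    print(number, title, repr(content))
```

Because the keys are the section numbers, sections and subsections that share the same name don't collide. A content line that itself starts with something like "2.5. " would be mistaken for a heading, so this only fits files where that can't happen.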