How to divide a text file section-wise using Python
I have a text file like the following:
Table of Contents
1. Intro
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
        2.1.2. Subsection 2
    2.2. Section 2
        2.2.1. Subsection 1
        2.2.2. Subsection 2
        2.2.3. Subsection 3
        2.2.4. Subsection 4
1. Intro
    blah. blah. blah
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
            blah. blah
        2.1.2. Subsection 2
            Blah. Blah.
    2.2. Section 1
        2.2.1. Subsection 1
            Blah. Blah.
        2.2.2. Subsection 2
            Blah. Blah.
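and so on.
I only want to extract the bottom part, i.e. the sections and their content (2.1., 2.2., etc.). I don't want to match the "Table of Contents". The problem is that section names and subsection names can be the same, so I am trying to match on the section numbers instead, i.e. 2.1., 2.2., etc. I am trying the code below, but with no luck.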
with open(output_file, 'r') as f:
    for index in range(1,11):
        section = "2." + str(index) + "\."
        self.log.info("Section : " + section)
        for key, group in it.groupby(f, lambda line: line.startswith(section)):
            if not key:
                group = list(group)
                print("Group:" + str(group))
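Your original code has a few mistakes:
- it needs . instead of \. because startswith() compares literal text, not a regex
- it needs f.seek(0) before each pass, to read the file again from the beginning
- your lines have spaces at the beginning, so line.startswith(section) will not match; you have to remove the spaces first: line.strip().startswith(section)
- it needs "if key:" instead of "if not key:" to get the matching groups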
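The first two points can be checked in isolation. The snippet below is only an illustrative sketch, not part of the original answer; the sample line and the in-memory file are made up for the demonstration.

```python
import io

# str.startswith compares literal characters; it does not interpret regex escapes.
line = "    2.1. Section 1\n"
with_backslash = line.strip().startswith("2.1\\.")  # prefix contains a real backslash
without_backslash = line.strip().startswith("2.1.")
print(with_backslash, without_backslash)

# A file object is an iterator: once consumed, it yields nothing until rewound.
f = io.StringIO("a\nb\n")
first_pass = list(f)
second_pass = list(f)   # the position is at the end of the file
f.seek(0)               # rewind to the beginning
third_pass = list(f)
print(first_pass, second_pass, third_pass)
```

This is why the loop over the ten section prefixes only ever sees data on its first iteration unless the file is rewound.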
Minimal working code:
text = '''Table of Contents
1. Intro
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
        2.1.2. Subsection 2
    2.2. Section 2
        2.2.1. Subsection 1
        2.2.2. Subsection 2
        2.2.3. Subsection 3
        2.2.4. Subsection 4
1. Intro
    blah. blah. blah
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
            blah. blah
        2.1.2. Subsection 2
            Blah. Blah.
    2.2. Section 1
        2.2.1. Subsection 1
            Blah. Blah.
        2.2.2. Subsection 2
            Blah. Blah.
'''
import itertools as it
import io
#with open(output_file) as f:
with io.StringIO(text) as f:
    for index in range(1, 11):
        section = f"2.{index}."
        print("Section:", section)
        f.seek(0)  # rewind, otherwise every pass after the first reads nothing
        for key, group in it.groupby(f, lambda line: line.strip().startswith(section)):
            if key:
                print('key:', key)
                group = list(group)
                print("Group:", group)
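You didn't show the expected result, but I think your code gives the wrong result even with these fixes: the table of contents is matched too, and a body section is cut into separate groups wherever a content line interrupts it.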
Section: 2.1.
key: True
Group: ['    2.1. Section 1\n', '        2.1.1. Subsection 1\n', '        2.1.2. Subsection 2\n']
key: True
Group: ['    2.1. Section 1\n', '        2.1.1. Subsection 1\n']
key: True
Group: ['        2.1.2. Subsection 2\n']
Section: 2.2.
key: True
Group: ['    2.2. Section 2\n', '        2.2.1. Subsection 1\n', '        2.2.2. Subsection 2\n', '        2.2.3. Subsection 3\n', '        2.2.4. Subsection 4\n']
key: True
Group: ['    2.2. Section 1\n', '        2.2.1. Subsection 1\n']
key: True
Group: ['        2.2.2. Subsection 2\n']
Section: 2.3.
Section: 2.4.
Section: 2.5.
Section: 2.6.
Section: 2.7.
Section: 2.8.
Section: 2.9.
Section: 2.10.
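One possible way to get only the body sections is to walk the lines once and track headings, instead of using groupby. The sketch below is not from the original answer: parse_body and HEADING are hypothetical names, the sample text is shortened, and it assumes that every body heading was listed once in the table of contents and that no content line starts with a section number.

```python
import re

text = '''Table of Contents
1. Intro
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
        2.1.2. Subsection 2
    2.2. Section 2
        2.2.1. Subsection 1
1. Intro
    blah. blah. blah
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
            blah. blah
        2.1.2. Subsection 2
            Blah. Blah.
    2.2. Section 1
        2.2.1. Subsection 1
            Blah. Blah.
'''

# Matches headings such as "2.1.2. Subsection 2"; captures the number and the title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.\s+(.*)$")

def parse_body(text):
    """Return {number: (title, [content lines])} for body headings only.

    Assumption: the first time a number appears it belongs to the table of
    contents, and the second time it appears it starts a body section.
    """
    seen = set()
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADING.match(line)
        if match:
            number, title = match.groups()
            if number in seen:
                sections[number] = (title, [])
                current = number
            else:
                seen.add(number)   # table-of-contents entry, ignore it
                current = None
        elif current is not None and line:
            sections[current][1].append(line)
    return sections

sections = parse_body(text)
for number, (title, body) in sections.items():
    if number.startswith("2."):
        print(number, title, body)
```

Keying the result by section number rather than by title sidesteps the problem that section and subsection names can repeat.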
---
If you need the whole "2. All Data" part, split the text on "\n2." and take the last piece:
text = '''Table of Contents
1. Intro
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
        2.1.2. Subsection 2
    2.2. Section 2
        2.2.1. Subsection 1
        2.2.2. Subsection 2
        2.2.3. Subsection 3
        2.2.4. Subsection 4
1. Intro
    blah. blah. blah
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
            blah. blah
        2.1.2. Subsection 2
            Blah. Blah.
    2.2. Section 1
        2.2.1. Subsection 1
            Blah. Blah.
        2.2.2. Subsection 2
            Blah. Blah.
'''
print('---')
parts = text.split('\n2.')
print('2.' + parts[-1])
Result:
---
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
            blah. blah
        2.1.2. Subsection 2
            Blah. Blah.
    2.2. Section 1
        2.2.1. Subsection 1
            Blah. Blah.
        2.2.2. Subsection 2
            Blah. Blah.
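Since only the last occurrence matters here, str.rpartition may be a slightly more direct fit than split followed by [-1]. This is just an alternative sketch on a shortened sample, not the original answer's code; it assumes that the body heading "2. " starts at column 0 and is not the very first line of the file.

```python
text = '''Table of Contents
1. Intro
2. All Data
    2.1. Section 1
1. Intro
    blah. blah. blah
2. All Data
    2.1. Section 1
        blah. blah
'''

# rpartition splits once, at the LAST occurrence of the separator.
before, sep, after = text.rpartition("\n2. ")

# sep is empty when the separator was not found at all.
body = (sep.lstrip("\n") + after).rstrip() if sep else ""
print(body)
```

Unlike split, this does not build a list of every piece, and the separator comes back as part of the result, so the "2. " prefix does not have to be retyped.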
If you want to split it into sections, you can use split("\n    2.") on the previous result. Then you can use split("\n        2.") on each section to get its subsections.
text = '''Table of Contents
1. Intro
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
        2.1.2. Subsection 2
    2.2. Section 2
        2.2.1. Subsection 1
        2.2.2. Subsection 2
        2.2.3. Subsection 3
        2.2.4. Subsection 4
1. Intro
    blah. blah. blah
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
            blah. blah
        2.1.2. Subsection 2
            Blah. Blah.
    2.2. Section 1
        2.2.1. Subsection 1
            Blah. Blah.
        2.2.2. Subsection 2
            Blah. Blah.
'''
print('- part -')
parts = text.split('\n2.')
part = '2.' + parts[-1].rstrip()
print(part)
result = []
# sections are indented by 4 spaces, subsections by 8
all_sections = part.split('\n    2.')
for section in all_sections[1:]:
    print('- section -')
    print('    2.' + section.rstrip())
    all_subsections = section.rstrip().split('\n        2.')
    result += ['    2.'+x for x in all_subsections]
    for subsection in all_subsections[1:]:
        print('- subsection -')
        print('        2.' + subsection.rstrip())
print('--- result ---')
for item in result:
    print(item)
    print('---')
Result:
- part -
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
            blah. blah
        2.1.2. Subsection 2
            Blah. Blah.
    2.2. Section 1
        2.2.1. Subsection 1
            Blah. Blah.
        2.2.2. Subsection 2
            Blah. Blah.
- section -
    2.1. Section 1
        2.1.1. Subsection 1
            blah. blah
        2.1.2. Subsection 2
            Blah. Blah.
- subsection -
        2.1.1. Subsection 1
            blah. blah
- subsection -
        2.1.2. Subsection 2
            Blah. Blah.
- section -
    2.2. Section 1
        2.2.1. Subsection 1
            Blah. Blah.
        2.2.2. Subsection 2
            Blah. Blah.
- subsection -
        2.2.1. Subsection 1
            Blah. Blah.
- subsection -
        2.2.2. Subsection 2
            Blah. Blah.
--- result ---
    2.1. Section 1
---
    2.1.1. Subsection 1
            blah. blah
---
    2.1.2. Subsection 2
            Blah. Blah.
---
    2.2. Section 1
---
    2.2.1. Subsection 1
            Blah. Blah.
---
    2.2.2. Subsection 2
            Blah. Blah.
---
Alternatively, you can use a regex and split part on \n\s*2\. so that any amount of indentation before "2." is accepted:
text = '''Table of Contents
1. Intro
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
        2.1.2. Subsection 2
    2.2. Section 2
        2.2.1. Subsection 1
        2.2.2. Subsection 2
        2.2.3. Subsection 3
        2.2.4. Subsection 4
1. Intro
    blah. blah. blah
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
            blah. blah
        2.1.2. Subsection 2
            Blah. Blah.
    2.2. Section 1
        2.2.1. Subsection 1
            Blah. Blah.
        2.2.2. Subsection 2
            Blah. Blah.
'''
print('- part -')
parts = text.split('\n2.')
part = '2.' + parts[-1].rstrip()
print(part)
print('- end -')
import re
# raw string, so \s and \. reach the regex engine unchanged
result = re.split(r'\n\s*2\.', part)
result = ['2.'+x for x in result]
# result[0] is the "All Data" heading itself, so skip it
for item in result[1:]:
    print(item)
    print('---')
Result:
- part -
2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
            blah. blah
        2.1.2. Subsection 2
            Blah. Blah.
    2.2. Section 1
        2.2.1. Subsection 1
            Blah. Blah.
        2.2.2. Subsection 2
            Blah. Blah.
- end -
2.1. Section 1
---
2.1.1. Subsection 1
            blah. blah
---
2.1.2. Subsection 2
            Blah. Blah.
---
2.2. Section 1
---
2.2.1. Subsection 1
            Blah. Blah.
---
2.2.2. Subsection 2
            Blah. Blah.
---
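A possible refinement of the regex split is to put the "2." inside a lookahead, so the number stays attached to each piece and does not have to be added back afterwards. This is a sketch rather than the original answer's code; part is a shortened copy of the value computed above, and pieces is a name chosen here. Like the version above, it would also split on a content line that happens to start with "2.".

```python
import re

part = '''2. All Data
    2.1. Section 1
        2.1.1. Subsection 1
            blah. blah
        2.1.2. Subsection 2
            Blah. Blah.
    2.2. Section 1
        2.2.1. Subsection 1
            Blah. Blah.'''

# (?=2\.) requires "2." right after the whitespace but does not consume it,
# so every piece still begins with its own section number.
pieces = re.split(r'\n\s*(?=2\.)', part)

# pieces[0] is the "2. All Data" heading itself, so skip it
for piece in pieces[1:]:
    print(piece)
    print('---')
```

The consumed newline and indentation are dropped from the start of each piece, while the indentation of content lines inside a piece is kept.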