
How to divide a text file sectionwise using Python

I have a text file as below:

Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.

And so on.

I want to extract only the bottom sections and their contents (2.1., 2.2., etc.). I don't want to match the 'Table of Contents'. The catch is that section names and subsection names can be the same, so I'm trying to match on section numbers, i.e., 2.1., 2.2., etc. I tried the code below, but with no luck.

with open(output_file, 'r') as f:
    for index in range(1,11):
        section = "2." + str(index) + "\."
        self.log.info("Section : " + section)
        for key, group in it.groupby(f, lambda line: line.startswith(section)):
            if not key:
                group = list(group)
                print("Group:" + str(group))

Your original code has a few mistakes:

  • you use a normal string, not a regex, so you should use . instead of \.
  • you read from the file inside the loop, so the first iteration reads all lines to the end of the file, and the next iteration starts at the end of the file and has nothing to read. You should read all lines into a list in memory and use that list, or move back to the beginning of the file in every iteration with f.seek(0)
  • lines start with spaces, so line.startswith(section) will not match; you have to remove the spaces first: line.strip().startswith(section)
  • you have to check the groups whose key is true: if key:
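The second point can be checked in isolation. The following is a minimal sketch, assuming a short made-up `sample` string in place of the real file: it shows that a second pass over the same file object yields nothing, and that reading the lines into a list once lets every prefix be checked against the same data. The names `sample`, `second_pass`, and `matches` are illustrative only.

```python
import io
import itertools as it

# Hypothetical sample standing in for the contents of the real file.
sample = "  2.1. Section 1\n    blah\n  2.2. Section 2\n    Blah\n"

with io.StringIO(sample) as f:
    lines = list(f)        # first pass reads every line
    second_pass = list(f)  # empty list: the position is already at the end

matches = {}
for index in range(1, 3):
    section = f"2.{index}."
    # groupby clusters consecutive lines by whether they start with the prefix
    for key, group in it.groupby(lines, lambda line: line.strip().startswith(section)):
        if key:
            matches.setdefault(section, []).extend(group)

print(second_pass)
print(matches)
```

With the list in memory there is no need for f.seek(0), at the cost of holding the whole file in memory.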

Minimal working code:

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

import itertools as it
import io

#with open(output_file) as f:
with io.StringIO(text) as f:   
    for index in range(1,11):
        section = f"2.{index}."
        print("Section:", section)
        f.seek(0)
        for key, group in it.groupby(f, lambda line: line.strip().startswith(section)):            
            if key:
                print('key:', key)
                group = list(group)
                print("Group:", group)

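You didn't show the expected result, but I think this approach gives the wrong result: the groups below include the table of contents and stop at the first line that does not start with the section number.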

Section: 2.1.
key: True
Group: ['  2.1. Section 1\n', '     2.1.1. Subsection 1\n', '     2.1.2. Subsection 2\n']
key: True
Group: ['  2.1. Section 1\n', '    2.1.1. Subsection 1\n']
key: True
Group: ['    2.1.2. Subsection 2\n']
Section: 2.2.
key: True
Group: ['  2.2. Section 2\n', '     2.2.1. Subsection 1\n', '     2.2.2. Subsection 2\n', '     2.2.3. Subsection 3\n', '     2.2.4. Subsection 4\n']
key: True
Group: ['  2.2. Section 1\n', '    2.2.1. Subsection 1\n']
key: True
Group: ['    2.2.2. Subsection 2\n']
Section: 2.3.
Section: 2.4.
Section: 2.5.
Section: 2.6.
Section: 2.7.
Section: 2.8.
Section: 2.9.
Section: 2.10.
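One possible way to avoid both problems is to walk the lines once and remember which section is currently open. This is only a sketch under two assumptions that the question does not confirm: the table of contents ends at the first blank line, and top-level headings such as "2. All Data" are not indented. The names `heading`, `top_level`, and `sections` are made up for the illustration, and the sample text is shortened.

```python
import re

text = """Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
  2.2. Section 2

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
"""

# Assumption: the table of contents ends at the first blank line.
toc, _, body = text.partition("\n\n")

# "2.<n>." followed by whitespace, so "2.1.1." does not open a new section.
heading = re.compile(r"\s*(2\.\d+\.)\s")
# Assumption: top-level headings such as "2. All Data" start in column 0.
top_level = re.compile(r"\d+\.\s")

sections = {}
current = None
for line in body.splitlines():
    if top_level.match(line):
        current = None          # a new top-level section closes the open one
        continue
    m = heading.match(line)
    if m:
        current = m.group(1)
        sections[current] = []
    if current is not None:
        sections[current].append(line)

for number, lines in sections.items():
    print(number, lines)
```

Each key of `sections` is a section number and each value is the heading line followed by its content lines, including any blank lines inside the section.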
---

If you need everything in 2. All Data, then you should rather split on "\n2." and take the last part:

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('---')
parts = text.split('\n2.')
print('2.' + parts[-1])

Result:

---
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
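One caveat with split('\n2.')[-1]: if the separator never occurs, the last part is the whole text, and prefixing it with '2.' silently produces wrong output. A small sketch using str.rpartition, which reports whether the separator was found, is shown here; the shortened `text` and the empty-string fallback are assumptions made for the illustration.

```python
text = (
    "Table of Contents\n1. Intro\n2. All Data\n\n"
    "1. Intro\n blah\n\n2. All Data\n  2.1. Section 1\n"
)

# rpartition splits at the LAST occurrence and returns (before, sep, after);
# sep is an empty string when the separator is not found.
before, sep, after = text.rpartition("\n2.")
if sep:
    part = "2." + after
else:
    part = ""  # assumed fallback: heading not found

print(part)
```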

If you want to split into sections, you can split the previous result on "\n  2." (newline plus two spaces). Later you can split every section into subsections using "\n    2." (newline plus four spaces):

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('- part -')
parts = text.split('\n2.')
part = '2.' + parts[-1].rstrip() 
print(part)

result = []
all_sections = part.split('\n  2.')
for section in all_sections[1:]:
    print('- section -')
    print('  2.' + section.rstrip())
    all_subsections = section.rstrip().split('\n    2.')
    result += ['    2.'+x for x in all_subsections]
    for subsection in all_subsections[1:]:
        print('- subsection -')
        print('    2.' + subsection.rstrip())
    
print('--- result ---')        
for item in result:
    print(item)
    print('---')

Result:

- part -
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- section -
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.
- subsection -
    2.1.1. Subsection 1
    blah. blah
- subsection -
    2.1.2. Subsection 2
    Blah. Blah.
- section -
  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- subsection -
    2.2.1. Subsection 1
    Blah. Blah.
- subsection -
    2.2.2. Subsection 2
    Blah. Blah.

--- result ---
    2.1. Section 1
---
    2.1.1. Subsection 1
    blah. blah
---
    2.1.2. Subsection 2
    Blah. Blah.
---
    2.2. Section 1
---
    2.2.1. Subsection 1
    Blah. Blah.
---
    2.2.2. Subsection 2
    Blah. Blah.
---

Or you can use a regex to split part on \n\s*2\.:

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('- part -')
parts = text.split('\n2.')
part = '2.' + parts[-1].rstrip() 
print(part)
print('- end -')

import re

result = re.split(r'\n\s*2\.', part)  # raw string avoids invalid-escape warnings for \s and \.
result = ['2.'+x for x in result]
for item in result[1:]:
    print(item)
    print('---')

Result:

- part -
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- end -

2.1. Section 1
---
2.1.1. Subsection 1
    blah. blah
---
2.1.2. Subsection 2
    Blah. Blah.
---
2.2. Section 1
---
2.2.1. Subsection 1
    Blah. Blah.
---
2.2.2. Subsection 2
    Blah. Blah.
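In the output above the indentation of each heading is lost, because the split pattern consumes it and '2.' has to be added back by hand. A possible variation, sketched here on a shortened `part`, is to split on the newline only and use a lookahead for the rest, so that the indentation and the number stay in each piece. Using [ \t]* instead of \s* is a choice made for this sketch so that the lookahead does not run across blank lines.

```python
import re

part = """2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah."""

# Split at each newline that is followed by optional indentation and "2.";
# the lookahead (?=...) is not consumed, so the piece keeps its own prefix.
pieces = re.split(r"\n(?=[ \t]*2\.)", part)

for piece in pieces[1:]:  # pieces[0] is the "2. All Data" heading
    print(piece.rstrip())
    print("---")
```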

