How to divide a text file sectionwise using Python

I have a text file as follows:

Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.

and so on.

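I want to extract only the bottom sections and their content (2.1., 2.2., etc.). I don't want to match the "Table of Contents". The problem is that section names and subsection names can be identical, so I am trying to match on the section numbers, i.e. 2.1., 2.2., etc. I am trying the code below, but with no luck.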

with open(output_file, 'r') as f:
    for index in range(1,11):
        section = "2." + str(index) + "\."
        self.log.info("Section : " + section)
        for key, group in it.groupby(f, lambda line: line.startswith(section)):
            if not key:
                group = list(group)
                print("Group:" + str(group))

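Your original code has a few mistakes:

  • You use a plain string, not a regex, so you should use . instead of \.
  • You read from the file inside the loop, so the first iteration reads all lines up to the end of the file, and the next iteration starts at the end of the file and has nothing left to read. You should read all lines into a list in memory and work with that list, or move back to the beginning of the file on every iteration with f.seek(0)
  • Lines start with spaces, so line.startswith(section) will not match; you have to strip the whitespace first: line.strip().startswith(section)
  • You have to handle the groups where key is true: if key: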
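The file-position and whitespace points can be checked in a few lines. This is only a small sketch: io.StringIO stands in for the real file, and the sample text is made up for illustration.

```python
import io

sample = "  2.1. Section 1\n    blah\n  2.2. Section 2\n"

f = io.StringIO(sample)
first_pass = list(f)    # consumes the iterator up to the end of the "file"
second_pass = list(f)   # nothing is left, so this list is empty
f.seek(0)               # rewind to the beginning
third_pass = list(f)    # the lines are available again

print(len(first_pass), len(second_pass), len(third_pass))

# Leading spaces defeat startswith() unless the line is stripped first
line = first_pass[0]
print(line.startswith("2.1."), line.strip().startswith("2.1."))
```

A real file object opened with open() behaves the same way when it is iterated twice without seek(0).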

Minimal working code.

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

import itertools as it
import io

#with open(output_file) as f:
with io.StringIO(text) as f:   
    for index in range(1,11):
        section = f"2.{index}."
        print("Section:", section)
        f.seek(0)
        for key, group in it.groupby(f, lambda line: line.strip().startswith(section)):            
            if key:
                print('key:', key)
                group = list(group)
                print("Group:", group)

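You didn't show the expected result, but I think this groupby approach still gives the wrong result: it also matches the Table of Contents, and each group ends at the first line that does not start with the section number.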

Section: 2.1.
key: True
Group: ['  2.1. Section 1\n', '     2.1.1. Subsection 1\n', '     2.1.2. Subsection 2\n']
key: True
Group: ['  2.1. Section 1\n', '    2.1.1. Subsection 1\n']
key: True
Group: ['    2.1.2. Subsection 2\n']
Section: 2.2.
key: True
Group: ['  2.2. Section 2\n', '     2.2.1. Subsection 1\n', '     2.2.2. Subsection 2\n', '     2.2.3. Subsection 3\n', '     2.2.4. Subsection 4\n']
key: True
Group: ['  2.2. Section 1\n', '    2.2.1. Subsection 1\n']
key: True
Group: ['    2.2.2. Subsection 2\n']
Section: 2.3.
Section: 2.4.
Section: 2.5.
Section: 2.6.
Section: 2.7.
Section: 2.8.
Section: 2.9.
Section: 2.10.
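
If you want to stay with a line-by-line approach, one possible sketch is to key everything on the heading number instead of grouping by prefix. The names HEADING and parse_sections are hypothetical, and the code assumes that the Table of Contents ends at the first blank line and that content lines never start with a number followed by a dot.

```python
import re

# Heading lines look like "  2.1. Title" or "    2.1.1. Title"
HEADING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.\s+(.*)$")

def parse_sections(lines):
    """Map heading numbers such as '2.1.1' to (title, content lines).

    Assumption: the table of contents ends at the first blank line.
    """
    sections = {}
    current = None
    in_toc = True
    for line in lines:
        if in_toc:
            # Skip everything up to and including the first blank line
            if not line.strip():
                in_toc = False
            continue
        match = HEADING.match(line)
        if match:
            current = match.group(1)
            sections[current] = (match.group(2), [])
        elif current is not None and line.strip():
            # Non-heading, non-blank line: content of the current heading
            sections[current][1].append(line.strip())
    return sections

sample = """Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.
"""

sections = parse_sections(sample.splitlines())
for number, (title, content) in sections.items():
    if number.startswith("2."):
        print(number, title, content)
```

Because the dictionary key is the number, two sections with the same name stay separate.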
---

If you need the whole "2. All Data" part, then you should split on "\n2." and take the last piece.

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('---')
parts = text.split('\n2.')
print('2.' + parts[-1])

Result:

---
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
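
The same idea can be written with str.rpartition, which makes "take everything after the last match" explicit and lets you detect the case where the heading is missing. This is a sketch only: last_top_level_section is a hypothetical helper name, and like the split above it assumes that no other line in the body starts with "2." in the first column.

```python
def last_top_level_section(text, number="2."):
    """Return the text from the last line starting with `number` to the end.

    Returns an empty string if no line starts with `number`.
    """
    marker = "\n" + number
    before, sep, after = text.rpartition(marker)
    if not sep:
        # The marker was not found at the start of any line
        return ""
    return number + after

sample = (
    "Table of Contents\n"
    "1. Intro\n"
    "2. All Data\n"
    "  2.1. Section 1\n"
    "\n"
    "1. Intro\n"
    " blah\n"
    "\n"
    "2. All Data\n"
    "  2.1. Section 1\n"
    "    text\n"
)

body = last_top_level_section(sample)
print(body)
```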

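If you want to split it into sections, then on the previous result you can use split("\n  2.") (two spaces), and later you can use split("\n    2.") (four spaces) on each section to get the subsections.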

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('- part -')
parts = text.split('\n2.')
part = '2.' + parts[-1].rstrip() 
print(part)

result = []
all_sections = part.split('\n  2.')
for section in all_sections[1:]:
    print('- section -')
    print('  2.' + section.rstrip())
    all_subsections = section.rstrip().split('\n    2.')
    result += ['    2.'+x for x in all_subsections]
    for subsection in all_subsections[1:]:
        print('- subsection -')
        print('    2.' + subsection.rstrip())
    
print('--- result ---')        
for item in result:
    print(item)
    print('---')

Result:

- part -
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- section -
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.
- subsection -
    2.1.1. Subsection 1
    blah. blah
- subsection -
    2.1.2. Subsection 2
    Blah. Blah.
- section -
  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- subsection -
    2.2.1. Subsection 1
    Blah. Blah.
- subsection -
    2.2.2. Subsection 2
    Blah. Blah.

--- result ---
    2.1. Section 1
---
    2.1.1. Subsection 1
    blah. blah
---
    2.1.2. Subsection 2
    Blah. Blah.
---
    2.2. Section 1
---
    2.2.1. Subsection 1
    Blah. Blah.
---
    2.2.2. Subsection 2
    Blah. Blah.
---
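
If a nested structure is more convenient than a flat list, the two fixed-indentation splits can be wrapped in a function. This is a rough sketch: split_sections is a hypothetical name, and it still assumes that sections are indented by exactly two spaces and subsections by exactly four.

```python
def split_sections(part):
    """Return {section heading: [subsection text, ...]} using the fixed indents."""
    nested = {}
    # The first piece is the "2. All Data" heading itself, so skip it
    for section in part.split("\n  2.")[1:]:
        pieces = section.rstrip().split("\n    2.")
        heading = "2." + pieces[0].strip()
        nested[heading] = ["2." + p.rstrip() for p in pieces[1:]]
    return nested

part = (
    "2. All Data\n"
    "  2.1. Section 1\n"
    "    2.1.1. Subsection 1\n"
    "    blah. blah\n"
    "    2.1.2. Subsection 2\n"
    "    Blah. Blah.\n"
    "\n"
    "  2.2. Section 1\n"
    "    2.2.1. Subsection 1\n"
    "    Blah. Blah.\n"
)

nested = split_sections(part)
for heading, subsections in nested.items():
    print(heading, len(subsections))
```

The section number is part of each key, so "2.1. Section 1" and "2.2. Section 1" do not collide even though the names are the same.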

Or you can use a regex and split part on \n\s*2\. so that the amount of indentation does not matter.

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('- part -')
parts = text.split('\n2.')
part = '2.' + parts[-1].rstrip() 
print(part)
print('- end -')

import re

result = re.split(r'\n\s*2\.', part)  # raw string, so \s and \. reach the regex engine unchanged
result = ['2.'+x for x in result]
for item in result[1:]:
    print(item)
    print('---')

Result:

- part -
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- end -

2.1. Section 1
---
2.1.1. Subsection 1
    blah. blah
---
2.1.2. Subsection 2
    Blah. Blah.
---
2.2. Section 1
---
2.2.1. Subsection 1
    Blah. Blah.
---
2.2.2. Subsection 2
    Blah. Blah.
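
A small variation, shown here only as a sketch: with a lookahead the "2." is not consumed by the split, so it does not have to be added back afterwards. The sample text is shortened, and as with the pattern above, any content line that starts with "2." would also be treated as a heading.

```python
import re

part = (
    "2. All Data\n"
    "  2.1. Section 1\n"
    "    2.1.1. Subsection 1\n"
    "    blah. blah\n"
    "\n"
    "  2.2. Section 1\n"
    "    2.2.1. Subsection 1\n"
    "    Blah. Blah."
)

# (?=2\.) is a lookahead: "2." must follow, but it stays in the next piece
pieces = re.split(r"\n\s*(?=2\.)", part)

# pieces[0] is the "2. All Data" heading, so start from the second piece
for piece in pieces[1:]:
    print(piece)
    print('---')
```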
