How to divide a text file sectionwise using Python

I have a text file like the following:

Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.

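and so on.

I want to extract only the bottom sections and their contents (2.1., 2.2., etc.). I don't want to match the "Table of Contents". The problem is that section names and subsection names can be identical, so I am trying to match the section numbers instead, i.e. 2.1., 2.2., etc. I am trying the code below, but with no luck.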

with open(output_file, 'r') as f:
    for index in range(1,11):
        section = "2." + str(index) + "\."
        self.log.info("Section : " + section)
        for key, group in it.groupby(f, lambda line: line.startswith(section)):
            if not key:
                group = list(group)
                print("Group:" + str(group))

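Your original code has a few mistakes:

  • You use a normal string, not a regular expression, so you should use . instead of \.
  • You read from the file inside a loop, so the first iteration reads all lines to the end of the file, and the next iteration starts at the end of the file and has nothing left to read. You should either read all lines into a list in memory and work with that list, or move back to the beginning of the file in every iteration with f.seek(0)
  • The lines start with spaces, so line.startswith(section) will not match. You have to strip the whitespace first: line.strip().startswith(section)
  • You have to process the groups for which key is true: if key: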
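The second and third points can be checked with a small sketch. It uses io.StringIO as a stand-in for the real file (an assumption for illustration only); a file object returned by open() is consumed and rewound in the same way.

```python
import io

# io.StringIO stands in for a real file here (assumption for illustration).
f = io.StringIO("  2.1. Section 1\n    blah\n  2.2. Section 2\n")

first_pass = list(f)    # consumes every line
second_pass = list(f)   # the iterator is already at the end: nothing left
f.seek(0)               # rewind to the beginning
third_pass = list(f)    # all lines are available again

print(len(first_pass), len(second_pass), len(third_pass))

# Leading spaces prevent a plain startswith() match; strip() first.
line = first_pass[0]
print(line.startswith("2.1."), line.strip().startswith("2.1."))
```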

Minimal working code:

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

import itertools as it
import io

#with open(output_file) as f:
with io.StringIO(text) as f:   
    for index in range(1,11):
        section = f"2.{index}."
        print("Section:", section)
        f.seek(0)
        for key, group in it.groupby(f, lambda line: line.strip().startswith(section)):            
            if key:
                print('key:', key)
                group = list(group)
                print("Group:", group)

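You didn't show the expected result, but I think this code still gives the wrong result: it also matches the table of contents, and it drops the content lines.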

Section: 2.1.
key: True
Group: ['  2.1. Section 1\n', '     2.1.1. Subsection 1\n', '     2.1.2. Subsection 2\n']
key: True
Group: ['  2.1. Section 1\n', '    2.1.1. Subsection 1\n']
key: True
Group: ['    2.1.2. Subsection 2\n']
Section: 2.2.
key: True
Group: ['  2.2. Section 2\n', '     2.2.1. Subsection 1\n', '     2.2.2. Subsection 2\n', '     2.2.3. Subsection 3\n', '     2.2.4. Subsection 4\n']
key: True
Group: ['  2.2. Section 1\n', '    2.2.1. Subsection 1\n']
key: True
Group: ['    2.2.2. Subsection 2\n']
Section: 2.3.
Section: 2.4.
Section: 2.5.
Section: 2.6.
Section: 2.7.
Section: 2.8.
Section: 2.9.
Section: 2.10.
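
The groups above come out fragmented because itertools.groupby only groups consecutive items that share the same key. Every content line that does not start with the section number ends the current group. A minimal sketch with a hypothetical four-line sample:

```python
import itertools as it

# Hypothetical sample: one content line between two matching headings.
lines = ["2.1. Section 1", "2.1.1. Subsection 1", "blah. blah", "2.1.2. Subsection 2"]

runs = [(key, list(group))
        for key, group in it.groupby(lines, lambda line: line.startswith("2.1."))]

for key, group in runs:
    print(key, group)
```

The non-matching line "blah. blah" forms its own group with key False, so the lines before and after it can never end up in the same group.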
---

If you need the whole 2. All Data part, then you should split on "\n2." and take the last part.

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('---')
parts = text.split('\n2.')
print('2.' + parts[-1])

Result:

---
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
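
A possible variant, assuming the body's "2." heading is the last line in the file that starts with "2." in the first column: str.rpartition cuts the text only once, at the last occurrence, and reports whether the separator was found at all. The shortened sample text below is made up for illustration.

```python
# Shortened, hypothetical sample with a table of contents and a body.
text = (
    "Table of Contents\n1. Intro\n2. All Data\n  2.1. Section 1\n\n"
    "1. Intro\n blah\n\n2. All Data\n  2.1. Section 1\n    blah. blah\n"
)

# rpartition returns (head, separator, tail); separator is '' if not found.
head, sep, tail = text.rpartition("\n2.")
part = "2." + tail.rstrip() if sep else ""
print(part)
```

If the separator is missing, part is an empty string instead of a copy of the whole text, which makes the "no body found" case easy to detect.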

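If you want to split it into sections, then on the previous result you can use split("\n  2.") (two spaces), and after that split("\n    2.") (four spaces) to get the subsections.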

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('- part -')
parts = text.split('\n2.')
part = '2.' + parts[-1].rstrip() 
print(part)

result = []
all_sections = part.split('\n  2.')
for section in all_sections[1:]:
    print('- section -')
    print('  2.' + section.rstrip())
    all_subsections = section.rstrip().split('\n    2.')
    result += ['    2.'+x for x in all_subsections]
    for subsection in all_subsections[1:]:
        print('- subsection -')
        print('    2.' + subsection.rstrip())
    
print('--- result ---')        
for item in result:
    print(item)
    print('---')

Result:

- part -
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- section -
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.
- subsection -
    2.1.1. Subsection 1
    blah. blah
- subsection -
    2.1.2. Subsection 2
    Blah. Blah.
- section -
  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- subsection -
    2.2.1. Subsection 1
    Blah. Blah.
- subsection -
    2.2.2. Subsection 2
    Blah. Blah.

--- result ---
    2.1. Section 1
---
    2.1.1. Subsection 1
    blah. blah
---
    2.1.2. Subsection 2
    Blah. Blah.
---
    2.2. Section 1
---
    2.2.1. Subsection 1
    Blah. Blah.
---
    2.2.2. Subsection 2
    Blah. Blah.
---
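
Splitting on fixed indentation depends on the exact number of spaces. As a rough alternative sketch (not part of the approach above), the extracted part can be walked line by line and every heading recognised by its dotted number. The names heading and sections are made up for this example, and it assumes that content lines never start with a dotted number themselves.

```python
import re

part = """2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah."""

# Heading: optional indent, a dotted number such as "2.1." or "2.1.2.", a title.
heading = re.compile(r"^\s*((?:\d+\.)+)\s+(.*)$")

sections = {}    # section number -> {"title": str, "body": list of lines}
current = None
for line in part.splitlines():
    match = heading.match(line)
    if match:
        current = match.group(1)
        sections[current] = {"title": match.group(2), "body": []}
    elif current is not None and line.strip():
        # Non-empty, non-heading line: content of the most recent heading.
        sections[current]["body"].append(line.strip())

for number, data in sections.items():
    print(number, data["title"], data["body"])
```

Because the dictionary is keyed by section number, identical titles such as "Subsection 1" under 2.1. and 2.2. do not collide.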

Or you can split part with the regex \n\s*2\.

text = '''Table of Contents
1. Intro
2. All Data
  2.1. Section 1
     2.1.1. Subsection 1
     2.1.2. Subsection 2
  2.2. Section 2
     2.2.1. Subsection 1
     2.2.2. Subsection 2
     2.2.3. Subsection 3
     2.2.4. Subsection 4

1. Intro
 blah. blah. blah

2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
'''

print('- part -')
parts = text.split('\n2.')
part = '2.' + parts[-1].rstrip() 
print(part)
print('- end -')

import re

# raw string, so that \s and \. reach the regex engine unchanged
result = re.split(r'\n\s*2\.', part)
result = ['2.'+x for x in result]
for item in result[1:]:
    print(item)
    print('---')

Result:

- part -
2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah
    2.1.2. Subsection 2
    Blah. Blah.

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah.
    2.2.2. Subsection 2
    Blah. Blah.
- end -

2.1. Section 1
---
2.1.1. Subsection 1
    blah. blah
---
2.1.2. Subsection 2
    Blah. Blah.
---
2.2. Section 1
---
2.2.1. Subsection 1
    Blah. Blah.
---
2.2.2. Subsection 2
    Blah. Blah.
---
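
If the section numbers themselves are needed, one option is to put the number in a capturing group: re.split then keeps the captured text in the result list, alternating with the text between matches. This is only a sketch on a shortened, made-up sample, and it assumes that no content line starts with a number of the form 2.x.

```python
import re

part = """2. All Data
  2.1. Section 1
    2.1.1. Subsection 1
    blah. blah

  2.2. Section 1
    2.2.1. Subsection 1
    Blah. Blah."""

# The capturing group makes re.split keep each section number in the output.
pieces = re.split(r"\n\s*(2\.(?:\d+\.)+)", part)

numbers = pieces[1::2]                             # "2.1.", "2.1.1.", ...
bodies = [body.strip() for body in pieces[2::2]]   # title plus content
by_number = dict(zip(numbers, bodies))

for number, body in by_number.items():
    print(number, repr(body))
```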
