Splitting a text file in Python

Question

I am working with a text file like this:

Chapter 01

Lorem ipsum

dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt

Chapter 02

consectetur adipiscing

sed do eiusmod tempor

Chapter 03

et dolore magna aliqua.

The delimiter may be written "chapter", "Chapter", "CHAPTER", etc., followed by a 1- or 2-digit number ("Chapter 1" or "Chapter 01").

I managed to open and read the file in Python using .open() and .read():

mytext = myfile.read()

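Now I need to split my string to get the text of "Chapter XX".

For Chapter 02, that would be:

consectetur adipiscing

sed do eiusmod tempor

I'm new to Python. I have read about regex, match, map, and split, but... well...

(I'm writing a Gimp Python-fu plugin, so I'm using the Python version bundled with Gimp, which is 2.7.15.)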

Answer 1

You can use a regular expression like this:

import re

split_text = re.split("Chapter [0-9]+\n",  # splits on "Chapter " + numbers + newline
                      mytext, 
                      flags=re.IGNORECASE) # splits on "CHAPTER"/"chapter"/"Chapter" etc

>>> split_text
['', '\nLorem ipsum\n\ndolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt\n\n', '\nconsectetur adipiscing\n\nsed do eiusmod tempor\n\n', '\net dolore magna aliqua.']

You can now select the text of each chapter by indexing split_text, for example:

print(split_text[2])

>>> 
consectetur adipiscing

sed do eiusmod tempor
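
If looking chapters up by number is more convenient than by list position, the same re.split idea can be extended with a capturing group, which makes re.split keep the matched numbers in its result. The following is only a sketch under a few assumptions: the helper name split_chapters and the sample string are made up for illustration, headings are assumed to sit alone on their own line, and the code is written to avoid Python 3 only syntax so that it should also run on 2.7.

```python
import re

def split_chapters(text):
    """Return a dict mapping chapter number (int) to that chapter's body text."""
    # The capturing group makes re.split keep the chapter numbers, so the
    # result looks like [preamble, number, body, number, body, ...].
    parts = re.split(r"^chapter[ \t]+(\d{1,2})[ \t]*\n", text,
                     flags=re.IGNORECASE | re.MULTILINE)
    chapters = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        # int() turns both "2" and "02" into the same key.
        chapters[int(number)] = body.strip()
    return chapters

# Hypothetical sample mixing the heading styles mentioned in the question.
sample = (
    "Chapter 01\n\nLorem ipsum\n\n"
    "CHAPTER 2\n\nconsectetur adipiscing\n\nsed do eiusmod tempor\n\n"
    "chapter 03\n\net dolore magna aliqua."
)
chapters = split_chapters(sample)
print(chapters[2])
```

Because the keys are integers, chapters[2] finds the chapter whether the heading was written "Chapter 2" or "Chapter 02".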

Answer 2

You can try this:

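chapter = [""]
for i in range(1, 4):
    # Locate this chapter's heading and the next one (zero-padded, e.g. "Chapter 02").
    nb1 = text.find("Chapter %02d" % i)
    nb2 = text.find("Chapter %02d" % (i + 1))
    if nb2 == -1:
        # No following heading: the last chapter runs to the end of the text.
        nb2 = len(text)
    chapter.append(text[nb1:nb2])

for i in range(1, 4):
    print(chapter[i])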

Or use a regular expression:

import re

# [0-9]+ matches any chapter number, not only those made of the digits 0-4.
chapter = re.split("Chapter [0-9]+\n", text)

for i in range(1,4):
    print(chapter[i])
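
Both variants above assume exactly three chapters. A position-based approach can also work without knowing the chapter count in advance by locating every heading with re.finditer and slicing between consecutive matches. This is a minimal sketch, not a definitive implementation: the names HEADING and chapter_bodies are hypothetical, and headings are assumed to occupy a whole line.

```python
import re

# Matches a heading line such as "Chapter 1", "CHAPTER 01" or "chapter 12".
HEADING = re.compile(r"^chapter[ \t]+\d{1,2}[ \t]*$", re.IGNORECASE | re.MULTILINE)

def chapter_bodies(text):
    """Return the body of each chapter, in order of appearance."""
    matches = list(HEADING.finditer(text))
    bodies = []
    for index, match in enumerate(matches):
        start = match.end()
        # The body ends where the next heading starts, or at the end of the text.
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(text)
        bodies.append(text[start:end].strip())
    return bodies

text = "Chapter 1\n\nLorem ipsum\n\nChapter 2\n\nsed do eiusmod tempor\n"
for body in chapter_bodies(text):
    print(body)
```

Unlike the find() loop, the last chapter needs no special sentinel here, because the end of the text is used whenever there is no following heading.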

Answer 3

import re

# Remove empty strings from the result.
splitted_str = list(filter(lambda x: x != '', re.split("Chapter [0-9]+", my_text)))
print(splitted_str)
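
Note that filtering on x != '' only drops strings that are exactly empty, so chunks consisting solely of newlines are kept and the remaining chunks still carry their surrounding blank lines. A possible variation, sketched here with a made-up my_text value, strips each chunk first and drops whatever is empty afterwards:

```python
import re

# Hypothetical input in which Chapter 02 has no text at all.
my_text = "Chapter 01\n\nLorem ipsum\n\nChapter 02\n\n\nChapter 03\n\net dolore magna aliqua."

chunks = re.split(r"Chapter [0-9]+", my_text)
# strip() trims the newlines around each chunk; chunks that are empty after
# trimming (the text before the first heading, or an empty chapter) are dropped.
cleaned = [chunk.strip() for chunk in chunks if chunk.strip()]
print(cleaned)
```

One caveat with any such filtering: once an empty chapter is dropped, list positions no longer line up with chapter numbers, so this suits cases where only the non-empty texts matter.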


3 answers

Answer 1
1 accepted 2018-07-21 10:42:09

Answer 2
0 2018-07-21 10:40:25

Answer 3
0 2022-06-23 06:59:16
