Splitting a txt file into multiple new files with a regular expression

Question

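I'm calling on the collective wisdom of Stack Overflow because I'm at my wits' end trying to figure out how to do this, and I'm a novice, self-taught coder.

I have a txt file of letters to the editor that I need to split into individual files, one per letter.

The letters are all formatted in roughly the same way: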

For once, before offering such generous but the unasked for advice, put yourselves in...

Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...

Why is it that The Times does not urge totalitarian Arab slates and terrorist...

PAUL STONEHILL Los Angeles

There you go again. Your editorial again makes groundless criticisms of the Israeli...

On Dec. 7 you called proportional representation “bizarre," despite its use in the...

Proportional representation distorts Israeli politics? Huh? If Israel changes the...

MATTHEW SHUGART Laguna Beach

Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...

Although the mayor did not support Proposition U (the slow-growth initiative) his...

If West Los Angeles is any indication of the no-growth policy, where do we go from here?

MARJORIE L. SCHWARTZ Los Angeles

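I think the best way to go about this is to use a regex to identify the lines that start with a name in all caps, since that is the only way to really tell where one letter ends and the next begins.

I have tried quite a few different approaches, but none of them works quite right. All the other answers I have seen rely on a repeatable line or word (for example the answers posted here, how to split single txt file into multiple txt files by Python, and here, Python read through file until match, read until next pattern). They all seem to stop working once I adapt them to accept my regex for all-caps words.

The closest I have managed to get is the code below. It creates the right number of files, but after the second file is created everything goes wrong. The third file is empty, and in all the remaining files the text is out of order and/or incomplete. Paragraphs that should be in file 4 end up in file 5 or file 7, etc., or are missing entirely.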

import re
thefile = raw_input('Filename to split: ')
name_occur = [] 
full_file = []
pattern = re.compile("^[A-Z]{4,}")

with open (thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line) 

totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)","",thefile)

while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
        if pattern.search(line):
            doIHaveToCopyTheLine = True
            ignoreLines = True
    letters += 1
    f1.close()

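I'm willing to scrap this approach entirely and do it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am, if you are awesome enough to take the time to help me.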

Answer 1

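I took a simpler approach and avoided regex. The tactic here is essentially to count the capital letters in the first three words and make sure they pass some logic. I went for the first word being uppercase and either the second or third word being uppercase too, but you can adjust this as needed. Each letter is then written to a new file with the same name as the original file (note: it assumes your file has an extension such as .txt) but with an incrementing integer appended. Try it out and see how it works for you.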

import string

def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    # Assumes the filename has an extension, e.g. "letters.txt".
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)

    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        # Reduce every word on the line to its capital letters only.
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)

        # A word counts as "upper" if it contains at least two capitals.
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1

        current_letter.append(line)
        if first_word_upper and (second_word_upper or third_word_upper):
            # Signature line: write out the letter collected so far.
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1
    # Note: any text after the last signature line is not written anywhere.

I tested it on your sample input and it worked fine.
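The capital-letter rule above can also be pulled out into a small predicate, so it can be tried on individual lines before any files are written. This is only a minimal sketch of the same rule (the first word, plus the second or third word, each containing at least two capitals); the name `looks_like_signature` is made up for illustration and is not part of the answer's code.

```python
import string

def looks_like_signature(line):
    """Return True if the line passes the capital-letter rule described above."""
    # Reduce each of the first three words to its capital letters only.
    caps = [''.join(c for c in word if c in string.ascii_uppercase)
            for word in line.split()[:3]]
    # A word counts as "upper" when it contains at least two capitals.
    upper = [len(word) > 1 for word in caps]
    return len(upper) > 1 and upper[0] and any(upper[1:])

print(looks_like_signature("MARJORIE L. SCHWARTZ Los Angeles"))       # True
print(looks_like_signature("If West Los Angeles is any indication"))  # False
```

Because the rule counts capitals anywhere in a word, a body line that happens to start with mixed-case words such as "McDonald DeWitt" would also pass, so it is worth checking the predicate against your real data first.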

Answer 2

While the other answer is suitable, you may still be curious about using a regex to split up the file.

import re

# A signature line starts with a run of capitals, whitespace and periods.
# Caveat: this also matches lines that begin with a one-letter word such as "I " or "A ".
pattern = re.compile(r'^([A-Z\s\.]+\b)')
buf = ""
with open('input_file.txt', 'rt') as f:
    for line in f:
        buf += line
        match = pattern.search(line)
        if match:
            # Name the output file after the signature and flush the buffer into it.
            smallfile_name = '{}.txt'.format(match.group(1).strip())
            with open(smallfile_name, 'w') as smallfile:
                smallfile.write(buf)
            buf = ""
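To experiment with the regex approach without touching the filesystem, the grouping step can be separated from the writing step. The sketch below is one possible variation, not the answer's exact code: it assumes a signature is two or more all-caps tokens at the start of a line (initials such as "L." allowed), and the names `SIGNATURE` and `group_letters` are hypothetical.

```python
import re

# Assumed signature shape: an all-caps word followed by at least one more
# all-caps token (or initial), e.g. "MARJORIE L. SCHWARTZ".
SIGNATURE = re.compile(r'^[A-Z]{2,}(?:[ \t]+[A-Z][A-Z.]*)+(?=\s|$)')

def group_letters(lines):
    """Return a list of (author, text) pairs, one per letter."""
    letters = []
    buf = []
    for line in lines:
        buf.append(line)
        match = SIGNATURE.match(line)
        if match:
            # The signature line closes the current letter.
            letters.append((match.group(0), ''.join(buf)))
            buf = []
    return letters

sample = [
    "There you go again.\n",
    "\n",
    "MATTHEW SHUGART Laguna Beach\n",
    "\n",
    "Was Mayor Tom Bradley's veto a political move?\n",
    "\n",
    "MARJORIE L. SCHWARTZ Los Angeles\n",
]
for author, text in group_letters(sample):
    print(author, len(text.splitlines()))
```

Each pair can then be written to a file named however you like. Numbering the files instead of naming them after the author avoids overwriting when the same person signs two letters.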

Answer 3

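If you are running on Linux, use csplit.

Otherwise, check out these two threads:

How can I split a text file into multiple text files using Python?

How to match "anything up until this sequence of characters" in a regular expression?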
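For the pure-Python route, one option in the spirit of the threads above is to read the whole file into a string and let a single multiline regex capture everything up to and including each signature line. This is a rough sketch rather than a tested solution for your data: the signature pattern (two or more all-caps tokens at the start of a line) is an assumption, and `LETTER` and `split_text` are illustrative names.

```python
import re

# Lazily match everything up to a line that starts with an assumed signature
# (two or more all-caps tokens), then consume the rest of that line.
LETTER = re.compile(
    r'.*?^[A-Z]{2,}(?:[ \t]+[A-Z][A-Z.]*)+(?=\s|$)[^\n]*\n?',
    re.DOTALL | re.MULTILINE)

def split_text(text):
    """Return one string per letter, each ending with its signature line."""
    return LETTER.findall(text)

text = (
    "First letter body.\n"
    "\n"
    "PAUL STONEHILL Los Angeles\n"
    "\n"
    "Second letter body.\n"
    "\n"
    "MATTHEW SHUGART Laguna Beach\n"
)
for i, letter in enumerate(split_text(text), 1):
    print(i, repr(letter))
```

Any text after the last signature line is not captured by this pattern, so a trailing unsigned letter would need separate handling.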

Answer 1
1 accepted 2017-02-07 07:05:52

Answer 2
1 2017-02-07 08:02:56

Answer 3
1 2019-10-27 09:05:28