Combining lines in a txt file with Python

Question

A question about combining lines in a txt file.

The file contents are shown below (movie subtitles). I want to combine the subtitle text, the English words and sentences in each block, into one line, instead of the 1, 2, or 3 separate lines shown now.

Could you please advise which method is feasible in Python? Many thanks.

1
00:00:23,343 --> 00:00:25,678
Been a while since I was up here
in front of you.

2
00:00:25,762 --> 00:00:28,847
Maybe I'll do us all a favour
and just stick to the cards.

3
00:00:31,935 --> 00:00:34,603
There's been speculation that I was
involved in the events that occurred
on the freeway and the rooftop...

4
00:00:36,189 --> 00:00:39,233
Sorry, Mr Stark, do you
honestly expect us to believe that

5
00:00:39,317 --> 00:00:42,903
that was a bodyguard
in a suit that conveniently appeared,

6
00:00:42,987 --> 00:00:45,698
despite the fact
that you sorely despise bodyguards?

7
00:00:45,782 --> 00:00:46,907
Yes.

8
00:00:46,991 --> 00:00:51,662
And this mysterious bodyguard
was somehow equipped

Answer 1

Intuitive solution

A simple solution based on the 4 types of lines you can have:

an empty line
a number indicating the position (no letters)
a timing for the subtitle (with a specific pattern; no letters)
text

You can just loop over the lines, classify each one, and act accordingly.

In fact, the "action" for a non-text, non-empty line (timing or number) is the same. Thus:

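import re

# Read the whole subtitle file into one string.
with open('yourfile.txt') as f:
    exampleText = f.read()

new = ''

for line in exampleText.split('\n'):
    if line == '':
        # Blank line: end the joined text line and keep the separator.
        new += '\n\n'
    elif re.search('[a-zA-Z]', line):  # check if there is text
        # Text line: append it followed by a space instead of a newline.
        new += line + ' '
    else:
        # Index or timing line: keep it on its own line.
        new += line + '\n'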

Result:

>>> print(new)
1
00:00:23,343 --> 00:00:25,678
Been a while since I was up here in front of you. 

2
00:00:25,762 --> 00:00:28,847
Maybe I'll do us all a favour and just stick to the cards. 
...

Regex explained:

[] matches any one of the characters inside
a-z is the range of characters from a to z
A-Z is the range of characters from A to Z
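
To see how this check sorts the line types, here is a minimal sketch. The helper name `classify` and the labels it returns are made up for illustration; they are not part of the answer itself.

```python
import re

def classify(line):
    """Label a subtitle line as 'empty', 'text', or 'other' (index or timing)."""
    if line == '':
        return 'empty'
    if re.search('[a-zA-Z]', line):  # at least one ASCII letter
        return 'text'
    return 'other'  # no letters: an index number or a timing line

sample = [
    '1',
    '00:00:23,343 --> 00:00:25,678',
    'Been a while since I was up here',
    '',
]
labels = [classify(line) for line in sample]
print(labels)
```

Note that a subtitle line with no ASCII letters at all, such as `...` or a bare number, is labelled 'other' by this check and would therefore not be joined.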

Answer 2

The pattern seems to be:

1) a line with just a number,
2) the next line with timing information, and
3) one or more lines of text, separated by a blank line.

I would write a loop that reads lines 1) and 2), and then a nested loop that reads lines 3) until it finds a blank line. This nested loop could join those lines into a single line.
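
The loop described above might look roughly like the sketch below. The function name `join_subtitle_text` is hypothetical, and the sketch assumes every block follows the number, timing, text layout; it works on a list of lines so it can be tried without a file.

```python
def join_subtitle_text(lines):
    """Return the lines with each block's text merged into one line."""
    out = []
    it = iter(lines)
    for index_line in it:              # 1) the line with just a number
        if index_line.strip() == '':
            continue                   # tolerate extra blank lines
        out.append(index_line.strip())
        timing = next(it, None)        # 2) the timing line
        if timing is None:
            break
        out.append(timing.strip())
        text = []
        for line in it:                # 3) text lines, up to the blank line
            if line.strip() == '':
                break
            text.append(line.strip())
        if text:
            out.append(' '.join(text))
        out.append('')                 # keep the blank separator
    return out

sample = [
    '1',
    '00:00:23,343 --> 00:00:25,678',
    'Been a while since I was up here',
    'in front of you.',
    '',
    '2',
    '00:00:25,762 --> 00:00:28,847',
    "Maybe I'll do us all a favour",
    'and just stick to the cards.',
]
joined = join_subtitle_text(sample)
print('\n'.join(joined))
```

Both loops pull from the same iterator, so the inner loop picks up exactly where the outer one stopped. For a real file, passing the open file object instead of `sample` should work the same way, since each line is stripped before use.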

Answer 3

This cuts the file into blocks at the blank lines and joins the text of each block; the first block is handled the same way as the rest.

# Python 3; the paths are just the ones from my machine.
with open('/home/cam/Documents/1.txt') as f, open('mytxt', 'w') as f_out:
    # Strip newlines so blank separator lines become empty strings.
    new_lines = [line.strip() for line in f]

    # Block boundaries: a virtual blank line before the first block,
    # every real blank line, and the end of the file.
    space_index = [i for i, x in enumerate(new_lines) if x == ""]
    new_list = [-1] + space_index + [len(new_lines)]

    for start, end in zip(new_list, new_list[1:]):
        # Lines between two boundaries: index, timing, then text lines.
        mylist = new_lines[start + 1:end]
        if len(mylist) < 3:
            continue  # skip empty or incomplete blocks

        # Join every text line (position 2 onwards) into one line.
        final = mylist[:2] + [" ".join(mylist[2:])]
        # Write the block followed by a blank separator line.
        f_out.write("\n".join(final) + "\n\n")
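
The same block-by-block idea can also be sketched by splitting the whole text on blank lines. This is an alternative, not the code from this answer: the function name `merge_blocks` is hypothetical, and it assumes blocks are separated by exactly one blank line and that lines end in `\n`.

```python
def merge_blocks(text):
    """Join the text lines of every subtitle block into a single line."""
    merged = []
    for block in text.strip().split('\n\n'):
        lines = block.split('\n')
        # lines[0] is the index, lines[1] the timing, the rest is text.
        head, body = lines[:2], lines[2:]
        merged.append('\n'.join(head + [' '.join(body)]))
    return '\n\n'.join(merged) + '\n'

sample = (
    "1\n"
    "00:00:23,343 --> 00:00:25,678\n"
    "Been a while since I was up here\n"
    "in front of you.\n"
    "\n"
    "2\n"
    "00:00:25,762 --> 00:00:28,847\n"
    "Maybe I'll do us all a favour\n"
    "and just stick to the cards.\n"
)
merged_text = merge_blocks(sample)
print(merged_text)
```

Files saved with Windows line endings would need the `\r` characters removed first, for example by opening the file in text mode, which translates them on read by default.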

Answer 4

Loading requirements:

import re

with open('yourfile.txt') as f:
    exampleText = f.read()

Concise one-liner

re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n', re.sub(r'([^0-9])\n', r'\g<1> ', exampleText))

The first (inner) replacement turns every newline that follows a non-digit character into a space, which joins the text lines:

tmp = re.sub(r'([^0-9])\n', r'\g<1> ', exampleText)

This also removes the blank line after the last text line of each block, because the newline at the end of that text line is replaced as well. The second replacement restores it by adding a newline in front of each numeric index line:

re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n', tmp)
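
To check the two steps on a small input, here is a sketch that runs them on the first two blocks of the question's file. The variable names `sample` and `result` are only for illustration.

```python
import re

sample = (
    "1\n"
    "00:00:23,343 --> 00:00:25,678\n"
    "Been a while since I was up here\n"
    "in front of you.\n"
    "\n"
    "2\n"
    "00:00:25,762 --> 00:00:28,847\n"
    "Maybe I'll do us all a favour\n"
    "and just stick to the cards.\n"
)

# Step 1: a newline after a non-digit character becomes a space.
tmp = re.sub(r'([^0-9])\n', r'\g<1> ', sample)
# Step 2: put a blank line back in front of each index line.
result = re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n', tmp)
print(result)
```

Each joined line keeps a trailing space. Because the approach relies on digits, a text line that ends in a digit (for example one ending in a year) is not joined to the line after it, and a text line consisting only of digits is treated like an index line.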

4 answers

Answer 1
2, accepted, 2015-05-15 10:51:10

Answer 2
1, 2015-05-15 04:52:48

Answer 3
1, 2015-05-15 06:05:12

Answer 4
1, 2015-05-15 11:32:43
