How to combine every two adjoining lines in a Chinese txt file into one line with Python

I have a Chinese txt file with thousands of lines of sentences, like this:

  1. line 1
  2. line 2
  3. line 3
  4. line 4

…………

I want to combine every two adjoining lines into one line, so that the file becomes:

  1. line 1 + space + line 2
  2. line 3 + space + line 4
  3. line 5 + space + line 6 …………

How can I use Python to do this?

You don't need Python for that; sed is enough:

$ seq 15 > lines
$ cat lines
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
$ sed 'N;s/\n/ /g' lines
1 2
3 4
5 6
7 8
9 10
11 12
13 14
15

According to man sed:

n N Read/append the next line of input into the pattern space.

and

s/regexp/replacement/

Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The replacement may contain the special character & to refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.

Because sed executes the given script once per input line, the trailing newline of that line is not included in the pattern space (including it would be redundant). So the sequence executed is:

  • sed loads a line into the pattern space
  • N : appends the next line to the pattern space. Now that the pattern space holds two lines, they have to be separated by a newline, so there is a newline character in the middle of the pattern space
  • s/\n/ /g : replaces that newline character with a space
  • sed prints the pattern space, as there is nothing more to do with it
  • sed starts again with the next unread line
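
Since the question asks for Python, the sed steps above can be sketched in Python as follows. This is only an illustration of the same sequence, not part of the sed answer; the name join_pairs is hypothetical, and the handling of a trailing unpaired line follows the sample output shown earlier (the last line is emitted on its own).

```python
import io


def join_pairs(stream):
    # Hypothetical helper that mirrors the sed script step by step.
    it = iter(stream)
    for line in it:
        # sed loads a line into the pattern space (without its newline).
        pattern_space = line.rstrip('\n')
        # N: append the next line, separated by a newline, if there is one.
        following = next(it, None)
        if following is not None:
            pattern_space += '\n' + following.rstrip('\n')
        # s/\n/ /g: replace the embedded newline with a space, then emit.
        yield pattern_space.replace('\n', ' ')


if __name__ == '__main__':
    sample = io.StringIO('1\n2\n3\n4\n5\n')
    for merged in join_pairs(sample):
        print(merged)
```

Any iterable of lines works as input, including an open file object, so the whole file never has to be held in memory.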

  1. Read the file and obtain a list of lines (i.e. a list of strings).
  2. Then use a list comprehension, like this one:

    [ l1 + ' ' + l2 for l1,l2 in zip(lines[::2], lines[1::2]) ]

Note that zip stops at the shorter of the two slices, so this only covers every line when the number of lines is even. If len(lines) % 2 == 1, use lines[-1] to print or otherwise handle the last line by itself.
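
A minimal sketch of the list-comprehension approach with the odd-length case handled might look like this. The function name pair_lines is hypothetical, and stripping the line endings first is an assumption made here so that a newline does not end up in the middle of a merged line.

```python
def pair_lines(lines):
    # Remove line endings so they do not land inside a merged line.
    stripped = [line.rstrip('\n') for line in lines]
    # Even-indexed lines are paired with the odd-indexed line that follows.
    merged = [a + ' ' + b for a, b in zip(stripped[::2], stripped[1::2])]
    # zip drops an unpaired final line, so add it back when the count is odd.
    if len(stripped) % 2 == 1:
        merged.append(stripped[-1])
    return merged


if __name__ == '__main__':
    sample = ['line 1\n', 'line 2\n', 'line 3\n']
    for row in pair_lines(sample):
        print(row)
```

This version builds the whole list in memory, which is fine for a few thousand lines.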

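You should iterate over your file as follows:

with open('./chinese.txt') as my_file:
    for line in my_file:
        try:
            # next() pulls the following line from the same file iterator;
            # both lines are stripped so no extra blank lines are printed
            print('{} {}'.format(line.strip(), next(my_file).strip()))
        except StopIteration:  # odd number of lines: print the last one alone
            print(line.strip())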

A file is an iterator over lines in Python. You could use the grouper() recipe from the itertools documentation to group the lines into pairs:

#!/usr/bin/env python2
from itertools import izip_longest

with open('Chinese.txt') as file:
    for line, another in izip_longest(file, file, fillvalue=''):
        print line.rstrip('\n'), another,

The comma at the end of the print statement is the file.softspace hack, used to avoid duplicating newlines.

The code keeps only two lines in memory at a time, so it can handle arbitrarily large files.
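
The snippet above is Python 2 only (izip_longest and the print statement). A rough Python 3 equivalent, assuming the file is UTF-8 encoded and writing the result to a second file, could look like the sketch below. The function name merge_pairs, the paths passed to it and the encoding default are all assumptions for illustration.

```python
from itertools import zip_longest


def merge_pairs(src_path, dst_path, encoding='utf-8'):
    # Hypothetical helper; the encoding default is an assumption.
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, 'w', encoding=encoding) as dst:
        # Passing the same file object twice makes zip_longest draw
        # consecutive lines: (line 1, line 2), (line 3, line 4), ...
        for first, second in zip_longest(src, src, fillvalue=None):
            merged = first.rstrip('\n')
            # second is None only when the file has an odd number of lines.
            if second is not None:
                merged += ' ' + second.rstrip('\n')
            dst.write(merged + '\n')
```

It would be called as merge_pairs('Chinese.txt', 'merged.txt'), where the output file name is made up for the example. As with the original, only two lines are held in memory at any time.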
