How to combine every two adjoining lines in Chinese txt file into one line with Python

I have a Chinese txt file with thousands of lines of sentences, as follows:

  1. line 1
  2. line 2
  3. line 3
  4. line 4

…………

I want to combine every two adjoining lines into one line, so the file should be transformed into:

  1. line 1 + space + line 2
  2. line 3 + space + line 4
  3. line 5 + space + line 6 …………

How can I use Python to finish the combination?

You don't need Python for that; sed is enough:

$ seq 15 > lines
$ cat lines
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
$ sed 'N;s/\n/ /g' lines
1 2
3 4
5 6
7 8
9 10
11 12
13 14
15

According to man sed:

n N Read/append the next line of input into the pattern space.

and

s/regexp/replacement/

Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The replacement may contain the special character & to refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.

Because sed executes the given script once for each line, the line's trailing newline character is not included in the pattern space (it would be redundant to include it). So the executed sequence is:

  • sed loads a line into the pattern space
  • N: appends the next line to the pattern space. With two lines in the pattern space, they have to be separated by a newline, so there is now a newline character in the middle of the pattern space
  • s/\n/ /: replaces the newline character with a space
  • sed then prints the pattern space, as there is nothing more to do with it
  • and starts again with the next line
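
As a rough Python 3 analogue of the sequence above (a sketch, not what sed does internally), the generator below keeps an explicit pattern space. The name join_pairs_like_sed is made up for illustration, and the code assumes the behaviour seen in the output above, where a trailing unpaired line is printed unchanged.

```python
def join_pairs_like_sed(lines):
    """Yield every two consecutive lines joined by a space (illustrative helper)."""
    it = iter(lines)
    for line in it:
        # sed loads a line into the pattern space, without its trailing newline
        pattern_space = line.rstrip("\n")
        # N: append a newline and the next line, if there is one
        following = next(it, None)
        if following is not None:
            pattern_space += "\n" + following.rstrip("\n")
        # s/\n/ /g: replace the embedded newline with a space
        yield pattern_space.replace("\n", " ")


print(list(join_pairs_like_sed(["1\n", "2\n", "3\n"])))  # ['1 2', '3']
```
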
  1. Read the file to obtain a list of lines (i.e., a list of strings without their trailing newlines)
  2. Then use a list comprehension, like this one:

    [ l1 + ' ' + l2 for l1,l2 in zip(lines[::2], lines[1::2]) ]

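Note that this only works as-is for an even number of lines: zip() stops at the shorter of the two slices, so if len(lines) % 2 == 1 the last line is silently dropped. In that case, use lines[-1] to print or otherwise handle the last line by itself.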
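
Put together, the list-based approach with the odd-length case handled might look like the sketch below (Python 3). The function names, the file path argument, and the UTF-8 default are assumptions for illustration; use the encoding your file actually has.

```python
def combine_pairs(lines):
    """Join lines[0] with lines[1], lines[2] with lines[3], and so on."""
    pairs = [l1 + " " + l2 for l1, l2 in zip(lines[::2], lines[1::2])]
    if len(lines) % 2 == 1:
        # zip() stopped before the unpaired last line, so add it back
        pairs.append(lines[-1])
    return pairs


def combine_file_lines(path, encoding="utf-8"):
    # The UTF-8 default is an assumption; pass the real encoding (e.g. "gbk")
    with open(path, encoding=encoding) as f:
        # Strip the trailing newline so it does not end up mid-line
        return combine_pairs([line.rstrip("\n") for line in f])


print(combine_pairs(["line 1", "line 2", "line 3"]))  # ['line 1 line 2', 'line 3']
```

This reads the whole file into memory, which is fine for a few thousand lines.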

You can iterate over your file as follows:

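# Python 2 syntax (print statement, file.next())
with open('./chinese.txt') as my_file:
    for line in my_file:
        try:
            # my_file.next() consumes the following line, so each pass handles one pair;
            # strip both lines so print does not emit a second newline
            print '{} {}'.format(line.strip(), my_file.next().strip())
        except StopIteration:  # Odd number of lines: print the last one alone
            print line.strip()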

A file is an iterator over lines in Python. You could use the grouper() recipe from the itertools documentation to group the lines into pairs:

#!/usr/bin/env python2
from itertools import izip_longest

with open('Chinese.txt') as file:
    for line, another in izip_longest(file, file, fillvalue=''):
        print line.rstrip('\n'), another,

The comma at the end of the print statement is the file.softspace hack: another already ends with a newline, so the trailing comma stops print from adding a second one.

The code keeps only two lines in memory at a time and can therefore handle arbitrarily large files.
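
On Python 3, izip_longest is named zip_longest and print is a function, so the softspace trick no longer applies. A possible streaming equivalent is sketched below; the function name, the file names in the usage note, and the UTF-8 default are assumptions, not part of the answer above.

```python
from itertools import zip_longest


def combine_file(src_path, dst_path, encoding="utf-8"):
    """Write every two lines of src_path as one line of dst_path."""
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        # Passing the same file iterator twice makes zip_longest yield pairs
        for line, another in zip_longest(src, src, fillvalue=""):
            first = line.rstrip("\n")
            # another is "" only when the file has an odd number of lines
            joined = (first + " " + another.rstrip("\n")) if another else first
            dst.write(joined + "\n")
```

A call such as combine_file("chinese.txt", "combined.txt") would then write the joined lines to a second file while still holding only one pair of lines in memory.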
