
Replace a newline with a space only if it isn't followed by another newline (undo hard wrap in text)

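I have a bunch of text files with hard line wrapping (i.e. a newline roughly every 80 characters). I want to undo this and join all these sentences back together, while preserving the newlines that mark a new section or paragraph.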

That is, I'd like to replace '\n' with ' ' if and only if the following character is not another '\n'.

The Python code below does what I want, but it is not very efficient; I would rather do it with a regex and/or sed.

with open(filename) as f: # close the file once it has been read
    s = f.read()
p = s.split('\n\n') # split into paragraphs
p = [x.replace('\n', ' ') for x in p] # iterate all paragraphs, replace \n
s2 = '\n\n'.join(p) # join paragraphs back together

For example:

Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Vivamus porta dui quis aliquet interdum. Sed
in pellentesque libero. Quisque tempus nisl nec
nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum.
Nunc nec tristique magna, non sagittis lacus.
Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et
aliquet purus dignissim. Sed faucibus, lectus in
auctor ornare, dolor libero ultrices sem, vel
iaculis ex nulla quis lacus.

should become:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.

Update

I tried the 5 Python methods below and timed them on a 5 MB text file. To my surprise, all 3 regex methods are an order of magnitude slower than the Python split/replace/join method.

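import re
from itertools import groupby

def m1(s):
    p = s.split('\n\n') # split into paragraphs
    p = [x.replace('\n', ' ') for x in p] # iterate all paragraphs, replace \n
    r = '\n\n'.join(p) # join paragraphs back together
    return r

def m2(s):
    # replace a newline that has no newline directly before or after it
    r = re.sub(r"(?<!\n)\n(?!\n)", " ", s)
    return r

def m3(s):
    # replace a newline that does not start a line and is followed by non-whitespace
    p = re.compile(r'(?<!^)\n(?=\S)', re.MULTILINE)
    r = re.sub(p, " ", s)
    return r

def m4(s):
    # note: s is a str here, so groupby iterates over characters, not lines
    r = "".join(["".join(v) if k else " ".join(map(str.strip, v))+"\n"  for k, v in groupby(s, str.isspace)])
    return r

def repl(m):
    # a single newline becomes a space; longer runs are kept as they are
    return (' ' if len(m.group(1))==1 else m.group(1)) + m.group(2)

def m5(s):
    r = re.sub(r'(\n+)(.)', repl, s)
    return r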

Results (seconds per call):

np.array( timeit.repeat('r=m1(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[4]: array([ 0.01343679,  0.0136183 ,  0.0153013 ,  0.0122381 ,  0.01205051])

np.array( timeit.repeat('r=m2(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[5]: array([ 0.10881839,  0.108728  ,  0.10904381,  0.10862441,  0.10867569])

np.array( timeit.repeat('r=m3(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[6]: array([ 0.1358021 ,  0.1352592 ,  0.13556101,  0.1357465 ,  0.1354876 ])

np.array( timeit.repeat('r=m4(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[7]: array([ 2.51403842,  2.37821078,  2.4169096 ,  2.56688828,  2.36240571])

np.array( timeit.repeat('r=m5(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[8]: array([ 0.16381941,  0.1616353 ,  0.1620033 ,  0.1617353 ,  0.1615443 ])
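Before comparing speed it may be worth checking that the methods agree on the input at hand, since they are not strictly interchangeable. A minimal sketch using m1 and m2 as defined above; the two short input strings are made up for illustration and are not taken from the 5 MB benchmark file:

```python
import re

def m1(s):
    # split into paragraphs, unwrap each one, join them back together
    return '\n\n'.join(p.replace('\n', ' ') for p in s.split('\n\n'))

def m2(s):
    # replace only newlines that have no newline on either side
    return re.sub(r"(?<!\n)\n(?!\n)", " ", s)

wrapped = "one\ntwo\n\nthree\nfour"   # ordinary hard-wrapped text
extra_gap = "one\n\n\ntwo"             # hypothetical input: three newlines in a row

print(m1(wrapped) == m2(wrapped))      # True: both unwrap the ordinary case identically
print(repr(m1(extra_gap)))             # 'one\n\n two': the third newline becomes a space
print(repr(m2(extra_gap)))             # 'one\n\n\ntwo': the whole run is left untouched
```

If the files only ever contain single blank lines between paragraphs, the difference does not matter; otherwise the two functions produce different text.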

With re.sub() you have to use negative lookbehind and lookahead assertions. This will not be very efficient if your input is large.

Lookbehind:

(?<!...)
     Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

Lookahead:

(?!...)
     Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.
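To see why both assertions are needed, here is a small sketch on a made-up six-character string (the name sample is only for illustration). With the lookahead alone, the second newline of a blank line still qualifies, because the character after it is not a newline:

```python
import re

sample = "a\nb\n\nc"

# Lookahead only: the second newline of the blank line is followed by 'c',
# so it gets replaced too and the paragraph break is damaged.
print(repr(re.sub(r"\n(?!\n)", " ", sample)))         # 'a b\n c'

# Lookbehind and lookahead together: only the isolated newline is replaced.
print(repr(re.sub(r"(?<!\n)\n(?!\n)", " ", sample)))  # 'a b\n\nc'
```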

Here is an example:

>>> text = """Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Vivamus porta dui quis aliquet interdum. Sed
in pellentesque libero. Quisque tempus nisl nec
nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum.
Nunc nec tristique magna, non sagittis lacus.
Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et
aliquet purus dignissim. Sed faucibus, lectus in
auctor ornare, dolor libero ultrices sem, vel
iaculis ex nulla quis lacus."""

>>> re.sub(r"(?<!\n)\n(?!\n)", " ", text)
'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.\n\nMauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.\n\nMaecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.'

>>> print(_)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.

You can use awk, like this:

awk '{$1=$1}1' RS='' ORS='\n\n' OFS=' ' file

Explanation:

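  • {$1=$1} looks like it changes nothing. That is true of the content, but the assignment makes awk reassemble the record using the new separators (see below).

  • 1 always evaluates to true; since no action is specified, awk prints the whole current record.

  • RS='': the empty string is a special value for the input record separator. It splits records on blank lines, and newlines also act as field separators.

  • ORS='\n\n' sets the output record separator to a blank line as well.

  • OFS=' ' sets the output field separator to a space.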

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.
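For readers more at home in Python, the steps of the awk one-liner can be approximated as follows. This is only a sketch of the same idea, not a faithful awk emulation; the function name unwrap_paragraph_mode is made up for illustration:

```python
import re

def unwrap_paragraph_mode(text):
    # RS='': records are separated by runs of empty lines; leading and
    # trailing newlines are ignored.
    records = re.split(r"\n{2,}", text.strip("\n"))
    # $1=$1 with OFS=' ': split each record on whitespace, rejoin with one space.
    # ORS='\n\n': every record, including the last, is followed by a blank line.
    return "".join(" ".join(rec.split()) + "\n\n" for rec in records if rec)

print(repr(unwrap_paragraph_mode("one\ntwo\n\nthree\nfour\n")))
# 'one two\n\nthree four\n\n'
```

As with the awk version, runs of spaces inside a line are collapsed to a single space, and the output ends with a blank line.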

You can use groupby, grouping lines by whether they are blank:

from itertools import groupby

with open("test.txt") as f:
    print("".join(["".join(v) if k else " ".join(map(str.strip, v))+"\n"  for k, v in groupby(f, str.isspace)]))

Which gives you:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.
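The groupby approach depends on iterating over lines, which is what a file object yields. If the text is already in a string, iterating it directly yields single characters instead, so the string has to be split into lines first. A minimal sketch under that assumption (the function name unwrap_lines is hypothetical):

```python
from itertools import groupby

def unwrap_lines(text):
    # keepends=True keeps each line's '\n', mirroring what iterating a file yields.
    lines = text.splitlines(keepends=True)
    parts = []
    for is_blank, group in groupby(lines, str.isspace):
        if is_blank:
            parts.append("".join(group))  # keep blank lines exactly as they are
        else:
            # strip each line of the paragraph and join the lines with spaces
            parts.append(" ".join(line.strip() for line in group) + "\n")
    return "".join(parts)

print(repr(unwrap_lines("one\ntwo\n\nthree\nfour\n")))
# 'one two\n\nthree four\n'
```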

I tried using a regex in Python.

Assume the variable text contains the sample text:

import re
# match a newline that is not at the start of a line and is followed by non-whitespace
p = re.compile(r'(?<!^)\n(?=\S)', re.MULTILINE)

result = re.sub(p, " ", text)
print(result)

It prints the following text, with single newlines replaced by spaces.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.

See the demo on regex101.
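Because this pattern requires a non-whitespace character after the newline, it behaves a little differently from a pure "not next to another newline" rule. A short sketch on two made-up inputs (not from the original answer):

```python
import re

pattern = re.compile(r"(?<!^)\n(?=\S)", re.MULTILINE)

# An isolated newline is replaced; the blank line between paragraphs survives,
# because its second newline sits at the start of a line.
print(repr(pattern.sub(" ", "a\nb\n\nc")))   # 'a b\n\nc'

# A newline followed by an indented line, or by the end of the text, is kept.
print(repr(pattern.sub(" ", "a\n  b\n")))     # 'a\n  b\n'
```

Whether keeping indented continuation lines separate is desirable depends on the files being processed.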

Sometimes a complex substitution can be done by passing a function as the second argument to re.sub():

import re

ipsum = '''Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Vivamus porta dui quis aliquet interdum. Sed
in pellentesque libero. Quisque tempus nisl nec
nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum.
Nunc nec tristique magna, non sagittis lacus.
Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et
aliquet purus dignissim. Sed faucibus, lectus in
auctor ornare, dolor libero ultrices sem, vel
iaculis ex nulla quis lacus.
'''

ipsum = re.sub(
    r'(\n+)(?=.)',
    lambda m: ' ' if len(m.group(1))==1 else m.group(1),
    ipsum)

print(ipsum)
