
Remove specific sentences from string

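I have a string in the following format (the sentences containing 3 or more spaces, and the sentences that sit between those sentences, are part of tabular data):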

Some Sentence
Some sentence


Balance at January 1,                                $421            $51
Additions based on tax positions related to the

current year                                                    4        34         9

Additions based on acquisitions                           -       -       2
Additions based on tax positions related to prior

years                                                    21       13     374
Reductions for tax positions of prior years                (54)     (43)      -

Some paragraph
Some paragraph

Balance at January 1,                                $421            $51
Additions based on tax positions related to the

current year                                                    4        34         9

Additions based on acquisitions                           -       -       2
Additions based on tax positions related to prior

years                                                    21       13     374
Reductions for tax positions of prior years                (54)     (43)      -

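I need to remove every sentence that contains 3 or more spaces from the string, while keeping the actual paragraph content.

Below is my approach. It does not give me the exact result, and I also don't like having to use range(5):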

import re

# `result` holds the input string shown above
for i in range(5):
    result = re.sub('[\\n-].* {3,}.*\\n', '', result)
print(result)

Output of my logic:

Some Sentence
Some sentence


Additions based on tax positions related to the
Additions based on tax positions related to prior



Some paragraph
Some paragraph

Additions based on tax positions related to the
Additions based on tax positions related to prior


Expected output:

Some Sentence
Some sentence


Some paragraph
Some paragraph



What else can be done to remove the sentences that sit between the sentences with 3 or more spaces?

sentences = """
Some Sentence
Some sentence


Additions based on tax positions related to the
Additions based on tax positions related to prior



Some paragraph
Some paragraph

Additions based on tax positions related to the
Additions based on tax positions related to prior
"""

# Split the text into individual lines
splitted_sentences = sentences.split('\n')

# Keep only lines with fewer than 3 whitespace-separated words
only_short_sentences = [line for line in splitted_sentences if len(line.split()) < 3]
short_sentences_str = '\n'.join(only_short_sentences)
print(short_sentences_str)

Output:

Some Sentence
Some sentence





Some paragraph
Some paragraph

If you also want to discard empty lines, change the list comprehension to this version:

only_short_sentences = [line for line in splitted_sentences if len(line.split()) < 3 and line]

Is this the expected result?

Edited

Input:

sentences = """
Some Sentence
Some sentence


Balance at January 1,                                $421            $51
Additions based on tax positions related to the

current year                                                    4        34         9

Additions based on acquisitions                           -       -       2
Additions based on tax positions related to prior

years                                                    21       13     374
Reductions for tax positions of prior years                (54)     (43)      -

Some paragraph
Some paragraph

Balance at January 1,                                $421            $51
Additions based on tax positions related to the

current year                                                    4        34         9

Additions based on acquisitions                           -       -       2
Additions based on tax positions related to prior

years                                                    21       13     374
Reductions for tax positions of prior years                (54)     (43)      -
"""

Output:

Some Sentence
Some sentence






Some paragraph
Some paragraph
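
Note that the word-count filter above also drops any genuine paragraph line that has three or more words. A minimal sketch that follows the stated rule more closely is given here, under two assumptions that are not in the original post: a "table row" is any line with a run of 3 or more consecutive spaces between non-space characters, and a "wrapped label" is a non-blank line whose nearest non-blank neighbours on both sides are table rows. The names is_table_row and strip_tables are hypothetical helpers.

```python
import re

# Assumption: a table row has 3+ consecutive spaces between two non-space characters.
TABLE_ROW = re.compile(r"\S {3,}\S")


def is_table_row(line):
    return bool(TABLE_ROW.search(line))


def strip_tables(text):
    lines = text.split("\n")
    # Indices of non-blank lines, so blank lines are skipped when looking at neighbours.
    solid = [i for i, line in enumerate(lines) if line.strip()]
    drop = set()
    for pos, i in enumerate(solid):
        if is_table_row(lines[i]):
            drop.add(i)
            continue
        prev_is_row = pos > 0 and is_table_row(lines[solid[pos - 1]])
        next_is_row = pos + 1 < len(solid) and is_table_row(lines[solid[pos + 1]])
        # Heuristic: a line sandwiched between two table rows is a wrapped table label.
        if prev_is_row and next_is_row:
            drop.add(i)
    return "\n".join(line for i, line in enumerate(lines) if i not in drop)


sample = """Some Sentence
Some sentence

Balance at January 1,                                $421            $51
Additions based on tax positions related to the

current year                                                    4        34         9

Some paragraph
Some paragraph

Additions based on acquisitions                           -       -       2
Additions based on tax positions related to prior

years                                                    21       13     374
"""

cleaned = strip_tables(sample)
print(cleaned)
```

Blank lines are left in place, so they may need collapsing afterwards. The heuristic would also drop a one-line paragraph that happens to sit directly between two tables, so it is only a starting point.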

There is a simple regular expression for this (I have put your input into a file called "test.txt"):

grep -v " .* .* " test.txt

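As you can see, it is just three spaces with ".*" between them, where ".*" stands for "any character, repeated an unknown number of times (possibly zero)".
Oh, I almost forgot: the "-v" inverts the match, so grep prints only the lines that do not match the pattern.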

You evidently know Python's re library, so you can probably work out how to embed this regular expression in your Python code.
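
One possible way to carry the same pattern over to Python is sketched here. The function name grep_v and the sample text are illustrative assumptions. The function mimics "grep -v" by keeping only the lines on which re.search finds no match.

```python
import re

# Same pattern as the grep command: three spaces anywhere on the line,
# with any characters (possibly none) between them.
THREE_SPACES = re.compile(r" .* .* ")


def grep_v(text, pattern=THREE_SPACES):
    """Keep only the lines that do NOT match the pattern, like `grep -v`."""
    return "\n".join(line for line in text.split("\n") if not pattern.search(line))


sample = (
    "Some Sentence\n"
    "Some sentence\n"
    "\n"
    "Balance at January 1,      $421     $51\n"
    "Additions based on tax positions related to the\n"
    "\n"
    "Some paragraph\n"
    "Some paragraph\n"
)

filtered = grep_v(sample)
print(filtered)
```

Because the three spaces do not have to be adjacent, this pattern removes every line that contains at least three spaces in total, so a real paragraph line with four or more words would be removed as well. To match only runs of consecutive spaces, the pattern " {3,}" could be passed instead.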

Good luck
