
How to split text on \r\n with a regex in Python

I have text that looks like this:

1
00:00:01,860 --> 00:00:31,210
Affil of fifth at fat at all the social ball and said, with all this little in the

2
00:00:31,210 --> 00:01:03,060
mid limited and will cost a lot, for want of a lot of it is I never do this or below are the innocent of fat in the annual own none will bit less often were a little the earth the oven for the area of some of them some of the atom in the long will recall the law, will cost you the ball a little less of Odessa and coal rule the Vikings in at a loss

3
00:01:03,980 --> 00:01:33,150
of our lady of one of the will of the wall routing visiting little sign of the limited use of a lot of wind up with a loss of 14 and uncivil will find a site to lop off call them into solid, a London, can we stop go to work as a gay sailor kissing a lot of that scene of the law that on them in this case

4
00:01:33,950 --> 00:02:03,190
will almost a kind wilkinson's, and that a settlement, or the fog collared of the unknown, some would call and all of this was a little, some of us up a lot of letters, union would quit them or not will be or will lend money to zoning and will open the door to that of the novel opens in

5
00:02:04,240 --> 00:02:24,180
it and solidity can cut later with boats can die to only see not open only to six and 0:50 and world go back a at the fat of that at that

I would like to extract ONLY the sentences from the text, such as "Affil of fifth at fat at all the social ball and said, with all this little in the mid limited and will cost a lot, for want of a ...."

The raw text looks like this:

  "1\r\n00:00:01,860 --> 00:00:31,210\r\nAffil of fifth at fat at all the social ball and said, with all this little in the\r\n\r\n2\r\n00:00:31,210 --> 00:01:03,060\r\nmid limited and will cost a lot, for want of a lot of it is I never do this or below are the innocent of fat in the annual own none will bit less often were a little the earth the oven for the area of some of them some of the atom in the long will recall the law, will cost you the ball a little less of Odessa and coal rule the Vikings in at a loss\r\n\r\n3\r\n00:01:03,980 --> 00:01:33,150\r\nof our lady of one of the will of the wall routing visiting little sign of the limited use of a lot of wind up with a loss of 14 and uncivil will find a site to lop off call them into solid, a London, can we stop go to work as a gay sailor kissing a lot of that scene of the law that on them in this case\r\n\r\n4\r\n00:01:33,950 --> 00:02:03,190\r\nwill almost a kind wilkinson's, and that a settlement, or the fog collared of the unknown, some would call and all of this was a little, some of us up a lot of letters, union would quit them or not will be or will lend money to zoning and will open the door to that of the novel opens in\r\n\r\n5\r\n00:02:04,240 --> 00:02:24,180\r\nit and solidity can cut later with boats can die to only see not open only to six and 0:50 and world go back a at the fat of that at that\r\n\r\n"

Looking at the raw text, it seems I could split it on "\r\n", but I do not know how to write the regex.

Why not simply take every fourth line, starting from the third? Then you can join the lines with a space.

text = '''1
00:00:01,860 --> 00:00:31,210
Affil of fifth at fat at all the social ball and said, with all this little in the

2
00:00:31,210 --> 00:01:03,060
mid limited and will cost a lot, for want of a lot of it is I never do this or below are the innocent of fat in the annual own none will bit less often were a little the earth the oven for the area of some of them some of the atom in the long will recall the law, will cost you the ball a little less of Odessa and coal rule the Vikings in at a loss

3
00:01:03,980 --> 00:01:33,150
of our lady of one of the will of the wall routing visiting little sign of the limited use of a lot of wind up with a loss of 14 and uncivil will find a site to lop off call them into solid, a London, can we stop go to work as a gay sailor kissing a lot of that scene of the law that on them in this case

4
00:01:33,950 --> 00:02:03,190
will almost a kind wilkinson's, and that a settlement, or the fog collared of the unknown, some would call and all of this was a little, some of us up a lot of letters, union would quit them or not will be or will lend money to zoning and will open the door to that of the novel opens in

5
00:02:04,240 --> 00:02:24,180
it and solidity can cut later with boats can die to only see not open only to six and 0:50 and world go back a at the fat of that at that'''
t = ' '.join(text.splitlines()[2::4])

Result:

>>> import textwrap
>>> for line in textwrap.wrap(t, width=50):
...     print(line)
...
Affil of fifth at fat at all the social ball and
said, with all this little in the mid limited and
will cost a lot, for want of a lot of it is I
never do this or below are the innocent of fat in
the annual own none will bit less often were a
little the earth the oven for the area of some of
them some of the atom in the long will recall the
law, will cost you the ball a little less of
Odessa and coal rule the Vikings in at a loss of
our lady of one of the will of the wall routing
visiting little sign of the limited use of a lot
of wind up with a loss of 14 and uncivil will find
a site to lop off call them into solid, a London,
can we stop go to work as a gay sailor kissing a
lot of that scene of the law that on them in this
case will almost a kind wilkinson's, and that a
settlement, or the fog collared of the unknown,
some would call and all of this was a little, some
of us up a lot of letters, union would quit them
or not will be or will lend money to zoning and
will open the door to that of the novel opens in
it and solidity can cut later with boats can die
to only see not open only to six and 0:50 and
world go back a at the fat of that at that
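
The [2::4] slice relies on every subtitle block being exactly four lines long (index, timestamp, one caption line, blank line). As a minimal sketch for input where a caption may span several lines, the blocks can be split on blank lines instead. The function name srt_text and the sample string below are made up for illustration; they are not from the question.

```python
def srt_text(raw):
    """Join all caption lines of an SRT-style string into one string."""
    captions = []
    # Normalize CRLF line endings so blocks are separated by "\n\n".
    normalized = raw.replace("\r\n", "\n").strip()
    for block in normalized.split("\n\n"):
        lines = block.splitlines()
        # lines[0] is the index, lines[1] the timestamp, the rest is caption text.
        captions.extend(lines[2:])
    return " ".join(captions)


# Hypothetical sample: the first block has a two-line caption.
sample = (
    "1\r\n00:00:01,860 --> 00:00:31,210\r\nfirst line\r\nsecond line\r\n\r\n"
    "2\r\n00:00:31,210 --> 00:01:03,060\r\nthird line\r\n\r\n"
)
result = srt_text(sample)
print(result)
```

If every block really has a single caption line, the one-line slice above is simpler and does the same job.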

(?<=[\r\n])[a-zA-Z].*

You can use re.findall with this regex. See the demo:

https://regex101.com/r/QIwQ9z/1
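
A minimal sketch of how this might look in Python, using a shortened, made-up sample in place of the full text. One thing to watch for: by default `.` matches `\r`, so on text with `\r\n` line endings each match keeps a trailing `\r`, which is stripped here.

```python
import re

# Shortened sample with CRLF line endings (illustrative, not the full text).
raw = (
    "1\r\n00:00:01,860 --> 00:00:31,210\r\nAffil of fifth at fat\r\n\r\n"
    "2\r\n00:00:31,210 --> 00:01:03,060\r\nmid limited and will cost a lot\r\n\r\n"
)

# A letter that directly follows a line break, then the rest of the line.
pattern = re.compile(r"(?<=[\r\n])[a-zA-Z].*")

# "." does not match "\n" but does match "\r", so drop the trailing "\r".
lines = [m.rstrip("\r") for m in pattern.findall(raw)]
sentence = " ".join(lines)
print(sentence)
```

This pattern only picks up lines that begin with an ASCII letter, so a caption line that starts with a digit or punctuation would be skipped.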

To make this more robust, anchor on the timestamp line and capture the line that follows it:

(?:\d+:){2}\d+,\d+ --> (?:\d+:){2}\d+,\d+[\r\n]+([^\n]+)

See the demo:

https://regex101.com/r/QIwQ9z/2
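
Used from Python, the timestamp-anchored pattern could be applied roughly as follows. This is a sketch on a shortened, made-up sample; the second caption deliberately starts with a digit to show that it is still captured. Since the pattern has one capturing group, re.findall returns only the captured caption text.

```python
import re

# Shortened sample with CRLF line endings (illustrative, not the full text).
raw = (
    "1\r\n00:00:01,860 --> 00:00:31,210\r\nAffil of fifth at fat\r\n\r\n"
    "2\r\n00:00:31,210 --> 00:01:03,060\r\n14 and uncivil will find a site\r\n\r\n"
)

# Timestamp line, one or more line-break characters, then the caption line.
pattern = re.compile(
    r"(?:\d+:){2}\d+,\d+ --> (?:\d+:){2}\d+,\d+[\r\n]+([^\n]+)"
)

# [^\n]+ also consumes the "\r" before "\n", so strip it from each capture.
captions = [c.rstrip("\r") for c in pattern.findall(raw)]
text = " ".join(captions)
print(text)
```

Like the slice-based answer, this captures only the first caption line after each timestamp.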
