Find certain/specific line-breaks while ignoring others
I have some SRT data that is coming back with \r and \n characters as line breaks in the middle of each sentence. How do I find only those \r and \n characters in the middle of the text/sentences, and NOT the other ones that signify other line breaks?
Example source:
18
00:00:50,040 --> 00:00:51,890
All the women gather
at the hair salon,
19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters
and they dye their hair orange.
Desired output:
18
00:00:50,040 --> 00:00:51,890
All the women gather at the hair salon,
19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters and they dye their hair orange.
I am absolute crap at regex, but my best guess (to no avail) was something like
var reg = /[\d\r][a-zA-z0-9\s+]+[\r]/
And then split() on that to remove any \r in the middle of one of the values. I am sure that is not even close to the right way so...stackoverflow!! :)
This will match the line breaks you want to get rid of, capturing the character before and after each one so that those two can be put back around a space:
var regex = /([a-z,.;:'"])(?:\r\n?|\n)([a-z])/gi;
str = str.replace(regex, '$1 $2');
Some notes on the regular expression. I used the modifiers i and g to make it case-insensitive and to find all line breaks in your string instead of stopping after the first one. It assumes that removable line breaks can occur after a letter, comma, period, semicolon, colon, or single or double quote, and before another letter. As @nnnnnn mentioned in a comment above, this will not cover all possible sentences, but it should at least not choke on most punctuation. The line break has to be a single line break, but it is platform-independent (it can be \r, \n or \r\n). I capture both the character before the line break and the letter after it (with parentheses), so I can refer to them in the replacement string as $1 and $2. That is basically all there is to it.
This regex should do the trick:
/(\d+\r\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\r)([^\r]+)\r([^\r]+)(\r|$)/g
To make this work with more lines (it has to be a fixed number), just add more ([^\r]+)\r groups. Remember to also add the matching $n references to the replacement string; for 3 lines it becomes '$1$2 $3 $4\r'.
mystring = mystring.replace(/(\d+\r\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\r)([^\r]+)\r([^\r]+)(\r|$)/g, '$1$2 $3\r');
Works fine!
Input:
18
00:00:50,040 --> 00:00:51,890
All the women gather
at the hair salon,
19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters
and they dye their hair orange.
Output:
18
00:00:50,040 --> 00:00:51,890
All the women gather at the hair salon,
19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters and they dye their hair orange.
Doesn't work: more than 2 lines
Input:
18
00:00:50,040 --> 00:00:51,890
All the women gather
at the hair salon,
and they just talk
19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters
and they dye their hair orange.
Except for Maria who dyes it pink.
Output:
18
00:00:50,040 --> 00:00:51,890
All the women gather at the hair salon,
and they just talk
19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters and they dye their hair orange.
Except for Maria who dyes it pink.
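If the number of text lines per cue varies, one alternative to a fixed-count regex is to walk the data line by line. The sketch below is not part of either answer. The helper name joinCueLines is hypothetical, and it assumes that a cue starts with a digits-only line immediately followed by a timestamp line, and that everything up to the next cue header (or a blank line) is subtitle text.

```javascript
// Assumed timestamp format, taken from the sample data in the question.
const TIMESTAMP = /^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$/;

// Hypothetical helper: joins all text lines of each cue into a single line.
// Accepts \r, \n or \r\n as input line breaks; the output always uses \n.
function joinCueLines(srt) {
  const lines = srt.split(/\r\n?|\n/);
  const out = [];
  let text = []; // text lines of the cue currently being read

  const flush = () => {
    if (text.length > 0) {
      out.push(text.join(' '));
      text = [];
    }
  };

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const startsCue =
      /^\d+$/.test(line) &&
      i + 1 < lines.length &&
      TIMESTAMP.test(lines[i + 1]);

    if (startsCue) {
      flush();
      out.push(line, lines[i + 1]);
      i++; // skip the timestamp line that was just emitted
    } else if (line.trim() === '') {
      flush();
      out.push(line); // keep blank separator lines as they are
    } else {
      text.push(line);
    }
  }
  flush();
  return out.join('\n');
}
```

Called on the three-line input above, this should give one text line per cue regardless of how many lines each cue had. A subtitle line that consists only of digits and is directly followed by a timestamp would be mistaken for a cue header, which is the price of the header assumption.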