Matching .srt file subtitle lines and timestamps with regex

As the title states, I want to match the timestamps and text lines of the subtitles in an .srt file.

Some of these files are not formatted properly, so I need something that works with almost all of them.

The correct formatting of a file looks like this:

1
00:00:02,160 --> 00:00:04,994
You really don't remember
what happened last year?

2
00:00:06,440 --> 00:00:07,920
- School. Now.
- I dropped out.

3
00:00:08,120 --> 00:00:10,510
- Get your diploma, I'll get mine.
- What you doing?

4
00:00:10,680 --> 00:00:13,514
- Studying.
- You taking your GED? All right, Fi.

The regex pattern that I came up with works very well for this kind of file.

As I said, some of the files are not formatted properly: some of them don't have the line number, some of them don't have an empty line after each subtitle, and the regex that I came up with does not work properly for those.

There are other questions like this that have already been answered, but I want to match each timestamp and the text lines in separate capturing groups, so my groups for the first subtitle of the example above would be something like this:

group 1: 00:00:02,160

group 2: 00:00:04,994

group 3: You really don't remember\nwhat happened last year?

This is what I've got so far:

LINE_RE = (
    # group 1: optional leading whitespace, then a timestamp like 00:00:00,000
    r"^\s*(\d+:\d+:\d+,\d+)"
    # non-capturing group for ' --> ': two or three '-' followed by a '>'
    r"(?:\s*-{2,3}>\s*)"
    # group 2: the timestamp format again, then optional whitespace and a \n
    r"(\d+:\d+:\d+,\d+)\s*\n"
    # group 3: any characters, until it hits an empty line, a timestamp,
    # or a line with only a number in it
    r"([\s\S]*?(?:^\s*$|\d+:\d+:\d+,\d+|^\s*\d+\s*\n))"
)

I think my exact problem is in the last non-capturing group: it does not work properly when the next line is not an empty line.

This is an example file; I mangled it a bit so that it shows the problem better.

In that case, you can match a line that starts with a timestamp-like pattern, and then capture all following lines, stopping before the next timestamp-like line or before one or more empty lines followed by a line that contains only digits.

^\s*(\d+:\d+:\d+,\d+)[^\S\n]+-->[^\S\n]+(\d+:\d+:\d+,\d+)((?:\n(?!\d+:\d+:\d+,\d+\b|\n+\d+$).*)*)

The pattern in parts matches:

  • ^ Start of the line (assuming the multiline flag is enabled)
  • \s* Match optional whitespace chars
  • (\d+:\d+:\d+,\d+) Capture group 1, match a timestamp-like pattern
  • [^\S\n]+-->[^\S\n]+ Match --> between 1 or more whitespace chars other than a newline
  • (\d+:\d+:\d+,\d+) Capture group 2, same pattern as for group 1
  • ( Capture group 3
    • (?: Non-capture group
      • \n Match a newline
      • (?! Negative lookahead, assert that what is to the right is not
        • \d+:\d+:\d+,\d+\b|\n+\d+$ Either a timestamp, or 1+ newlines followed by a line of only digits
      • ) Close the lookahead
      • .* Match the whole line
    • )* Close the non-capture group and optionally repeat it
  • ) Close group 3

See a regex demo.
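For use from Python, here is a minimal sketch of how the pattern above could be applied. It assumes the whole file has already been read into one string with `\n` line endings. The names `SUBTITLE_RE`, `parse_srt` and `sample` are made up for this illustration. The sample text deliberately drops the line number and the empty line before the third subtitle to imitate a badly formatted file.

```python
import re

# The pattern from the answer, compiled with re.MULTILINE so that ^ and $
# anchor at line boundaries instead of only at the ends of the string.
SUBTITLE_RE = re.compile(
    r"^\s*(\d+:\d+:\d+,\d+)[^\S\n]+-->[^\S\n]+(\d+:\d+:\d+,\d+)"
    r"((?:\n(?!\d+:\d+:\d+,\d+\b|\n+\d+$).*)*)",
    re.MULTILINE,
)


def parse_srt(text):
    """Return a list of (start, end, text) tuples found in `text`."""
    return [
        # Group 3 begins with the newline that follows the timestamp line
        # and may end with a trailing newline, so strip it.
        (m.group(1), m.group(2), m.group(3).strip())
        for m in SUBTITLE_RE.finditer(text)
    ]


# Hypothetical input: the third subtitle has no line number and no empty
# line before it.
sample = """1
00:00:02,160 --> 00:00:04,994
You really don't remember
what happened last year?

2
00:00:06,440 --> 00:00:07,920
- School. Now.
- I dropped out.
00:00:08,120 --> 00:00:10,510
- Get your diploma, I'll get mine.
- What you doing?
"""

for start, end, text in parse_srt(sample):
    print(start, end, repr(text))
```

Group 3 is captured together with its leading newline, which is why the sketch calls `strip()` on it before returning the text.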
