
Python fails to recognize a digit

I have an input file like the one below, and my program is supposed to remove everything except the Hindi text.

1
00:00:10,240 --> 00:00:13,824
विकास नाम का एक गरीब मजदूर था

2
00:00:14,592 --> 00:00:15,360
जो सेठ

3
00:00:15,616 --> 00:00:16,896
भीमसेन के यहां

Here is my program:

#!/usr/bin/python
# -*- coding:utf-8 -*-

import sys
import re
import string
import codecs

def del_brackets(s):
        a = re.compile(r'\<.*?\>')
        result = a.sub('', s)
        return result.strip('\n').strip()

def del_brackets2(s):
        a = re.compile(r'\[.*?\]')
        result = a.sub('', s)
        return result.strip('\n').strip()

def del_brackets3(s):
        a = re.compile(r'\{.*?\}')
        result = a.sub('', s)
        return result.strip('\n').strip()

def del_brackets4(s):
        a = re.compile(r'\(.*?\)')
        result = a.sub('', s)
        return result.strip('\n').strip()

with open(sys.argv[1], 'r') as f:
    lines = f.readlines()

outfile = open(sys.argv[1].replace('.srt', '.txt'), 'w')

exclude = set('♪"#$%&\()*+-/:<=>@[\\]^_`{|}')
for line in lines:
#   print(repr(line))
    line = line.strip()
    #line = unicode(line.strip('\n'), 'utf-8')
    if len(line.strip()) != 0 and line != 1 and line != "1":
        if (not line.isdigit()) and ('-->' not in line):
            line = del_brackets(line)
            line = del_brackets2(line)
            line = del_brackets3(line)
            line = del_brackets4(line)
            line = ' '.join(''.join(' ' if ch in exclude else ch for ch in line).split())
            line = re.sub(r'\.\.\.', ' ', line)
            outfile.write(line.lstrip() + "\n")

outfile.close()

The expected output is:

विकास नाम का एक गरीब मजदूर था
जो सेठ
भीमसेन के यहां

However, my program doesn't recognize the digit on the first line, and instead it returns:

1
विकास नाम का एक गरीब मजदूर था
जो सेठ
भीमसेन के यहां

Why doesn't the program recognize the digit when I specifically check for 1 or "1"?

Using a regex, we can create a simple expression that covers the three kinds of line you want to ignore:

  1. timestamp line
  2. number line
  3. empty line
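
As a minimal sketch of such an expression, each of the three cases can be checked against a single line. The names `STAMP`, `IGNORE`, and `is_ignorable` are illustrative only and are not part of the original answer.

```python
import re

# One SRT timestamp, e.g. 00:00:10,240
STAMP = r'[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}'
# A timestamp range, a bare number, or nothing but whitespace.
IGNORE = re.compile(rf'(?:{STAMP}\s-->\s{STAMP}|[0-9]+|\s*)')

def is_ignorable(line):
    """Return True if the whole line is a timestamp, a counter, or blank."""
    return IGNORE.fullmatch(line.rstrip('\r\n')) is not None

for line in ['1\n', '00:00:10,240 --> 00:00:13,824\n', '\n', 'जो सेठ\n']:
    print(repr(line), is_ignorable(line))
```

Only the last line, which contains actual subtitle text, should be reported as not ignorable.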

From there we can use Python's built-in filter function to filter out all of the undesired lines, and write the remaining lines to the output file.

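import re
import sys

def pruneSRTtoTXT(fn):
    fn2    = fn.replace('.srt', '.txt')
    # One SRT timestamp, e.g. 00:00:10,240
    stamp  = r'[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}'
    # Raw f-string so that \s reaches the regex engine unchanged.
    # Matches a timestamp line, a number-only line, or an empty line.
    ignore = re.compile(rf'^({stamp}\s-->\s{stamp}|[0-9]+|[\r\n]+)$', re.M)

    with open(fn, 'r') as f, open(fn2, 'w') as f2:
        # Keep only the lines that the pattern does not match.
        f2.writelines(filter(lambda l: not ignore.search(l), f.readlines()))

pruneSRTtoTXT(sys.argv[1])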
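
Neither the question nor the answer says why the first `1` slips through. The comparison `line != 1` is always true, because `line` is a string and never equals an integer, but that alone does not explain the output. One common cause is that the .srt file starts with a UTF-8 byte order mark (BOM). This is an assumption about the input file, not a confirmed diagnosis. The sketch below builds such input in memory and shows that the BOM survives `strip()` and defeats both `isdigit()` and the comparison with `"1"`.

```python
import io

# Assumption: the file begins with a UTF-8 BOM (bytes EF BB BF).
raw = '1\n00:00:10,240 --> 00:00:13,824\n'.encode('utf-8-sig')

# Decoded as plain UTF-8, the BOM stays in the text as an invisible '\ufeff'.
first_plain = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8').readline().strip()
# Decoded as 'utf-8-sig', a leading BOM is dropped if one is present.
first_sig = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8-sig').readline().strip()

print(repr(first_plain), first_plain.isdigit(), first_plain != "1")
print(repr(first_sig), first_sig.isdigit())
```

If this is what is happening, opening the file in Python 3 with `open(path, encoding='utf-8-sig')` should make both the `isdigit()` check and the regex treat the first line as a number. Uncommenting the `print(repr(line))` line in the question's program would show whether the first line really carries the extra character.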
