I have an SRT file such as this:
355
00:52:44,533 --> 00:52:51,467
Og så er der selvfølgelig masser af valg både her på <initial> P </initial> et og på nettet og på <initial> DR </initial> et i løbet af dagen og i aften. Godt valg.
356
S1 00:52:54,733 --> 00:53:01,933
Du kan finde alle <initial> P </initial> et programmer på dr punktum dk skråstreg <initial> P </initial> et. Det giver mening.
S1 is a speaker ID, but not every section of my SRT file has one. When a section has no speaker ID, I would like to leave the speaker out of my CSV file.
However, my code below adds a speaker ID anyway, even when there is no S#.
For example, S4 below:
filename.csv;211.03300000000002;218.833;S4;Det at at beslutte sig er jo ikke kun at beslutte sig for hvilket parti det jo også først og fremmest beslutte sig om vil man stemme, vil man ikke stemme, det vil de fleste jo så.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import re
import csv

SRTFILE = sys.argv[1]
CSVFILE = re.sub(r'\.srt$', '.csv', SRTFILE)
BASEFILE = re.sub(r'\.srt$', '', SRTFILE)
if CSVFILE == SRTFILE:
    sys.exit('check the srt suffix')

with open(SRTFILE, 'r') as fid:
    lines = fid.readlines()

newLine = False
transcript = []
captionStart = False
speaker = ''
t1 = 0
t2 = 0
for line in lines:
    line = line.strip()
    if re.match(r'^[0-9]+$', line):
        newLine = True
        continue
    if re.match(r'^$', line):
        if captionStart and len(transcript) > 0:
            continue
            print '%s;%1.3f;%1.3f;%s;;%s'%(BASEFILE, t1, t2, speaker, ' '.join(transcript))
        newLine = False
        transcript = []
        continue
    matchobj = re.match(r'^([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3}) +--> +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3})$', line)
    if matchobj:
        t1 = int(matchobj.group(1))*3600.0 + int(matchobj.group(2))*60.0 + float(re.sub(r',', '.', matchobj.group(3)))
        t2 = int(matchobj.group(4))*3600.0 + int(matchobj.group(5))*60.0 + float(re.sub(r',', '.', matchobj.group(6)))
        captionStart = True
        continue
    else:
        matchobj = re.match(r'^([a-zA-Z0-9]+) +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3}) +--> +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3})$', line)
        if matchobj:
            t1 = int(matchobj.group(2))*3600.0 + int(matchobj.group(3))*60.0 + float(re.sub(r',', '.', matchobj.group(4)))
            t2 = int(matchobj.group(5))*3600.0 + int(matchobj.group(6))*60.0 + float(re.sub(r',', '.', matchobj.group(7)))
            speaker = matchobj.group(1)
            captionStart = True
            continue
    if newLine:
        transcript.append(line)
        if speaker:
            print(CSVFILE, t1, t2, speaker, line)
        if speaker:
            new_list = [CSVFILE, t1, t2, speaker, line]
            print(CSVFILE, t1, t2, speaker, line)
            with open(CSVFILE, 'a') as fid:
                writer = csv.writer(fid, delimiter=';')
                writer.writerow(new_list)
        else:
            print(CSVFILE, t1, t2, line)
            new_list = [CSVFILE, t1, t2, speaker,'\;', line]
            with open(CSVFILE, 'a') as fid:
                writer = csv.writer(fid, delimiter=';')
                writer.writerow(new_list)
Please let me know how to fix this.
(I apologize for the following question; you can just ignore it.) I also have a simple question. I would like to format my CSV file as follows:
filename;starttime;endtime;speaker;;transcripts
where there are two semicolons before transcripts (line in my code). I tried
new_list = [CSVFILE, t1, t2, speaker, ";",line]
in my code, but it adds quotation marks around the semicolon.
How do I achieve filename;starttime;endtime;speaker;;transcripts, with ;; before line?
"I would like to not put Speaker in my csv file. however my code below adds the speaker id anyway although if there is no S#"
You're explicitly adding the speaker when you assemble your row:
new_list = [CSVFILE, t1, t2, speaker,'\;', line]
So just don't do that. When a row gets a speaker even though that section has none, it is because the code is reusing the last value assigned to speaker. After each chunk, you should reset that variable: speaker = None.
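To illustrate the reset, here is a minimal sketch rather than a drop-in fix for the script above. It is written for Python 3 (the question's code uses Python 2 print statements), assumes sections are separated by blank lines as in a standard SRT file, and the names TIMING, to_seconds and parse_cues are hypothetical helpers introduced only for this example.

```python
import re

# Hypothetical pattern: a timing line with an optional speaker id in front.
TIMING = re.compile(
    r'^(?:(?P<speaker>[A-Za-z0-9]+) +)?'
    r'(?P<start>\d\d:\d\d:\d\d[,.]\d{2,3}) +--> +'
    r'(?P<end>\d\d:\d\d:\d\d[,.]\d{2,3})$'
)


def to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' (or with '.') to seconds as a float."""
    hours, minutes, seconds = stamp.replace(',', '.').split(':')
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_cues(lines):
    """Yield (start, end, speaker, text); speaker is None when absent."""
    start = end = speaker = None
    text = []
    for raw in lines:
        line = raw.strip()
        match = TIMING.match(line)
        if match:
            # Assigning from the match on every timing line means a
            # section without an id gets None, not the previous speaker.
            speaker = match.group('speaker')
            start = to_seconds(match.group('start'))
            end = to_seconds(match.group('end'))
            text = []
        elif not line:
            # A blank line closes the section; emit it and reset state.
            if start is not None and text:
                yield (start, end, speaker, ' '.join(text))
            start = end = speaker = None
            text = []
        elif start is not None:
            # Lines seen before a timing line (the index) are skipped.
            text.append(line)
    # Emit the last section if the file does not end with a blank line.
    if start is not None and text:
        yield (start, end, speaker, ' '.join(text))
```

The key point is that speaker is reassigned for every section instead of being set only when an id happens to be present.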
"how do I achieve filename;starttime;endtime;speaker;;transcripts with ;;before line"
Two delimiters side by side mean that there is an empty field, so just put a None in the appropriate place in your list of fields: [filename, starttime, endtime, speaker, None, transcripts]
An empty string would also work.
But I thought you were trying to remove the speaker field. So wouldn't it be [filename, starttime, endtime, None, transcripts]?
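As a rough sketch of the empty-field idea, the snippet below builds one row with csv.writer and a None in the slot before the transcript. It is Python 3, format_row is a hypothetical helper name, and the '%.3f' formatting of the times is an extra assumption (it avoids values like 211.03300000000002), not something the answer requires.

```python
import csv
import io


def format_row(filename, start, end, speaker, transcript):
    """Return one ';'-delimited line with an empty field before the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
    # csv.writer writes None (or '') as an empty field, which gives ';;'.
    # Passing speaker=None leaves the speaker column empty as well.
    writer.writerow(
        [filename, '%.3f' % start, '%.3f' % end, speaker, None, transcript]
    )
    return buffer.getvalue()
```

With speaker 'S4' the row keeps the speaker and has ;; before the transcript; with speaker None the speaker column is simply empty. Putting a literal ';' in the list does not work because the writer quotes any field that contains the delimiter, which is where the quotation marks in the question come from.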