Ignoring an element when appending to a csv file in Python

I have an SRT file such as this:

355
00:52:44,533 --> 00:52:51,467
Og så er der selvfølgelig masser af valg både her på <initial> P </initial> et og på nettet og på <initial> DR </initial> et i løbet af dagen og i aften. Godt valg.

356
S1 00:52:54,733 --> 00:53:01,933
Du kan finde alle <initial> P </initial> et programmer på dr punktum dk skråstreg <initial> P </initial> et. Det giver mening.

S1 is a speaker id, but not every section of my SRT file has one. When a section has no speaker id, I would like to leave the speaker out of my CSV file.

However, my code below adds a speaker id anyway, even when the section has no S#.

For example, S4 in the row below:

filename.csv;211.03300000000002;218.833;S4;Det at at beslutte sig er jo ikke kun at beslutte sig for hvilket parti det jo også først og fremmest beslutte sig om vil man stemme, vil man ikke stemme, det vil de fleste jo så.

My script:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
import re
import csv

SRTFILE = sys.argv[1]
CSVFILE = re.sub(r'\.srt$', '.csv', SRTFILE)
BASEFILE = re.sub(r'\.srt$', '', SRTFILE)

if CSVFILE == SRTFILE:
    sys.exit('check the srt suffix')

with open(SRTFILE, 'r') as fid:
    lines = fid.readlines()

newLine = False
transcript = []
captionStart = False
speaker = ''
t1 = 0
t2 = 0
for line in lines:
    line = line.strip()
    if re.match(r'^[0-9]+$', line):
        newLine = True
        continue
    if re.match(r'^$', line):
        if captionStart and len(transcript) > 0:
            continue
            print '%s;%1.3f;%1.3f;%s;;%s'%(BASEFILE, t1, t2, speaker, ' '.join(transcript))
        newLine = False
        transcript = []
        continue
    matchobj = re.match(r'^([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3}) +--> +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3})$', line)
    if matchobj:
        t1 = int(matchobj.group(1))*3600.0 + int(matchobj.group(2))*60.0 + float(re.sub(r',', '.', matchobj.group(3)))
        t2 = int(matchobj.group(4))*3600.0 + int(matchobj.group(5))*60.0 + float(re.sub(r',', '.', matchobj.group(6)))
        captionStart = True
        continue
    else:
        matchobj = re.match(r'^([a-zA-Z0-9]+) +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3}) +--> +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3})$', line)
        if matchobj:
            t1 = int(matchobj.group(2))*3600.0 + int(matchobj.group(3))*60.0 + float(re.sub(r',', '.', matchobj.group(4)))
            t2 = int(matchobj.group(5))*3600.0 + int(matchobj.group(6))*60.0 + float(re.sub(r',', '.', matchobj.group(7)))
            speaker = matchobj.group(1)
            captionStart = True
            continue
    if newLine:
        transcript.append(line)
    if speaker:
        print(CSVFILE, t1, t2, speaker, line)
        if speaker:
                new_list = [CSVFILE, t1, t2, speaker, line]
                print(CSVFILE, t1, t2, speaker, line)
                with open(CSVFILE, 'a') as fid:
                    writer = csv.writer(fid, delimiter=';')
                    writer.writerow(new_list)
    else:
        print(CSVFILE, t1, t2, line)
            new_list = [CSVFILE, t1, t2, speaker,'\;', line]
            with open(CSVFILE, 'a') as fid:
                writer = csv.writer(fid, delimiter=';')
                writer.writerow(new_list)

Please let me know how to fix this.

I also have a simple follow-up question (feel free to ignore it). I would then like to format my CSV file as follows:

filename;starttime;endtime;speaker;;transcripts

That is, with two semicolons before the transcript (line in my code). I tried

new_list = [CSVFILE, t1, t2, speaker, ";",line]

in my code, but the writer puts quotation marks around the semicolon.

how do I achieve filename;starttime;endtime;speaker;;transcripts with ;;before line

I would like to not put Speaker in my csv file. however my code below adds the speaker id anyway although if there is no S#

You're explicitly adding the speaker when you assemble your row:

new_list = [CSVFILE, t1, t2, speaker,'\;', line]

So just don't do that. When it adds a speaker even though there isn't one, it is reusing the last value assigned to speaker: the variable is set when a time line carries an id and is never cleared afterwards. After each chunk, you should reset that variable: speaker = None.
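
A minimal sketch of that reset, written for Python 3. The names parse_srt, to_seconds and TIME_RE are hypothetical helpers, not part of the original script, and the pattern assumes the speaker id, when present, is a single alphanumeric token in front of the time range:

```python
import re

# Optional speaker id (e.g. "S1"), then the usual SRT time range.
TIME_RE = re.compile(
    r'^(?:(?P<speaker>[A-Za-z0-9]+) +)?'
    r'(?P<start>\d\d:\d\d:\d\d[,.]\d{2,3}) +--> +'
    r'(?P<end>\d\d:\d\d:\d\d[,.]\d{2,3})$'
)

def to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' (or 'HH:MM:SS.mmm') to seconds."""
    hours, minutes, seconds = stamp.replace(',', '.').split(':')
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def parse_srt(lines):
    """Yield (start, end, speaker, text) per caption; speaker is None if absent."""
    start = end = speaker = None
    text = []
    for raw in list(lines) + ['']:  # the trailing '' flushes the last caption
        line = raw.strip()
        if not line:
            if start is not None and text:
                yield start, end, speaker, ' '.join(text)
            # Reset per-caption state so a speaker never leaks into the next chunk.
            start = end = speaker = None
            text = []
            continue
        match = TIME_RE.match(line)
        if match:
            start = to_seconds(match.group('start'))
            end = to_seconds(match.group('end'))
            speaker = match.group('speaker')  # None when the line has no id
        elif start is not None:
            text.append(line)
        # Anything before the time line (the caption index) is ignored.
```

Because the state is cleared at every blank line, a caption without an id comes out with speaker set to None even when the previous caption had one.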

how do I achieve filename;starttime;endtime;speaker;;transcripts with ;;before line

Two delimiters side by side mean that there is an empty field, so just put a None in the appropriate place in your list of fields: [filename, starttime, endtime, speaker, None, transcripts]. An empty string would also work.
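
As a rough illustration in Python 3 (format_row is a hypothetical helper, and the three-decimal time format is an assumption taken from the print statement in the question): csv.writer writes None as an empty field, which produces the doubled delimiter. A field that itself contains the delimiter, such as ";", gets quoted under the default quoting mode, which is where the quotation marks in the question came from.

```python
import csv
import io

def format_row(filename, start, end, speaker, text):
    """Return one ';'-delimited CSV line with an empty field before the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
    # None (or '') is written as an empty field, giving ';;' before the text.
    writer.writerow([filename, '%.3f' % start, '%.3f' % end, speaker, None, text])
    return buffer.getvalue()

print(format_row('filename.csv', 211.033, 218.833, 'S4', 'Det giver mening.'), end='')
print(format_row('filename.csv', 3164.533, 3171.467, None, 'Godt valg.'), end='')
```

Passing None as the speaker leaves that field empty as well, so a caption without an id simply shows three semicolons in a row while the column count stays the same.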

But I thought you were trying to remove the speaker field. So wouldn't it be [filename, starttime, endtime, None, transcripts]?
