
Use the Python xlsxwriter module to write srt data into Excel

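This time, I am trying to use Python's xlsxwriter module to write data from a .srt file into Excel.

The subtitle file looks like this:

But I want to write the data into Excel so that it looks like this:

This is my first time writing Python code for this, so I am still at the trial-and-error stage... I tried writing code like the following

But I don't think it makes sense...

I will keep trying, but if you know how to do it, please let me know. I will read your code and try to understand it! Thank you! :)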

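The following breaks the problem into a few parts:

  • Parsing the input file. parse_subtitles is a generator that takes a source of lines and yields a sequence of records of the form {'index': 'N', 'timestamp': 'NN:NN:NN,NNN --> NN:NN:NN,NNN', 'subtitles': ['TEXT', ...]}. The approach I took is to track which of three distinct states we are in:
    1. seeking to next entry, while we are looking for the next index number, which should match the regular expression ^\d*$ (nothing but a bunch of digits);
    2. looking for timestamp, once an index has been found and we expect the timestamp on the next line, which should match the regular expression ^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$ (HH:MM:SS,mmm --> HH:MM:SS,mmm); and
    3. reading subtitles, while consuming the actual subtitle text, with a blank line or EOF interpreted as the end of the subtitle.
  • Writing the above records to rows in a worksheet. write_dict_to_worksheet accepts a dictionary that defines the 0-indexed Excel column number for each of the record's keys, a record, a worksheet and a starting row; it writes the data accordingly and returns the next free row.
  • Organizing the overall conversion. convert accepts an input filename (e.g. 'Wildlife.srt', which is opened and passed to parse_subtitles) and an output filename (e.g. 'Subtitle.xlsx', which is created with xlsxwriter). It writes a header row and then, for each record parsed from the input file, writes that record to the XLSX file.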
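
As a quick check of the two patterns described above, the following sketch runs them against a few sample lines. The sample lines and the classify helper are made up for illustration and are not taken from the original subtitle file or answer.

```python
import re

# The same patterns the parser uses, written as raw strings
line_index = re.compile(r'^\d*$')
line_timestamp = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')

# Hypothetical sample lines, invented for this illustration
samples = [
    '12',                               # an index line
    '00:00:01,000 --> 00:00:04,500',    # a well-formed timestamp
    '00;00:01,000 --> 00:00:04,500',    # a typo: ';' instead of ':'
    'Hello there',                      # ordinary subtitle text
]

def classify(line):
    """Return which kind of line the patterns recognize (hypothetical helper)."""
    if line_timestamp.match(line):
        return 'timestamp'
    if line_index.match(line):
        return 'index'
    return 'other'

results = [classify(s) for s in samples]

if __name__ == '__main__':
    for sample, kind in zip(samples, results):
        print('{k:<9} {s!r}'.format(k=kind, s=sample))
```

Note that ^\d*$ also matches an empty line, because * allows zero digits; ^\d+$ would be a stricter choice if blank lines should never count as an index.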

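The logging statements are left in as a form of self-commentary, and because, while reproducing your input file, I mistyped a : as a ; in one timestamp, which made it unrecognizable; having the error pop up was very handy for debugging!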
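
The debug and info messages only appear once logging is configured, because the default threshold of Python's root logger is WARNING. A minimal sketch of switching them on follows; enable_parser_logging is a hypothetical helper name that is not part of the original answer, and the force argument assumes Python 3.8 or later.

```python
import logging
import sys

def enable_parser_logging(level=logging.DEBUG, stream=sys.stderr):
    """Hypothetical helper: show every message at or above `level` on `stream`."""
    logging.basicConfig(
        level=level,
        stream=stream,
        format='%(levelname)s: %(message)s',
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )

if __name__ == '__main__':
    enable_parser_logging()
    logging.debug('Found index: {i}'.format(i='1'))
    logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(
        d='00;00:01,000 --> 00:00:04,500'))
```

Calling the helper once before the conversion starts should be enough; passing level=logging.ERROR instead would keep only the HUH messages.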

I have put a text version of your source file, together with the code below, in this Gist.

import xlsxwriter
import re
import logging

def parse_subtitles(lines):
    # Raw strings keep the regex backslashes (\d, \s) from being read as string escapes
    line_index = re.compile(r'^\d*$')
    line_timestamp = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
    line_seperator = re.compile(r'^\s*$')

    current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
    state = 'seeking to next entry'

    for line in lines:
        line = line.strip('\n')
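        if state == 'seeking to next entry':
            if line_seperator.match(line):
                # Skip extra blank lines between entries; otherwise the index
                # pattern (which also matches an empty string) would accept them
                continue
            if line_index.match(line):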
                logging.debug('Found index: {i}'.format(i=line))
                current_record['index'] = line
                state = 'looking for timestamp'
            else:
                logging.error('HUH: Expected to find an index, but instead found: [{d}]'.format(d=line))

        elif state == 'looking for timestamp':
            if line_timestamp.match(line):
                logging.debug('Found timestamp: {t}'.format(t=line))
                current_record['timestamp'] = line
                state = 'reading subtitles'
            else:
                logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))

        elif state == 'reading subtitles':
            if line_seperator.match(line):
                logging.info('Blank line reached, yielding record: {r}'.format(r=current_record))
                yield current_record
                state = 'seeking to next entry'
                current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
            else:
                logging.debug('Appending to subtitle: {s}'.format(s=line))
                current_record['subtitles'].append(line)

        else:
            logging.error('HUH: Fell into an unknown state: `{s}`'.format(s=state))
    if state == 'reading subtitles':
        # We must have finished the file without encountering a blank line. Dump the last record
        yield current_record

def write_dict_to_worksheet(columns_for_keys, keyed_data, worksheet, row):
    """
    Write a subtitle-record to a worksheet. 
    Return the row number after those that were written (since this may write multiple rows)
    """
    current_row = row
    #First, horizontally write the entry and timecode
    for (colname, colindex) in columns_for_keys.items():
        if colname != 'subtitles': 
            worksheet.write(current_row, colindex, keyed_data[colname])

    #Next, vertically write the subtitle data
    subtitle_column = columns_for_keys['subtitles']
    for morelines in keyed_data['subtitles']:
        worksheet.write(current_row, subtitle_column, morelines)
        current_row+=1

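    # Always advance at least one row, so a record without subtitle lines is not overwritten
    return max(current_row, row + 1)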

def convert(input_filename, output_filename):
    workbook = xlsxwriter.Workbook(output_filename)
    worksheet = workbook.add_worksheet('subtitles')
    columns = {'index':0, 'timestamp':1, 'subtitles':2}

    next_available_row = 0
    records_processed = 0
    headings = {'index':"Entries", 'timestamp':"Timecodes", 'subtitles':["Subtitles"]}
    next_available_row=write_dict_to_worksheet(columns, headings, worksheet, next_available_row)

    with open(input_filename) as textfile:
        for record in parse_subtitles(textfile):
            next_available_row = write_dict_to_worksheet(columns, record, worksheet, next_available_row)
            records_processed += 1

    print('Done converting {inp} to {outp}. {n} subtitle entries found. {m} rows written'.format(inp=input_filename, outp=output_filename, n=records_processed, m=next_available_row))
    workbook.close()

convert(input_filename='Wildlife.srt', output_filename='Subtitle.xlsx')

Edit: updated to split multi-line subtitles into multiple rows in the output.
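
To illustrate the row bookkeeping behind this change without needing Excel, here is a sketch that computes which cell each value would land in. layout_cells is a hypothetical stand-in for write_dict_to_worksheet that returns (row, column, value) tuples instead of writing to a worksheet, and the sample record is invented.

```python
def layout_cells(columns_for_keys, keyed_data, row):
    """Hypothetical stand-in for write_dict_to_worksheet: list cells instead of writing them."""
    cells = []
    current_row = row
    # The index and timestamp go on the record's first row
    for colname, colindex in columns_for_keys.items():
        if colname != 'subtitles':
            cells.append((current_row, colindex, keyed_data[colname]))
    # Each subtitle line gets its own row in the subtitle column
    subtitle_column = columns_for_keys['subtitles']
    for text in keyed_data['subtitles']:
        cells.append((current_row, subtitle_column, text))
        current_row += 1
    return cells, current_row

columns = {'index': 0, 'timestamp': 1, 'subtitles': 2}
# Invented sample record with a two-line subtitle
record = {'index': '1',
          'timestamp': '00:00:01,000 --> 00:00:04,500',
          'subtitles': ['First line', 'Second line']}

# Row 0 is assumed to hold the headings, so the first record starts at row 1
cells, next_row = layout_cells(columns, record, row=1)

if __name__ == '__main__':
    for cell in cells:
        print(cell)
    print('next free row:', next_row)
```

Here the two subtitle lines occupy rows 1 and 2 of the subtitle column, so the next record would start at row 3.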

