Use the Python xlsxwriter module to write .srt data into an Excel file
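This time, I am trying to use Python's xlsxwriter module to write data from an .srt file into Excel.
The subtitle file looks like this:
But I want to write the data into Excel so that it looks like this:
This is my first time writing Python code for this, so I am still at the trial-and-error stage... I tried writing code like the following
but I don't think it makes sense...
I will keep trying, but if you know how to do it, please let me know. I will read your code and try to understand it! Thank you! :)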
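The following breaks the problem into a few pieces:

parse_subtitles is a generator that takes a source of lines and yields a sequence of records of the form {'index': 'N', 'timestamp': 'NN:NN:NN,NNN --> NN:NN:NN,NNN', 'subtitles': ['TEXT', ...]}. The approach I took was to track which of three distinct states we are in:
(1) 'seeking to next entry', while we are looking for the next index number, which should match a digits-only regular expression such as ^\d+$ (nothing but a bunch of digits);
(2) 'looking for timestamp', once an index has been found and we expect the timestamp on the next line, which should match the regular expression ^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$ (HH:MM:SS,mmm --> HH:MM:SS,mmm); and
(3) 'reading subtitles', while consuming the actual subtitle text, with a blank line or EOF interpreted as the end of the subtitle.

write_dict_to_worksheet accepts a row and a worksheet, plus a record and a dictionary that defines the 0-indexed Excel column number for each of the record's keys, and then writes the data to the appropriate cells.

convert accepts an input filename (e.g. 'Wildlife.srt', which will be opened and passed to the parse_subtitles function) and an output filename (e.g. 'Subtitle.xlsx', which will be created using xlsxwriter). It then writes a header row and, for each record parsed from the input file, writes that record to the XLSX file.

The logging statements are left in for self-documenting purposes, and because when reproducing your input file I mistyped a : as a ; in one timestamp, which made it unrecognizable, and having the error pop up was very handy for debugging!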
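To make the line-matching rules above concrete, here is a minimal, standard-library-only sketch that classifies individual lines. The helper name classify and the sample lines are hypothetical and are not part of the answer's code. The index pattern uses \d+ on the assumption that an empty line should never count as an index.

```python
import re

# Patterns for the three kinds of structural lines in an .srt file.
INDEX = re.compile(r'^\d+$')
TIMESTAMP = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
BLANK = re.compile(r'^\s*$')


def classify(line):
    """Hypothetical helper: return 'blank', 'timestamp', 'index' or 'text'."""
    line = line.rstrip('\n')
    if BLANK.match(line):
        return 'blank'
    if TIMESTAMP.match(line):
        return 'timestamp'
    if INDEX.match(line):
        return 'index'
    return 'text'


# Made-up sample lines; the last one has ';' in place of ':' in the timestamp.
sample = [
    '1\n',
    '00:00:01,000 --> 00:00:04,000\n',
    'Hello there\n',
    '\n',
    '00:00;01,000 --> 00:00:04,000\n',
]
kinds = [classify(line) for line in sample]
print(kinds)
```

The mistyped timestamp on the last sample line is classified as plain text rather than as a timestamp, which is the kind of mismatch the logged error messages are meant to surface.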
I have put a text version of your source file, along with the following code, in this Gist.
import xlsxwriter
import re
import logging


def parse_subtitles(lines):
    """Yield one record per subtitle entry found in an iterable of lines."""
    # \d+ rather than \d*, so that an empty line is never mistaken for an index
    line_index = re.compile(r'^\d+$')
    line_timestamp = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
    line_separator = re.compile(r'^\s*$')

    current_record = {'index': None, 'timestamp': None, 'subtitles': []}
    state = 'seeking to next entry'

    for line in lines:
        line = line.strip('\n')
        if state == 'seeking to next entry':
            if line_index.match(line):
                logging.debug('Found index: {i}'.format(i=line))
                current_record['index'] = line
                state = 'looking for timestamp'
            elif line_separator.match(line):
                pass  # extra blank line between entries; keep seeking
            else:
                logging.error('HUH: Expected to find an index, but instead found: [{d}]'.format(d=line))

        elif state == 'looking for timestamp':
            if line_timestamp.match(line):
                logging.debug('Found timestamp: {t}'.format(t=line))
                current_record['timestamp'] = line
                state = 'reading subtitles'
            else:
                logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))

        elif state == 'reading subtitles':
            if line_separator.match(line):
                logging.info('Blank line reached, yielding record: {r}'.format(r=current_record))
                yield current_record
                state = 'seeking to next entry'
                current_record = {'index': None, 'timestamp': None, 'subtitles': []}
            else:
                logging.debug('Appending to subtitle: {s}'.format(s=line))
                current_record['subtitles'].append(line)

        else:
            logging.error('HUH: Fell into an unknown state: `{s}`'.format(s=state))

    if state == 'reading subtitles':
        # We must have finished the file without encountering a blank line. Dump the last record
        yield current_record


def write_dict_to_worksheet(columns_for_keys, keyed_data, worksheet, row):
    """
    Write a subtitle-record to a worksheet.
    Return the row number after those that were written (since this may write multiple rows)
    """
    current_row = row

    # First, horizontally write the entry and timecode
    for (colname, colindex) in columns_for_keys.items():
        if colname != 'subtitles':
            worksheet.write(current_row, colindex, keyed_data[colname])

    # Next, vertically write the subtitle data, one line of text per row
    subtitle_column = columns_for_keys['subtitles']
    for morelines in keyed_data['subtitles']:
        worksheet.write(current_row, subtitle_column, morelines)
        current_row += 1

    # Always advance at least one row, so that a record with no subtitle text
    # is not overwritten by the next record
    return max(current_row, row + 1)


def convert(input_filename, output_filename):
    workbook = xlsxwriter.Workbook(output_filename)
    worksheet = workbook.add_worksheet('subtitles')
    columns = {'index': 0, 'timestamp': 1, 'subtitles': 2}

    next_available_row = 0
    records_processed = 0
    headings = {'index': "Entries", 'timestamp': "Timecodes", 'subtitles': ["Subtitles"]}
    next_available_row = write_dict_to_worksheet(columns, headings, worksheet, next_available_row)

    with open(input_filename) as textfile:
        for record in parse_subtitles(textfile):
            next_available_row = write_dict_to_worksheet(columns, record, worksheet, next_available_row)
            records_processed += 1

    print('Done converting {inp} to {outp}. {n} subtitle entries found. {m} rows written'.format(
        inp=input_filename, outp=output_filename, n=records_processed, m=next_available_row))
    workbook.close()


convert(input_filename='Wildlife.srt', output_filename='Subtitle.xlsx')
Edit: updated to split multi-line subtitles into multiple rows in the output.
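To illustrate how a multi-line subtitle ends up spread over several rows, the sketch below computes the (row, column, value) cells for one record without touching xlsxwriter at all. The helper layout_cells and the sample record are hypothetical, shown only to make the row arithmetic visible. The rule that a record always advances at least one row is an assumption of this sketch.

```python
def layout_cells(record, columns, row):
    """Hypothetical helper: return ([(row, col, value), ...], next_free_row)."""
    # Index and timestamp go on the record's first row.
    cells = [(row, columns[key], record[key])
             for key in columns if key != 'subtitles']
    # Each line of subtitle text gets its own row in the subtitle column.
    current_row = row
    for text in record['subtitles']:
        cells.append((current_row, columns['subtitles'], text))
        current_row += 1
    # Assumption: advance at least one row even if there is no subtitle text.
    return cells, max(current_row, row + 1)


columns = {'index': 0, 'timestamp': 1, 'subtitles': 2}
record = {
    'index': '1',
    'timestamp': '00:00:01,000 --> 00:00:04,000',
    'subtitles': ['First line', 'Second line'],  # made-up sample text
}

cells, next_row = layout_cells(record, columns, row=1)
for cell in cells:
    print(cell)
print('next free row:', next_row)
```

With the header on row 0, this record occupies rows 1 and 2, so the next record would start on row 3.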