Python: parse text from multiple txt files
Seeking advice on how to mine items from multiple text files to build a dictionary.
This text file: https://pastebin.com/Npcp3HCM
was manually converted into this required data structure: https://drive.google.com/file/d/0B2AJ7rliSQubV0J2Z0d0eXF3bW8/view
There are thousands of such text files, and they may have different section headings, as in the example below.
I started by reading the files:
from glob import glob

txtPth = '../tr-txt/*.txt'
txtFiles = glob(txtPth)

with open(txtFiles[0], 'r') as tf:
    allLines = [line.rstrip() for line in tf]

sectionHeading = ['Corporate Participants',
                  'Conference Call Participiants',
                  'Presentation',
                  'Questions and Answers']

for lineNum, line in enumerate(allLines):
    if line in sectionHeading:
        print(lineNum, allLines[lineNum])
My idea was to find the line numbers where the section headings occur and extract the content between those line numbers, then strip out separators such as the dash lines. That didn't work, so now I'm trying to build the kind of dictionary below so that I can later run various NLP algorithms on the mined items.
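The "extract between heading line numbers" idea can be sketched like this. The six-line transcript is invented for illustration; `allLines` and `sectionHeading` mirror the snippet above.

```python
# A sketch of slicing content between consecutive heading line numbers.
# The transcript lines here are invented sample data.
allLines = ['Corporate Participants', '* Jane Doe', 'Chief Executive Officer',
            'Presentation', 'Jane Doe - Chief Executive Officer', 'Hello everyone.']
sectionHeading = ['Corporate Participants', 'Conference Call Participants',
                  'Presentation', 'Questions and Answers']

# line numbers where a heading occurs, plus a sentinel for the final section
marks = [i for i, line in enumerate(allLines) if line in sectionHeading]
marks.append(len(allLines))

# pair consecutive heading positions and slice the lines between them
sections = {allLines[start]: allLines[start + 1:end]
            for start, end in zip(marks, marks[1:])}
print(sections['Presentation'])
# → ['Jane Doe - Chief Executive Officer', 'Hello everyone.']
```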
{file-name1: {
    {date-time: [string]},
    {corporate-name: [string]},
    {corporate-participants: [name1, name2, name3]},
    {call-participants: [name4, name5]},
    {section-headings: {
        {heading1: [
            {name1: [speechOrderNum, text-content]},
            {name2: [speechOrderNum, text-content]},
            {name3: [speechOrderNum, text-content]}],
        {heading2: [
            {name1: [speechOrderNum, text-content]},
            {name2: [speechOrderNum, text-content]},
            {name3: [speechOrderNum, text-content]},
            {name2: [speechOrderNum, text-content]},
            {name1: [speechOrderNum, text-content]},
            {name4: [speechOrderNum, text-content]}],
        {heading3: [text-content]},
        {heading4: [text-content]}
        }
    }
}
The challenge is that different files may have different headings and different numbers of headings. But there will always be a section named "Presentation", and most likely a "Questions and Answers" section. The section headings are always separated by a line of equals signs. Content from different speakers is always separated by a line of dashes. The "speech order" in the Q&A section is indicated by a number in square brackets. Participants are always listed at the beginning of the document with an asterisk before their names, and their titles are always on the next line.
Any advice on how to parse these text files is appreciated. Ideally I'd like guidance on how to generate such a dictionary (or another suitable data structure) for each file, which could then be written to a database.
Thanks
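The formatting rules described above translate naturally into regular expressions. The names below are mine, and the patterns assume the rules hold exactly as stated:

```python
import re

# Regexes sketching the transcript markers described above (assumed formats).
EQ_DIVIDER = re.compile(r"^=+\s*$", re.MULTILINE)    # line of '=' between section headings
DASH_DIVIDER = re.compile(r"^-+\s*$", re.MULTILINE)  # line of '-' between speakers
SPEECH_ORDER = re.compile(r"\[(\d+)\]")              # '[n]' speech order in the Q&A section
PARTICIPANT = re.compile(r"^\s*\*\s*(.+)$", re.MULTILINE)  # '* Name' participant lines

# A made-up speaker line to show the speech-order extraction:
line = "John Roe, Hypothetical Corp [4]"
m = SPEECH_ORDER.search(line)
print(int(m.group(1)))
# → 4
```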
-- EDIT --
One of the files looks like this: https://pastebin.com/MSvmHb2e
where the "Questions and Answers" section is mislabeled as "Presentation" and there is no other "Questions and Answers" section.
And a final sample text: https://pastebin.com/jr9WfpV8
The comments in the code should explain everything. Let me know if anything is underspecified and needs more comments.
In short, I use regular expressions to find the '=' delimiter lines and subdivide the whole text into subsections, then handle each type of section separately for clarity (that way you can tell me how you want each case handled).
Side note: I'm using the words 'attendee' and 'author' interchangeably.
Edit: Updated the code to sort based on the '[x]' pattern that appears next to the attendee/author in the Presentation/Q&A sections. Also changed the pretty-printing part, since pprint doesn't handle OrderedDict well.
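The '[x]'-based sorting mentioned in the edit can be shown in isolation. The headers below are invented examples of the subsection lines that precede each speech block:

```python
import re

# Sorting subsection headers by the '[x]' speech-order marker (invented data).
headers = ["Jane Doe, ACME [3]", "John Roe, ACME [1]", "Moderator [2]"]
order_pattern = re.compile(r"\[(\d+)\]")

def speech_order(header):
    m = order_pattern.search(header)
    return int(m.group(1)) if m else 0  # fall back for headers without '[x]'

ordered = sorted(headers, key=speech_order)
print(ordered)
# → ['John Roe, ACME [1]', 'Moderator [2]', 'Jane Doe, ACME [3]']
```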
To strip any extra surrounding whitespace, including '\n', just use str.strip(). If you only need to strip '\n', then use str.strip('\n'). I've modified the code to strip any whitespace from the talks.
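A quick illustration of the difference between the two calls:

```python
# str.strip() vs str.strip('\n') on a string with mixed surrounding whitespace
s = "\n  Questions and Answers  \n"
print(repr(s.strip()))      # strips all surrounding whitespace
print(repr(s.strip('\n')))  # strips only leading/trailing newlines
# → 'Questions and Answers'
# → '  Questions and Answers  '
```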
import json
import re
from collections import OrderedDict

# Subdivides a collection of lines based on the delimiting regular expression.
# >>> example_string = '''=============================
# ... asdfasdfasdf
# ... sdfasdfdfsdfsdf
# ... =============================
# ... asdfsdfasdfasd
# ... ============================='''
# >>> subdivide(example_string, "^=+")
# ['', 'asdfasdfasdf\nsdfasdfdfsdfsdf', 'asdfsdfasdfasd', '']
def subdivide(lines, regex):
    equ_pattern = re.compile(regex, re.MULTILINE)
    sections = equ_pattern.split(lines)
    sections = [section.strip('\n') for section in sections]
    return sections

# For sections delimited by dashes, returns the heading of the section along
# with a dictionary where each key is a subsection's header and each value is
# the text in that subsection.
def process_dashed_sections(section):
    subsections = subdivide(section, "^-+")
    heading = subsections[0]  # header of the section
    d = {key: value for key, value in zip(subsections[1::2], subsections[2::2])}
    index_pattern = re.compile(r"\[(.+)\]", re.MULTILINE)

    # Sort the dictionary by capturing the pattern '[x]' and extracting the
    # number 'x'; this is passed as the key function to 'sorted'.
    def order_key(item):
        mat = index_pattern.findall(item[0])
        if mat:
            return int(mat[0])
        # Subsections that contain '-'s but no '[x]' pattern would otherwise
        # cause issues; fall back to 0 for those.
        return 0

    o_d = OrderedDict(sorted(d.items(), key=order_key))
    return heading, o_d

# Renames the keys of dictionary 'd' to the proper names present in the
# attendees. It searches for the best match for each key in the 'attendees'
# list and replaces the corresponding key.
# >>> d = {'mr. man ceo of company [1]': ' This is talk a',
# ...      'ms. woman ceo of company [2]': ' This is talk b'}
# >>> l = ['mr. man', 'ms. woman']
# >>> assign_attendee(d, l)
# OrderedDict([('mr. man', 'This is talk a'), ('ms. woman', 'This is talk b')])
def assign_attendee(d, attendees):
    new_d = OrderedDict()
    for key, value in d.items():
        a = [a for a in attendees if a in key]
        if len(a) == 1:
            # strip any extra surrounding whitespace, including '\n'
            new_d[a[0]] = value.strip()
        elif len(a) == 0:
            new_d[key] = value.strip()
    return new_d

if __name__ == '__main__':
    with open('input.txt', 'r') as input_file:  # renamed so it doesn't shadow the builtin
        lines = input_file.read()

    # regex pattern for matching the header of each section
    header_pattern = re.compile(r"^.*[^\n]", re.MULTILINE)
    # regex pattern for matching the sections that contain
    # the list of attendees (those that start with asterisks)
    ppl_pattern = re.compile(r"^(\s+\*)(.+)(\s.*)", re.MULTILINE)
    # regex pattern for matching sections with subsections in them
    dash_pattern = re.compile(r"^-+", re.MULTILINE)

    ppl_d = dict()
    talks_d = dict()

    # Step 1. Divide the entire document into sections using the '=' divider.
    sections = subdivide(lines, "^=+")
    header = ""

    # Step 2. Handle each section like a switch case.
    for section in sections:
        # Handle headers.
        if len(section.split('\n')) == 1 and section:  # likely just a header; skip empty splits
            header = header_pattern.match(section).string
        # Handle attendees/authors.
        elif ppl_pattern.match(section):
            ppls = ppl_pattern.findall(section)
            d = {key.strip(): value.strip() for (_, key, value) in ppls}
            ppl_d.update(d)
            # Assuming that if the previous section was detected as a header,
            # this section relates to that header.
            if header:
                talks_d.update({header: ppl_d})
        # Handle subsections.
        elif dash_pattern.findall(section):
            heading, d = process_dashed_sections(section)
            talks_d.update({heading: d})
        # Otherwise it is just some loose text.
        else:
            # Assuming that if the previous section was detected as a header,
            # this section relates to that header.
            if header:
                talks_d.update({header: section})

    # Assign the talk material to the appropriate attendee/author.
    # Still works if no match is found.
    for key, value in talks_d.items():
        if isinstance(value, dict):  # plain-text sections have no keys to rename
            talks_d[key] = assign_attendee(value, ppl_d.keys())

    # OrderedDict does not pretty-print with 'pprint', so as a small hack,
    # use json output to pretty-print instead.
    print(json.dumps(talks_d, indent=4))
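To see how the '='-based subdivision behaves on its own, here is a standalone sketch. The transcript snippet is invented, and `subdivide` is re-declared minimally so the example runs by itself:

```python
import re

# Minimal re-declaration of subdivide() for a self-contained demo.
def subdivide(text, regex):
    sections = re.compile(regex, re.MULTILINE).split(text)
    return [s.strip('\n') for s in sections]

# An invented fragment following the transcript format described in the question.
sample = """================
Presentation
================
Some Speaker, Some Firm [1]
----------------
Hello everyone."""

parts = subdivide(sample, "^=+")
print(parts)
# → ['', 'Presentation', 'Some Speaker, Some Firm [1]\n----------------\nHello everyone.']
```

Note the leading empty string: re.split yields it because the document starts with a delimiter line, which is why the main loop has to tolerate empty sections.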
Could you confirm whether you only need the "Presentation" and "Questions and Answers" sections? Also, regarding the output, would it be OK to dump a CSV format similar to your "manual conversion"?
Updated solution that works for every sample file you provided.
# state = ["other", "head", "present", "qa", "speaker", "data"]
# codes :    0        1        2        3       4          5

def writecell(out, data):
    out.write(data)
    out.write(",")

def readfile(fname, outname):
    initstate = 0
    f = open(fname, "r")
    out = open(outname, "w")
    head = ""
    head_written = 0
    quotes = 0
    had_speaker = 0
    for line in f:
        line = line.strip()
        if not line:
            continue
        if initstate in [0, 5] and not any(s != "=" for s in line):
            if initstate == 5:
                out.write('"')
                quotes = 0
                out.write("\n")
            initstate = 1
        elif initstate in [0, 5] and not any(s != "-" for s in line):
            if initstate == 5:
                out.write('"')
                quotes = 0
                out.write("\n")
            initstate = 4
        elif initstate == 1 and line == "Presentation":
            initstate = 2
            head = "Presentation"
            head_written = 0
        elif initstate == 1 and line == "Questions and Answers":
            initstate = 3
            head = "Questions and Answers"
            head_written = 0
        elif initstate == 1 and not any(s != "=" for s in line):
            initstate = 0
        elif initstate in [2, 3] and not any(s not in "=-" for s in line):
            initstate = 4
        elif initstate == 4 and '[' in line and ']' in line:
            comma = line.find(',')
            speech_st = line.find('[')
            speech_end = line.find(']')
            if speech_st == -1:
                initstate = 0
                continue
            if comma == -1:
                firm = ""
                speaker = line[:speech_st].strip()
            else:
                speaker = line[:comma].strip()
                firm = line[comma + 1:speech_st].strip()
            head_written = 1
            if head_written:
                writecell(out, head)
                head_written = 0
            order = line[speech_st + 1:speech_end]
            writecell(out, speaker)
            writecell(out, firm)
            writecell(out, order)
            had_speaker = 1
        elif initstate == 4 and not any(s not in "=-" for s in line):
            if had_speaker:
                initstate = 5
                out.write('"')
                quotes = 1
            had_speaker = 0
        elif initstate == 5:
            line = line.replace('"', '""')
            out.write(line)
        else:
            continue
    f.close()
    if quotes:
        out.write('"')
    out.close()

readfile("Sample1.txt", "out1.csv")
readfile("Sample2.txt", "out2.csv")
readfile("Sample3.txt", "out3.csv")
This solution contains a state machine that works as follows: (1) detect whether a heading is present and, if so, write it; (2) after the heading is written, detect the speaker; (3) write that speaker's remarks; (4) switch to the next speaker, and so on.
You can then process the CSV files however you need. Once the basic processing is done, you can also fill the data into any format you want.
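As a sketch of that post-processing step: the state machine above emits rows shaped like `heading,speaker,firm,order,"speech"`. The row below is invented, and `out_demo.csv` is just a scratch file for the demo; the real contents of out1.csv depend on your data.

```python
import csv

# Fabricate one CSV row in the shape emitted by the state machine above,
# then load it back with csv.reader (which handles the quoted speech field).
sample_row = 'Questions and Answers,John Roe,ACME,4,"Thanks, good question."\n'
with open("out_demo.csv", "w") as f:
    f.write(sample_row)

with open("out_demo.csv", newline="") as f:
    rows = list(csv.reader(f))

print(rows[0])
# → ['Questions and Answers', 'John Roe', 'ACME', '4', 'Thanks, good question.']
```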
Edit:
Please replace the 'writecell' function with:
def writecell(out, data):
    data = data.replace('"', '""')
    out.write('"')
    out.write(data)
    out.write('"')
    out.write(",")