Parsing relatively structured text files in Python and inserting into MongoDB
Testbed: ABC123
Image : FOOBAR
Keyword: heredity
Date : 6/27
Other : XYZ suite crash
Suite : XYZ, crash post XYZ delivery
Failure:
Reason :
Known :
Failure:
Reason :
Known :
Type :
Notes :
Testbed: ABC456
Image : FOOBAR
Keyword: isolate
Date :6/27
Other : 3 random failures in 3 different test suites
Suite : LMO Frag
Failure: jumbo_v4_to_v6
Reason : ?
Known : ?
Type :
Notes :
Suite : XYZ suite
Failure: XYZ_v4_to_v4v
Reason : failed to receive expected packets
Known : ?
Type :
Notes :
Suite : RST
Failure: RST_udp_v4_to_v6
Reason : failed to receive expected packets
Known : ?
Type :
Notes :
Image : BARFOO
Keyword: repugnat
Date : 6/26
Other :
Suite : PQR test
Failure: unable to destroy flow - flow created without ppx flow id
Reason : SCRIPT issue
Known : maybe?
Type : embtest
Notes :
Suite : UVW suite
Failure: 8 failures in UVW duplicate - interworking cases not working!
Reason : ?
Known : ?
Type :
Notes :
I am trying to create documents of the type
{
    "_id" : "xxxxxxxxxxxxx",
    "platform" : "ABC123",
    "image" : "FOOBAR",
    "keyword" : "parricide",
    "suite" : [
        {
            "name" : "RST (rst_only_v6v_to_v6)",
            "notes" : "",
            "failure" : "flow not added properly",
            "reason" : "EMBTEST script issue",
            "known" : "yes?",
            "type" : ""
        }
    ]
}
Each document should be unique for a given testbed, platform and image.
I have tried using regex and came up with something of this format, but it is fragile: the structured text is written by hand, and any human error in it makes the parsing fail because of the dependencies between the fields:
for iter in content:
    if re.match(r"\s*testbed",iter,re.IGNORECASE):
        testbed = iter.split(':')[1].strip()
        if result_doc['platform'] is None:
            result_doc['platform'] = testbed
    if re.match(r"\s*image",iter,re.IGNORECASE):
        image = iter.split(':')[1].strip()
        if result_doc['image'] is None:
            result_doc['image'] = image
    if re.match(r"\s*keyword",iter,re.IGNORECASE):
        keyword = iter.split(':')[1].strip()
        if result_doc['keyword'] is None:
            result_doc['keyword'] = keyword
        key = str(testbed)+'-'+str(image)+'-'+str(keyword)
        if prev_key is None:
            prev_key = key
        if key != prev_key: #if keys differ, then add to db
            self.insert(result_doc)
            prev_key = key
            result_doc = self.getTemplate("result") #assign new document template
            result_doc['platform'] = testbed
            result_doc['image'] = image
            result_doc['keyword'] = keyword
            result_doc['_id'] = key
    if re.match(r"\s*suite",iter,re.IGNORECASE):
        suitename = iter.split(':')[1].strip()
    if re.match(r"\s*Failure",iter,re.IGNORECASE):
        suitefailure = iter.split(':')[1].strip()
        result_suite = self.getTemplate("suite") # assign new suite template
        result_suite['name'] = suitename
        result_suite['failure'] = suitefailure
    if re.match(r"\s*Reason",iter,re.IGNORECASE):
        suitereason = iter.split(':')[1].strip()
        result_suite['reason'] = suitereason
    if re.match(r"\s*Known",iter,re.IGNORECASE):
        suiteknown = iter.split(':')[1].strip()
        result_suite['known'] = suiteknown
    if re.match(r"\s*type",iter,re.IGNORECASE):
        suitetype = iter.split(':')[1].strip()
        result_suite['type'] = suitetype
    if re.match(r"\s*Notes",iter,re.IGNORECASE):
        suitenotes = iter.split(':')[1].strip()
        result_suite['notes'] = suitenotes
        result_doc['suite'].append(result_suite)
self.insert(result_doc) #Last document to be inserted
Is there a better way to do this than matching on the next tag to decide when to create a new document?
Thanks
Yes, there is definitely a better, more robust way to do this. One would use a hash table, or Python "dictionary", to store the key/value pairs provided in an input file and do some formatting to print them out in the desired output format.
# Predefine some constants / inputs
testbed_dict = { "_id" : "xxxxxxxxxxxxx", "platform" : "ABC456" }
inputFile = "ABC456.txt"

with open(inputFile,"r") as infh:
    inputLines = infh.readlines()

# Line ranges of each "Image" block (enumerate avoids list.index() returning
# the first match when two Image lines are identical)
image_start_indices = [i for i, x in enumerate(inputLines) if x.split(":")[0].strip() == "Image"]
image_end_indices = [x-1 for x in image_start_indices[1:]]
image_end_indices.append(len(inputLines)-1)
image_start_stops = list(zip(image_start_indices, image_end_indices))

# Line ranges of each suite: from its "Suite" line through its "Notes" line.
# list(zip(...)) and .items() keep this runnable on both Python 2 and 3;
# key order in the output follows the interpreter's dict iteration order.
suite_start_indices = [i for i, x in enumerate(inputLines) if x.split(":")[0].strip() == "Suite"]
suite_end_indices = [i+1 for i, x in enumerate(inputLines) if x.split(":")[0].strip() == "Notes"]
suite_start_stops = list(zip(suite_start_indices,suite_end_indices))

for image_start_index, image_stop_index in image_start_stops:
    suiteCount = 1
    image_suite_indices, suites, image_dict = [], [], {}
    # Keep only the suites that fall inside this image block
    for start, stop in suite_start_stops:
        if start >= image_stop_index or image_start_index >= stop:
            continue
        image_suite_indices.append((start,stop))
    suites = [inputLines[x:y] for x, y in image_suite_indices]
    header_end_index = min([x for x, y in image_suite_indices])
    # Header fields (Image, Keyword, Date, Other) come before the first suite
    for line in inputLines[image_start_index:header_end_index]:
        if line.strip() == "":
            continue
        key, value = (line.split(":")[0].strip().lower(), line.split(":")[1].strip())
        image_dict[key] = value
    for suite in suites:
        suite_dict = {}
        for line in suite:
            if line.strip() == "":
                continue
            key, value = (line.split(":")[0].strip().lower(), line.split(":")[1].strip())
            suite_dict[key] = value
        image_dict["suite "+str(suiteCount)] = suite_dict
        suiteCount += 1
    # Write one output file per image, named after the image
    with open(image_dict["image"]+".txt","w") as outfh:
        outfh.write('{\n')
        for key, value in testbed_dict.items():
            outfh.write('\t"'+key+'" : "'+testbed_dict[key]+'"\n')
        for key, value in image_dict.items():
            if 'suite' in key:
                continue
            else:
                outfh.write('\t"'+key+'" : "'+value+'",\n')
        for key, value in image_dict.items():
            if 'suite' not in key:
                continue
            else:
                outfh.write('\t"suite" : [\n\t\t{\n')
                for suitekey, suitevalue in value.items():
                    outfh.write('\t\t\t"'+suitekey+'" : "'+str(suitevalue)+'",\n')
                outfh.write("\t\t}\n")
                outfh.write("\t],\n")
        outfh.write('}\n')
The above code expects to be run in the same directory as the input file (i.e. inputFile = "ABC456.txt") and writes one output file per "Image" block found in the input; for your ABC456 example the outputs written would be "FOOBAR.txt" and "BARFOO.txt". For example, if "ABC456.txt" contains the text of the "Testbed: ABC456" section in your question above, then the outputs will be the following.
BARFOO.txt
{
    "platform" : "ABC456"
    "_id" : "xxxxxxxxxxxxx"
    "keyword" : "repugnat",
    "image" : "BARFOO",
    "other" : "",
    "date" : "6/26",
    "suite" : [
        {
            "notes" : "",
            "failure" : "8 failures in UVW duplicate - interworking cases not working!",
            "reason" : "?",
            "known" : "?",
            "suite" : "UVW suite",
            "type" : "",
        }
    ],
    "suite" : [
        {
            "notes" : "",
            "failure" : "unable to destroy flow - flow created without ppx flow id",
            "reason" : "SCRIPT issue",
            "known" : "maybe?",
            "suite" : "PQR test",
            "type" : "embtest",
        }
    ],
}
FOOBAR.txt
{
    "platform" : "ABC456"
    "_id" : "xxxxxxxxxxxxx"
    "keyword" : "isolate",
    "image" : "FOOBAR",
    "other" : "3 random failures in 3 different test suites",
    "date" : "6/27",
    "suite" : [
        {
            "notes" : "",
            "failure" : "RST_udp_v4_to_v6",
            "reason" : "failed to receive expected packets",
            "known" : "?",
            "suite" : "RST",
            "type" : "",
        }
    ],
    "suite" : [
        {
            "notes" : "",
            "failure" : "XYZ_v4_to_v4v",
            "reason" : "failed to receive expected packets",
            "known" : "?",
            "suite" : "XYZ suite",
            "type" : "",
        }
    ],
    "suite" : [
        {
            "notes" : "",
            "failure" : "jumbo_v4_to_v6",
            "reason" : "?",
            "known" : "?",
            "suite" : "LMO Frag",
            "type" : "",
        }
    ],
}
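Note that the files written above are not strictly valid JSON: the first two entries have no trailing comma, the last entry of each object does, and the "suite" key is repeated. A minimal sketch of producing a well-formed document instead, assuming the header fields and suites have already been parsed into dicts, is to collect the suites into one list and let the standard json module do the serialization. The names build_document, header and suites are illustrative and not part of the code above, and deriving "_id" from testbed, image and keyword is an assumption borrowed from the key built in the question's code.

```python
import json

def build_document(testbed, header, suites):
    """Combine parsed pieces into one document shaped like the one in the question."""
    doc = {
        # Assumption: _id is testbed-image-keyword, as in the question's code.
        "_id": "-".join([testbed, header.get("image", ""), header.get("keyword", "")]),
        "platform": testbed,
    }
    doc.update(header)            # image, keyword, date, other
    doc["suite"] = list(suites)   # one "suite" array instead of repeated keys
    return doc

# Values taken from the BARFOO block of the sample input.
header = {"image": "BARFOO", "keyword": "repugnat", "date": "6/26", "other": ""}
suites = [
    {"name": "PQR test",
     "failure": "unable to destroy flow - flow created without ppx flow id",
     "reason": "SCRIPT issue", "known": "maybe?", "type": "embtest", "notes": ""},
    {"name": "UVW suite",
     "failure": "8 failures in UVW duplicate - interworking cases not working!",
     "reason": "?", "known": "?", "type": "", "notes": ""},
]

doc = build_document("ABC456", header, suites)
text = json.dumps(doc, indent=4)  # handles quoting, commas and nesting
print(text)
```

Because json.dumps escapes quotes and backslashes in the values, this also avoids broken output when a failure description happens to contain a double quote.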
The code above works but has some caveats. It isn't guaranteed to preserve the ordering of the lines (the output follows dict iteration order), but assuming you're just sticking this JSON into MongoDB, ordering shouldn't matter. You would also need to modify it to handle some redundancies: if a "Suite" line has redundant info nested under it (e.g. multiple "Failure" lines, like in your ABC123 example), all but one is ignored. Hopefully you get a chance to look through the code, figure out how it's working, and modify it to meet whatever your needs are.
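One possible way to address the repeated "Failure" caveat is a small line-by-line state machine instead of index arithmetic. The sketch below is an illustration under assumptions, not a drop-in replacement: it assumes every line has the form "Key : value", that "Testbed" and "Image" lines open a new scope, and that each "Failure" line starts a new entry under the current suite name. parse_report and upsert_all are hypothetical helpers, and the "_id" format (testbed-image-keyword) is borrowed from the question's code. upsert_all relies only on pymongo's Collection.replace_one(filter, replacement, upsert=True).

```python
def parse_report(lines):
    """Return one dict per Image block, each with a "suite" list of entries."""
    docs, doc, entry = [], None, None
    testbed = suite_name = None
    for raw in lines:
        # partition splits on the first colon only, so values may contain colons
        key, sep, value = raw.partition(":")
        key, value = key.strip().lower(), value.strip()
        if not sep or not key:
            continue                      # blank or malformed line: skip it
        if key == "testbed":
            testbed = value
        elif key == "image":
            doc = {"platform": testbed, "image": value, "suite": []}
            docs.append(doc)
            suite_name = entry = None
        elif doc is None:
            continue                      # field before any Image line
        elif key == "suite":
            suite_name, entry = value, None
        elif key == "failure":
            # every Failure line opens a new entry, even under the same Suite
            entry = {"name": suite_name, "failure": value}
            doc["suite"].append(entry)
        elif entry is not None:
            entry[key] = value            # reason / known / type / notes
        else:
            doc[key] = value              # keyword / date / other
    for d in docs:
        # Assumption: _id is testbed-image-keyword, as in the question's code.
        d["_id"] = "-".join([str(d["platform"]), d["image"], d.get("keyword", "")])
    return docs

def upsert_all(collection, docs):
    """Insert or replace each document, keyed on its _id."""
    for d in docs:
        collection.replace_one({"_id": d["_id"]}, d, upsert=True)

# The ABC123 block from the question, which has two Failure lines in one suite.
sample = """\
Testbed: ABC123
Image : FOOBAR
Keyword: heredity
Date : 6/27
Other : XYZ suite crash
Suite : XYZ, crash post XYZ delivery
Failure:
Reason :
Known :
Failure:
Reason :
Known :
Type :
Notes :
""".splitlines()

docs = parse_report(sample)
print(len(docs), len(docs[0]["suite"]))

# Hypothetical usage against a real server (database and collection names
# are placeholders):
# from pymongo import MongoClient
# upsert_all(MongoClient()["results"]["failures"], docs)
```

Using an upsert keyed on "_id" means re-running the import over the same report replaces the existing documents rather than raising duplicate-key errors.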
Cheers.