Parsing a relatively structured text file in Python and inserting into MongoDB

Question

Testbed: ABC123

Image  : FOOBAR
Keyword: heredity
Date   : 6/27
Other  : XYZ suite crash

     Suite  : XYZ, crash post XYZ delivery
         Failure: 
         Reason : 
         Known  :

         Failure: 
         Reason : 
         Known  : 

         Type   :

         Notes  :

Testbed: ABC456

Image  : FOOBAR
Keyword: isolate
Date   :6/27
Other  : 3 random failures in 3 different test suites

     Suite  : LMO Frag
         Failure: jumbo_v4_to_v6 
         Reason : ?
         Known  : ?

         Type   :

         Notes  : 

    Suite  : XYZ suite
         Failure:  XYZ_v4_to_v4v
         Reason : failed to receive expected packets
         Known  : ?

         Type   :

         Notes  : 

    Suite  : RST
         Failure: RST_udp_v4_to_v6 
         Reason : failed to receive expected packets
         Known  : ?

         Type   :

         Notes  : 

Image  : BARFOO
Keyword: repugnat
Date   : 6/26
Other  : 

     Suite  : PQR test
         Failure: unable to destroy flow - flow created without ppx flow id
         Reason : SCRIPT issue
         Known  : maybe?

         Type   : embtest

         Notes  : 

    Suite  : UVW suite
         Failure:  8 failures in UVW duplicate - interworking cases not working!
         Reason : ?
         Known  : ?

         Type   :

         Notes  :

I am trying to create documents of the following type:

{
        "_id" : "xxxxxxxxxxxxx",
        "platform" : "ABC123",
        "image" : "FOOBAR",
        "keyword" : "parricide",
        "suite" : [
                {
                        "name" : "RST (rst_only_v6v_to_v6)",
                        "notes" : "",
                        "failure" : "flow not added properly",
                        "reason" : "EMBTEST script issue",
                        "known" : "yes?",
                        "type" : ""
                }
        ]
}

Where each document is unique based on the testbed, platform and image.

I have tried using regex and came up with the code below, but it is fragile: any human error in writing the structured text makes it fail, because each step depends on the tags appearing exactly as expected:

        for iter in content:
            if re.match(r"\s*testbed",iter,re.IGNORECASE):
                testbed = iter.split(':')[1].strip()
                if result_doc['platform'] == None:
                    result_doc['platform'] = testbed 

            if re.match(r"\s*image",iter,re.IGNORECASE):
                image = iter.split(':')[1].strip()
                if result_doc['image'] == None:
                    result_doc['image'] = image

            if re.match(r"\s*keyword",iter,re.IGNORECASE):
                keyword = iter.split(':')[1].strip()
                if result_doc['keyword'] == None:
                    result_doc['keyword'] = keyword 
                key = str(testbed)+'-'+str(image)+'-'+str(keyword)
                if prev_key == None:
                    prev_key = key
                if key != prev_key: #if keys differ, then add to db
                    self.insert(result_doc)
                    prev_key = key
                    result_doc = self.getTemplate("result") #assign new document template
                    result_doc['platform'] = testbed 
                    result_doc['image'] = image
                    result_doc['keyword'] = keyword
                result_doc['_id'] = key 

            if re.match(r"\s*suite",iter,re.IGNORECASE):
                suitename = iter.split(':')[1].strip()

            if re.match(r"\s*Failure",iter,re.IGNORECASE):
                suitefailure = iter.split(':')[1].strip()
                result_suite = self.getTemplate("suite") # assign new suite template
                result_suite['name'] = suitename
                result_suite['failure'] = suitefailure

            if re.match(r"\s*Reason",iter,re.IGNORECASE):
                suitereason = iter.split(':')[1].strip()
                result_suite['reason'] = suitereason

            if re.match(r"\s*Known",iter,re.IGNORECASE):
                suiteknown = iter.split(':')[1].strip()
                result_suite['known'] = suiteknown

            if re.match(r"\s*type",iter,re.IGNORECASE):
                suitetype = iter.split(':')[1].strip()
                result_suite['type'] = suitetype

            if re.match(r"\s*Notes",iter,re.IGNORECASE):
                suitenotes = iter.split(':')[1].strip()
                result_suite['notes'] = suitenotes
                result_doc['suite'].append(result_suite)

        self.insert(result_doc) #Last document to be inserted

Is there a better way to do this than matching on the next tag to create a new document?

Thanks

Answer 1

Yes, there is definitely a better, more robust way to do this. One would use a hash table, or Python "dictionary", to store the key-value pairs provided in the input file, and then do some formatting to print them out in the desired output format.
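As a small illustration of the key-value idea on its own, here is a minimal sketch that turns a few "Key : value" lines into a dictionary. The helper name parse_fields is hypothetical (it is not part of the script below), and the sketch assumes every field sits on one line with a colon after the key.

```python
def parse_fields(lines):
    """Collect 'Key : value' lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in lines:
        if ":" not in line:
            continue  # blank or malformed line
        # Split only once so that a value may itself contain ':'
        key, value = line.split(":", 1)
        fields[key.strip().lower()] = value.strip()
    return fields


# Header lines taken from the ABC456 example in the question
header = parse_fields([
    "Image  : FOOBAR",
    "Keyword: isolate",
    "Date   :6/27",
    "",
])
print(header)
```

Lower-casing the key and stripping whitespace on both sides makes the lookup tolerant of uneven spacing such as "Date   :6/27".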

# Predefine some constants / inputs
testbed_dict = { "_id" : "xxxxxxxxxxxxx", "platform" : "ABC456" }
inputFile = "ABC456.txt"

with open(inputFile,"r") as infh:
  inputLines = infh.readlines()
  # enumerate() yields each line's own index; list.index() would return the first match for repeated lines
  image_start_indices = [i for i, x in enumerate(inputLines) if x.split(":")[0].strip() == "Image"]
  image_end_indices = [x-1 for x in image_start_indices[1:]]
  image_end_indices.append(len(inputLines)-1)
  # list() so the pairs can be iterated repeatedly (on Python 3, zip() is a one-shot iterator)
  image_start_stops = list(zip(image_start_indices, image_end_indices))
  suite_start_indices = [i for i, x in enumerate(inputLines) if x.split(":")[0].strip() == "Suite"]
  suite_end_indices = [i+1 for i, x in enumerate(inputLines) if x.split(":")[0].strip() == "Notes"]
  suite_start_stops = list(zip(suite_start_indices,suite_end_indices))
  for image_start_index, image_stop_index in image_start_stops:
    suiteCount = 1
    image_suite_indices, suites, image_dict = [], [], {}
    for start, stop in suite_start_stops:
      if start >= image_stop_index or image_start_index >= stop:
        continue
      image_suite_indices.append((start,stop))
    suites = [inputLines[x:y] for x, y in image_suite_indices]
    header_end_index = min([x for x, y in image_suite_indices])
    for line in inputLines[image_start_index:header_end_index]:
      if line.strip() == "":
        continue
      key, value = (line.split(":")[0].strip().lower(), line.split(":")[1].strip())
      image_dict[key] = value
    for suite in suites:
      suite_dict = {}
      for line in suite:
        if line.strip() == "":
          continue
        key, value = (line.split(":")[0].strip().lower(), line.split(":")[1].strip())
        suite_dict[key] = value
      image_dict["suite "+str(suiteCount)] = suite_dict
      suiteCount += 1
    with open(image_dict["image"]+".txt","w") as outfh:
      outfh.write('{\n')
      # .items() works on both Python 2 and Python 3 (iteritems() exists only on Python 2)
      for key, value in testbed_dict.items():
        outfh.write('\t"'+key+'" : "'+testbed_dict[key]+'"\n')
      for key, value in image_dict.items():
        if 'suite' in key:
          continue
        else:
          outfh.write('\t"'+key+'" : "'+value+'",\n')
      for key, value in image_dict.items():
        if 'suite' not in key:
          continue
        else:
          outfh.write('\t"suite" : [\n\t\t{\n')
          for suitekey, suitevalue in value.items():
            outfh.write('\t\t\t"'+suitekey+'" : "'+str(suitevalue)+'",\n')
          outfh.write("\t\t}\n")
          outfh.write("\t],\n")
      outfh.write('}\n')

The above code expects to be run in the same directory as its input file (i.e. 'inputFile = "ABC456.txt"'), and writes a variable number of output files depending on how many "images" are present in the input. In the case of your ABC456, the outputs written would be "FOOBAR.txt" and "BARFOO.txt". For example, if "ABC456.txt" contains the text of the section "Testbed: ABC456" in your question above, then the outputs will be the following.

BARFOO.txt

{
    "platform" : "ABC456"
    "_id" : "xxxxxxxxxxxxx"
    "keyword" : "repugnat",
    "image" : "BARFOO",
    "other" : "",
    "date" : "6/26",
    "suite" : [
        {
            "notes" : "",
            "failure" : "8 failures in UVW duplicate - interworking cases not working!",
            "reason" : "?",
            "known" : "?",
            "suite" : "UVW suite",
            "type" : "",
        }
    ],
    "suite" : [
        {
            "notes" : "",
            "failure" : "unable to destroy flow - flow created without ppx flow id",
            "reason" : "SCRIPT issue",
            "known" : "maybe?",
            "suite" : "PQR test",
            "type" : "embtest",
        }
    ],
}

FOOBAR.txt

{
    "platform" : "ABC456"
    "_id" : "xxxxxxxxxxxxx"
    "keyword" : "isolate",
    "image" : "FOOBAR",
    "other" : "3 random failures in 3 different test suites",
    "date" : "6/27",
    "suite" : [
        {
            "notes" : "",
            "failure" : "RST_udp_v4_to_v6",
            "reason" : "failed to receive expected packets",
            "known" : "?",
            "suite" : "RST",
            "type" : "",
        }
    ],
    "suite" : [
        {
            "notes" : "",
            "failure" : "XYZ_v4_to_v4v",
            "reason" : "failed to receive expected packets",
            "known" : "?",
            "suite" : "XYZ suite",
            "type" : "",
        }
    ],
    "suite" : [
        {
            "notes" : "",
            "failure" : "jumbo_v4_to_v6",
            "reason" : "?",
            "known" : "?",
            "suite" : "LMO Frag",
            "type" : "",
        }
    ],
}

The code above works but has some caveats. It doesn't preserve the ordering of the lines, but assuming you're just sticking this JSON into MongoDB, ordering shouldn't matter. Note that the files as written are not strictly valid JSON (no comma after the first two entries, trailing commas elsewhere, and a repeated "suite" key), so they need tidying before a JSON parser will accept them. Also, you would need to modify the code to handle some redundancies: if a "Suite" line has redundant info nested under it (e.g. multiple "Failure" lines, as in your ABC123 example), all but one is ignored. Hopefully you get a chance to look through the code, figure out how it's working, and modify it to meet whatever your needs are.

Cheers.
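One possible way to address the repeated "Failure" case is sketched below. It is a different technique from the index-based script above: a single pass that keeps track of the current testbed, image and suite, builds plain dictionaries, and leaves serialisation to the json module. The names parse_report and save_all are hypothetical. The sketch assumes that a "Type" or "Notes" line applies to every failure listed so far under the same suite, and that the document key is testbed-image-keyword as in the question's code. save_all assumes it is handed a pymongo Collection and uses replace_one with upsert=True.

```python
import json


def parse_report(lines):
    """Return one document per Image block, with one suite entry per Failure line."""
    docs = []
    testbed, doc, suite_name, open_entries = "", None, "", []
    for raw in lines:
        if ":" not in raw:
            continue  # blank or malformed line
        key, value = (part.strip() for part in raw.split(":", 1))
        key = key.lower()
        if key == "testbed":
            testbed = value
        elif key == "image":
            # Every Image line starts a new document
            doc = {"platform": testbed, "image": value, "keyword": "", "suite": []}
            docs.append(doc)
            open_entries = []
        elif doc is None:
            continue  # field seen before any Image line
        elif key == "keyword":
            doc["keyword"] = value
        elif key == "suite":
            suite_name, open_entries = value, []
        elif key == "failure":
            entry = {"name": suite_name, "failure": value, "reason": "",
                     "known": "", "type": "", "notes": ""}
            doc["suite"].append(entry)
            open_entries.append(entry)
        elif key in ("reason", "known") and open_entries:
            open_entries[-1][key] = value
        elif key in ("type", "notes"):
            # Assumption: Type/Notes are shared by all failures of the current suite
            for entry in open_entries:
                entry[key] = value
    for d in docs:
        d["_id"] = "-".join((d["platform"], d["image"], d["keyword"]))
    return docs


def save_all(collection, docs):
    """Upsert each document by _id; `collection` is assumed to be a pymongo Collection."""
    for d in docs:
        collection.replace_one({"_id": d["_id"]}, d, upsert=True)


SAMPLE = """\
Testbed: ABC123

Image  : FOOBAR
Keyword: heredity

     Suite  : XYZ
         Failure: first
         Reason : r1
         Known  : ?

         Failure: second
         Reason : r2
         Known  : yes

         Type   : embtest

         Notes  :
"""

docs = parse_report(SAMPLE.splitlines())
print(json.dumps(docs, indent=2))
```

Because the "Date" and "Other" lines do not appear in the target document from the question, this sketch ignores them; storing them would only need one more branch. Upserting by _id means re-running the import replaces a document rather than inserting a duplicate.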

Answer 1
0 Accepted 2014-07-04 09:42:51
