Extracting JSON data from a simple bullet list using Python and RegEx?

Question

I am writing a quiz app and want to extract multiple-choice questions as JSON from a data.txt file, which contains the questions and answers in a simple nested bullet-list format, as follows:

data.txt:

 1. Der Begriff Aerodynamik steht für Luft und Bewegung.
 Was entsteht, wenn man sich durch die Luft bewegt?
 a) Die Reibung der Luftteilchen am Körper verursacht einen Luftwiderstand.
 b) Je höher die Geschwindigkeit durch die Luft ist, desto mehr Luftkraft entsteht.
 c) Die Luft bekommt merklich Substanz und wirkt mit ihrer Kraft
 auf die ihr gebotene Form/Fläche ein.
 d) Alle Antworten sind richtig.
 2. Welcher Effekt wird durch Luftströmung um einen Körper mit einer der folgenden Formen erzeugt?
 a) Eine Tropfenform hat einen geringen Luftwiderstand.
 b) Eine hohle Halbkugel (Rundkappenreserve) hat einen hohen Luftwiderstand.
 c) Ein Flächenfallschirmprofil setzt die Luftströmung durch Sog- und Druckwirkung
 in Auftriebsenergie um.
 ...

(NOTE: there may be line breaks in both the questions and the answers)

The JSON format I would like to extract is the following:

[
  {
    "number": 1,
    "question": "Der Begriff Aerodynamik steht für Luft und Bewegung. Was entsteht, wenn man sich durch die Luft bewegt?",
    "a": "Die Reibung der Luftteilchen am Körper verursacht einen Luftwiderstand.",
    "b": "Je höher die Geschwindigkeit durch die Luft ist, desto mehr Luftkraft entsteht.",
    "c": "Die Luft bekommt merklich Substanz und wirkt mit ihrer Kraft  auf die ihr gebotene Form/Fläche ein.",
    "d": "Alle Antworten sind richtig."
  },
  {
    "number": 2,
    "question": "Welcher Effekt wird durch Luftströmung um einen Körper mit einer der folgenden Formen erzeugt?",
    "a": "Eine Tropfenform hat einen geringen Luftwiderstand.",
    "b": "Eine hohle Halbkugel (Rundkappenreserve) hat einen hohen Luftwiderstand.",
    "c": "Ein Flächenfallschirmprofil setzt die Luftströmung durch Sog- und Druckwirkung in Auftriebsenergie um.",
    "d": "Alle Antworten sind richtig."
  }
]

I was hoping to do this with a simple Python script: read data.txt, use RegEx matches to get the data, convert it to JSON, and write the result back to a file.

I looked into regular expressions but have a hard time figuring out which RegExes I need to get the matches for converting the data to my JSON format.

Does anyone know which RegEx I am looking for? Or is there a better approach to extracting the question data as JSON from the data.txt file?

If it were much simpler, I would also be happy with a JSON format that matches the simple nested structure of the original bullet list more directly.

Thanks a lot.

Answer 1

The RegExes suggested in the comments helped. I ended up with the following RegEx patterns for my answer:

\n\d{1,3}\.\s to match the question numbers, e.g. 1. (only works for question numbers of at most 3 digits, i.e. max. 999)

and

\n[a-d]{1}\)\s to match the multiple-choice answer letters, e.g. a)
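As a quick check of what these two patterns match, here is a minimal sketch. The sample text is made up for illustration and only mimics the shape of data.txt; the names question_marker and answer_marker are assumptions, not part of the original script. Note that both patterns begin with \n, so a marker is only found when a newline precedes it.

```python
import re

# Hypothetical sample text shaped like data.txt. The leading newline matters
# because both patterns start with \n.
sample = (
    "\n1. Erste Frage?"
    "\na) Antwort A"
    "\nb) Antwort B, die über"
    "\nzwei Zeilen geht"
    "\n2. Zweite Frage?"
    "\na) Antwort A"
)

# Raw strings keep the backslashes intact for the regex engine.
question_marker = re.compile(r"\n\d{1,3}\.\s")
answer_marker = re.compile(r"\n[a-d]\)\s")

question_hits = question_marker.findall(sample)
answer_hits = answer_marker.findall(sample)

print(question_hits)  # ['\n1. ', '\n2. ']
print(answer_hits)    # ['\na) ', '\nb) ', '\na) ']
```

The wrapped line "zwei Zeilen geht" matches neither pattern, which is why line breaks inside an answer do not start a new entry.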

Since nobody had the perfect few lines of code I was hoping for, I ended up writing a more obvious solution: I break the string into substrings using the RegExes, add those substrings to lists, and then convert the result to JSON. The script I ended up with is the following:

import re
import json

#------------------------

firstQuestionNumber = 1

filename = 'data'
fileextension = '.txt'

#------------------------

# read the whole file (Python 3); utf-8 keeps the German umlauts intact
with open(filename + fileextension, 'r', encoding='utf-8') as f:
    s = f.read()

# replace question numbers and answer letters with $$$ and ### so it can easily be split later
# recover the question and answer numbers by order of appearance (assuming continuous numbering)
# note: both patterns start with \n, so the first question is only found
# if the file begins with an empty line

s1 = re.sub(r"\n\d{1,3}\.\s", "$$$", s)
s2 = re.sub(r"\n[a-d]\)\s", "###", s1)

questionList = []

questions = s2.split("$$$")

# enumerate yields the position directly; questions.index(question) would
# return the wrong position if two chunks had identical text
for questionNumber, question in enumerate(questions):

    # chunk 0 is whatever precedes the first question number
    if questionNumber != 0:
        questionSplits = question.split("###")

        # assumes every question has exactly four answers, a) to d)
        questionData = {}
        questionData["number"] = questionNumber - 1 + firstQuestionNumber
        questionData["question"] = questionSplits[0]
        questionData["a"] = questionSplits[1]
        questionData["b"] = questionSplits[2]
        questionData["c"] = questionSplits[3]
        questionData["d"] = questionSplits[4]

        questionList.append(questionData)


json_data = json.dumps(questionList)

with open(filename + '_json' + '.txt', 'w', encoding='utf-8') as f:
    f.write(json_data)

Thanks for giving me the hints that led to this solution.
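For comparison, the same idea can be sketched with re.split and capturing groups, which keeps the question numbers and answer letters from the text instead of recovering them by position. This is only a sketch under stated assumptions, not the method from the script above: parse_quiz and the sample text are hypothetical, the patterns additionally tolerate indentation before a marker, and line breaks inside a question or answer are collapsed to single spaces. A wrapped line that happens to start with something like "3. " or "b) " would still be mistaken for a new entry.

```python
import json
import re


def parse_quiz(text):
    """Parse numbered questions with lettered answers into a list of dicts."""
    # Prepend a newline so the first question is found even when the text
    # starts directly with "1. ".
    text = "\n" + text
    # The capturing group makes re.split keep the question numbers.
    parts = re.split(r"\n\s*(\d{1,3})\.\s", text)
    quiz = []
    # parts[0] is whatever precedes the first question; after that the list
    # alternates between a number and the body belonging to it.
    for number, body in zip(parts[1::2], parts[2::2]):
        pieces = re.split(r"\n\s*([a-d])\)\s", body)
        entry = {
            "number": int(number),
            # split() + join collapses line breaks and repeated spaces
            "question": " ".join(pieces[0].split()),
        }
        for letter, answer in zip(pieces[1::2], pieces[2::2]):
            entry[letter] = " ".join(answer.split())
        quiz.append(entry)
    return quiz


# Hypothetical sample with a wrapped question and a wrapped answer.
sample = (
    "1. Frage eins,\nzweite Zeile?\n"
    "a) Antwort A\n"
    "b) Antwort B\ngeht weiter\n"
    "2. Frage zwei?\n"
    "a) Ja\n"
    "b) Nein\n"
)

quiz = parse_quiz(sample)
# ensure_ascii=False writes umlauts as-is instead of \uXXXX escapes
print(json.dumps(quiz, ensure_ascii=False, indent=2))
```

Because the answer letters come from the text, a question with fewer than four answers simply yields fewer keys rather than raising an IndexError.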

Answer 1: 0, accepted, 2016-11-28 13:25:22