Extract JSON data from simple bullet list using python and RegEx?

Question

I am writing a quiz app and am looking to extract multiple choice questions as JSON from a data.txt file, which contains the questions + answers in a simple nested bullet list format as follows:

data.txt:

 1. Der Begriff Aerodynamik steht für Luft und Bewegung.
 Was entsteht, wenn man sich durch die Luft bewegt?
 a) Die Reibung der Luftteilchen am Körper verursacht einen Luftwiderstand.
 b) Je höher die Geschwindigkeit durch die Luft ist, desto mehr Luftkraft entsteht.
 c) Die Luft bekommt merklich Substanz und wirkt mit ihrer Kraft
 auf die ihr gebotene Form/Fläche ein.
 d) Alle Antworten sind richtig.
 2. Welcher Effekt wird durch Luftströmung um einen Körper mit einer der folgenden Formen erzeugt?
 a) Eine Tropfenform hat einen geringen Luftwiderstand.
 b) Eine hohle Halbkugel (Rundkappenreserve) hat einen hohen Luftwiderstand.
 c) Ein Flächenfallschirmprofil setzt die Luftströmung durch Sog- und Druckwirkung
 in Auftriebsenergie um.
 ...

(NOTE: there may be line breaks in both the questions and the answers)

The desired JSON format I would like to extract is the following:

[
  {
    "number": 1,
    "question": "Der Begriff Aerodynamik steht für Luft und Bewegung. Was entsteht, wenn man sich durch die Luft bewegt?",
    "a": "Die Reibung der Luftteilchen am Körper verursacht einen Luftwiderstand.",
    "b": "Je höher die Geschwindigkeit durch die Luft ist, desto mehr Luftkraft entsteht.",
    "c": "Die Luft bekommt merklich Substanz und wirkt mit ihrer Kraft  auf die ihr gebotene Form/Fläche ein.",
    "d": "Alle Antworten sind richtig."
  },
  {
    "number": 2,
    "question": "Welcher Effekt wird durch Luftströmung um einen Körper mit einer der folgenden Formen erzeugt?",
    "a": "Eine Tropfenform hat einen geringen Luftwiderstand.",
    "b": "Eine hohle Halbkugel (Rundkappenreserve) hat einen hohen Luftwiderstand.",
    "c": "Ein Flächenfallschirmprofil setzt die Luftströmung durch Sog- und Druckwirkung in Auftriebsenergie um.",
    "d": "Alle Antworten sind richtig."
  }
]

I was hoping to do this with a simple Python script: read my data.txt, use RegEx matches to get the data, convert it to JSON accordingly, and write that back to a file.

I looked into regular expressions but have a hard time figuring out which RegExs I need to get the matches for converting the data to my JSON format.

Does anyone know which RegEx I am looking for? Or is there a better approach to extracting the question data as JSON from the data.txt file?

If it were much simpler, I would also be happy with a JSON format that matches the simple nested data structure of the original bullet list format more directly.

Thanks a lot.

Answer 1

The RegExs suggested in the comments helped. I ended up with the following two patterns:

\n\d{1,3}\.\s to match the question numbers, e.g. 1. (only works for question numbers of up to 3 digits, i.e. max. 999)

and

\n[a-d]{1}\)\s to match the multiple choice answers, e.g. a)
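
As a quick check of what these two patterns pick up, here is a minimal sketch. The sample string is made up for illustration and only mimics the layout of data.txt. Note that both patterns start with \n, so every marker has to be preceded by a line break.

```python
import re

# Hypothetical sample in the same layout as data.txt. The leading newline
# matters: both patterns require a line break in front of every marker.
sample = "\n1. First question?\na) yes\nb) no\nc) maybe\nd) all of them\n2. Second question?\na) one"

question_markers = re.findall(r"\n\d{1,3}\.\s", sample)
answer_markers = re.findall(r"\n[a-d]{1}\)\s", sample)

print(question_markers)  # ['\n1. ', '\n2. ']
print(answer_markers)    # ['\na) ', '\nb) ', '\nc) ', '\nd) ', '\na) ']
```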

Since nobody had the perfect few lines of code I was hoping for, I ended up writing a more obvious solution: I break the string into substrings using the RegExs, add those substrings to lists, and then convert the result to JSON. The script I ended up with is the following:

# coding=utf-8
import re
import json

#------------------------

firstQuestionNumber = 1

filename = 'data'
fileextension = '.txt'

#------------------------

# read the whole file; the with block closes it again afterwards
with open(filename + fileextension, 'r') as f:
    s = f.read()

# replace question numbers and answer letters with $$$ and ### so it can easily be split later
# recover the question and answer numbers by order of appearance (assuming continuous numbering)

# raw strings so the backslashes reach the regex engine unchanged
s1 = re.sub(r"\n\d{1,3}\.\s","$$$", s)
s2 = re.sub(r"\n[a-d]{1}\)\s","###", s1)

questionList = [] 

questions = s2.split("$$$")

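# enumerate gives the position directly; list.index() would return the
# first occurrence and give wrong numbers for duplicated question texts
for questionNumber, question in enumerate(questions):

    # questions[0] is whatever precedes the first question marker, so skip it
    if questionNumber!=0: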
        questionSplits = question.split("###")

        questionData = {}
        questionData["number"] = questionNumber - 1 + firstQuestionNumber
        questionData["question"] = questionSplits[0]
        questionData["a"] = questionSplits[1]
        questionData["b"] = questionSplits[2]
        questionData["c"] = questionSplits[3]
        questionData["d"] = questionSplits[4]

        questionList.append(questionData)


json_data = json.dumps(questionList)

# write the JSON string; the with block makes sure the file is flushed and closed
with open(filename+'_json'+'.txt', 'w') as f:
    f.write(json_data)
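
One possible variation on the script above, as a sketch rather than a drop-in replacement: re.split with a capturing group keeps the matched number or letter in the result, so the placeholder strings are not needed and the numbering does not have to be continuous. The names parse_quiz and sample are made up for this example. The extra \s* (to tolerate indented lines) and the collapsing of wrapped lines into single spaces are assumptions about what the input looks like and what output is wanted.

```python
import json
import re


def parse_quiz(text):
    """Parse numbered questions with a) to d) answers into a list of dicts."""
    # Prepend a newline so a question on the very first line is matched too.
    text = "\n" + text
    # The capturing group makes re.split keep the question numbers:
    # [preamble, "1", body1, "2", body2, ...]
    parts = re.split(r"\n\s*(\d{1,3})\.\s", text)
    questions = []
    for number, body in zip(parts[1::2], parts[2::2]):
        # Same trick for the answers: [question, "a", text_a, "b", text_b, ...]
        pieces = re.split(r"\n\s*([a-d])\)\s", body)
        entry = {"number": int(number), "question": " ".join(pieces[0].split())}
        for letter, answer in zip(pieces[1::2], pieces[2::2]):
            # Collapse line breaks and repeated spaces inside wrapped answers.
            entry[letter] = " ".join(answer.split())
        questions.append(entry)
    return questions


# Hypothetical input in the same layout as data.txt.
sample = """1. First line of the question.
continued on a second line?
a) Short answer.
b) An answer that wraps
onto the next line.
2. Second question?
a) Only one answer here.
"""

print(json.dumps(parse_quiz(sample), indent=2, ensure_ascii=False))
```

Passing ensure_ascii=False keeps umlauts readable in the output instead of escaping them. Like the original patterns, this sketch would misread a wrapped line that happens to start with something like "12. " or "b) " as a new marker.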

Thanks for the hints that helped me come up with this solution.
