
Extract questions and answers with regex

I want to extract questions and answers from some files I'm reading, but my regex isn't working:

from re import findall,DOTALL

text='''
category 1
1. question
a) answer
b) answer
2. question
a) answer
b) answer

category 2
3. question
a) answer
b) answer
'''

The format in the files is basically a numbered list with a variable number of indexed answers like a) b) or a. b. ... with the answers spanning several lines in places. I've tried this:

mo=findall(r"^\d\.(.+)(\w\)|\.(.+))+$",text,DOTALL)
print(mo)

I tried adding capture groups to separate the questions from the answers. Removing "^" gives the closest result, but it's still junk and I don't understand why:

[(' question\na) answer\nb) answer\n2. question\na) answer\nb) answer\ncategory 2\n3', '. question\na) answer\nb) answer\n', ' question\na) answer\nb) answer\n')]

I'm considering looking for a space between the answers in order to not pick up the "category" junk as a part of the answer or controlling my input more to support a format with no space as well.

I'm trying to get output like this (it doesn't need to be tuples; that's just what findall returns for groups):

[('question', 'answer', 'answer'), 
 ('question', 'answer', 'answer'), 
 ('question', 'answer', 'answer')]
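
For what it's worth, the pattern in the question fails for two reasons that can be checked directly. Without re.MULTILINE, ^ and $ anchor only at the start and end of the whole string, and the sample text begins with a newline, so ^\d can never match. With re.DOTALL, . also matches newlines, so a greedy (.+) runs across line boundaries and swallows most of the text. The sketch below is only an illustration: single_line is a hypothetical name, and its pattern handles one-line questions only.

```python
from re import findall, DOTALL, MULTILINE

text = '''
category 1
1. question
a) answer
b) answer
2. question
a) answer
b) answer

category 2
3. question
a) answer
b) answer
'''

# Original pattern: without MULTILINE, ^ anchors only at position 0,
# which is a newline here, so nothing matches.
original = findall(r"^\d\.(.+)(\w\)|\.(.+))+$", text, DOTALL)
print(original)     # []

# With MULTILINE and without DOTALL, ^ and $ work per line and
# . stops at the end of each line.
single_line = findall(r"^\d+\.\s*(.+)$", text, MULTILINE)
print(single_line)  # ['question', 'question', 'question']
```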

Instead of writing a monster regex for a multiline requirement, I'd use normal iteration and accumulation, more or less. You can split on \n(?=[a-z\d][).] ) to isolate the Q&A chunks. Then iterate over the chunks: if a chunk is a question, start a new Q&A block; otherwise, append it to the current block to accumulate the result.

import re

text = '''
category 1
1. q1
  q1 foobar
a) a1.a
b) a1.b
  some extra a1.b
2. q2
a) a2.a
b) a2.b
  some extra a2.b
c) a2.c
 blah a2.c

category 2
3. q3
a) a3.a
b) a3.b
extra a3.b
'''

qa = []
block = []

for chunk in re.split(r"\n(?=[a-z\d][).] )", text):
    if m := re.match(r"\d+\. (.+)", chunk, re.S):
        qa.append(tuple(block))
        block = [m.group(1)]
    elif m := re.match(r"[a-z]+\) (.+?)(?=\n\n|$|[a-z]+\) )", chunk, re.S):
        block.append(m.group(1))

qa = qa[1:] + [tuple(block)]

for line in qa: 
    print(line)

Gives:

('q1\n  q1 foobar', 'a1.a', 'a1.b\n  some extra a1.b')
('q2', 'a2.a', 'a2.b\n  some extra a2.b', 'a2.c\n blah a2.c')
('q3', 'a3.a', 'a3.b\nextra a3.b')

Regex explanations:

  • \n(?=[a-z\d][).] ) splits on newlines that are followed (checked with a lookahead) by either an a) or a 1. header. Continuation lines are not split, which preserves the multiline chunks.
  • \d+\. (.+) with re.S identifies a 1. question chunk and captures the question body.
  • [a-z]+\) (.+?)(?=\n\n|$|[a-z]+\) ) with re.S matches an a) answer chunk. It's much the same as the 1. question pattern above, but it has to be careful to trim any following header content that the split in (1) did not separate out. This is what the (?=\n\n|$|[a-z]+\) ) lookahead does: if what follows is a double newline, the end of the string, or an a) header, it is not included in this answer.
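
To make the first step concrete, here is a minimal sketch of what the split produces on a shortened input. The variable names sample and chunks are illustrative and not part of the code above.

```python
import re

# Shortened, hypothetical sample in the same format as the text above.
sample = "category 1\n1. q1\n  q1 foobar\na) a1.a\nb) a1.b\n  some extra a1.b\n"

# Split only on newlines that are followed by a "1. " or "a) " style header.
chunks = re.split(r"\n(?=[a-z\d][).] )", sample)

for chunk in chunks:
    print(repr(chunk))
# 'category 1'
# '1. q1\n  q1 foobar'
# 'a) a1.a'
# 'b) a1.b\n  some extra a1.b\n'
```

The 'category 1' chunk matches neither the question nor the answer pattern, so the loop above simply skips it.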

A simpler approach is to split the text into lines and apply a regex to each line, for example:

import re


text='''
category 1
1. question
a) answer
b) answer
2. question
a) answer
b) answer

category 2
3. question
a) answer
b) answer
'''

question = re.compile(r'^\d+\.\s(.+)')
answer = re.compile(r'^[a-z]\)\s(.+)')

output = []
for line in text.splitlines():
    if question.match(line):
        output.append(question.findall(line))
    elif answer.match(line):
        output[-1].append(answer.findall(line)[0])

print(output)
# Output: [['question', 'answer', 'answer'], ['question', 'answer', 'answer'], ['question', 'answer', 'answer']]

For multiline questions and answers you may try this:

import re

text='''
category 1
1. question 1
a) answer 1a
   second line of answer 1a
b) answer 1b
2. question 2
a) answer 2a
b) answer 2b
   second line of answer 2b

category 2
3. question 3
   second line of question 3
a) answer 3a
   second line of answer 3a
b) answer 3b
   second line of answer 3b
'''

quiz = []
for category in re.split("\n\n", text):
    qa = re.findall(r"^\d+\.\s+(.*?)(^[a-z][).](?:[^\n]|\n(?!\d))*)", category, re.DOTALL | re.MULTILINE)
    for question, answers in qa:       
        quiz.append((question.strip(), *re.findall(r"^[a-z][).]\s+((?:[^\n]|\n(?![a-z]))*)", answers, re.MULTILINE)))

print(quiz)

The output is

[('question 1', 'answer 1a\n   second line of answer 1a', 'answer 1b'), ('question 2', 'answer 2a', 'answer 2b\n   second line of answer 2b'), ('question 3\n   second line of question 3', 'answer 3a\n   second line of answer 3a', 'answer 3b\n   second line of answer 3b\n')]

Since the question does not specify how to handle line endings and spaces in multiline questions and answers, it is hard to tell whether this output satisfies the requirements.
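
If single-line strings are wanted, one possible post-processing step is to collapse every run of whitespace (including the newlines and indentation of continuation lines) into a single space. This is only a sketch under that assumption; collapse_whitespace and cleaned are hypothetical names, and the input is a shortened copy of the output above.

```python
def collapse_whitespace(s):
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so joining yields one line.
    return " ".join(s.split())

# Shortened copy of the result shown above.
quiz = [
    ('question 3\n   second line of question 3',
     'answer 3a\n   second line of answer 3a',
     'answer 3b\n   second line of answer 3b\n'),
]

cleaned = [tuple(collapse_whitespace(part) for part in entry) for entry in quiz]
print(cleaned)
```

Whether joining with a space is appropriate depends on the files: for answers containing code or tables, keeping the newlines and only stripping the trailing one may be the better choice.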
