
Using regex to export data in PDF file to excel

I am using regex to get certain strings in a PDF file and write them to an excel file. The content of my PDF file is as follows:

Task 1: Question 1? answer1
Task 2: Question 2? (Format:****) answer2
Task 3: Question 3? answer3
Task 4: Question 4? (Format:*****) answer4

I want to ignore the (Format:****) parts; for the other lines the regex works fine. How can I do that? The Excel sheet should contain just the question in one column and the answer in the other.

Here is my code:

import re
import pandas as pd
from pdfminer.high_level import extract_pages, extract_text

text = extract_text("file.pdf")

pattern1 = re.compile(r":\s*(.*\?)")
pattern2 = re.compile(r".*\?\s*(.*)")
matches1 = pattern1.findall(text)
matches2 = pattern2.findall(text)
df = pd.DataFrame({'Soru-TR': matches1})
df['Cevap'] = matches2
writer = pd.ExcelWriter('Questions.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
writer.save()

You can use a single pattern with 2 capture groups, and optionally match a part between parentheses after matching the question mark.

^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)

Explanation

  • ^ Start of string (with re.MULTILINE, the start of each line)
  • [^:]*: Match any char except : and then match :
  • \s* Match optional whitespace chars
  • ([^?]+\?) Capture group 1, match 1+ chars other than ? and then match ?
  • \s+ Match 1+ whitespace chars
  • (?:\([^()]*\)?\s)? Optionally match a part in parentheses (...) followed by a whitespace char
  • (.*) Capture group 2, match the rest of the line
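To see the parts listed above side by side with the pattern, here is a small sketch that writes the same regex with re.VERBOSE so each piece carries its own comment. The variable names and the single test line are only for illustration.

```python
import re

# Same pattern as in the answer, split over several lines with re.VERBOSE.
# In verbose mode, whitespace in the pattern is ignored and # starts a comment.
pattern = re.compile(r"""
    ^[^:]*:               # label up to and including the first colon
    \s*                   # optional whitespace
    ([^?]+\?)             # group 1: the question, ending at the first question mark
    \s+                   # whitespace after the question
    (?:\([^()]*\)?\s)?    # optional parenthesized part, matched but not captured
    (.*)                  # group 2: the rest of the line (the answer)
""", re.VERBOSE | re.MULTILINE)

line = "Task 2: Question 2? (Format:****) answer2"
m = pattern.search(line)
question, answer = m.groups()
print(question)
print(answer)
```

Because the parenthesized part sits in a non-capturing group, it is consumed by the match but never ends up in group 1 or group 2.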


Example code

import re

pattern = r"^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)"

s = ("Task 1: Question 1? answer1\n"
            "Task 2: Question 2? (Format:****) answer2\n"
            "Task 3: Question 3? answer3\n"
            "Task 4: Question 4? (Format:*****) answer4")

matches1 = []  # questions (capture group 1)
matches2 = []  # answers (capture group 2)
for match in re.finditer(pattern, s, re.MULTILINE):
    matches1.append(match.group(1))
    matches2.append(match.group(2))

print(matches1)
print(matches2)

Output

['Question 1?', 'Question 2?', 'Question 3?', 'Question 4?']
['answer1', 'answer2', 'answer3', 'answer4']
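To connect this back to the Excel export in the question, a minimal sketch could look like the following. The helper names to_frame and write_excel are made up for this example, and the sample string stands in for the text that extract_text would return from the PDF. It assumes pandas plus an Excel engine such as openpyxl or xlsxwriter is installed.

```python
import re
import pandas as pd

# Pattern from the answer, compiled once with re.MULTILINE so ^ matches at each line start.
PATTERN = re.compile(r"^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)", re.MULTILINE)


def to_frame(text):
    """Hypothetical helper: one row per matched line, using the question's column names."""
    rows = PATTERN.findall(text)  # list of (question, answer) tuples
    return pd.DataFrame(rows, columns=["Soru-TR", "Cevap"])


def write_excel(df, path):
    """Hypothetical helper: write the frame to an .xlsx file."""
    # The with block saves and closes the workbook when it exits.
    with pd.ExcelWriter(path) as writer:
        df.to_excel(writer, sheet_name="Sheet1", index=False)


# Stand-in for the text extracted from the PDF.
sample = ("Task 1: Question 1? answer1\n"
          "Task 2: Question 2? (Format:****) answer2\n"
          "Task 3: Question 3? answer3\n"
          "Task 4: Question 4? (Format:*****) answer4")

df = to_frame(sample)
print(df)
```

Calling write_excel(df, "Questions.xlsx") would then produce the file. The with block is used here instead of the writer.save() call from the question because recent pandas versions no longer provide that method, and pulling both groups from a single findall keeps each question paired with its own answer.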
