
How to extract questions from a word doc with Python using regex

I am using the docx library to read a Word document, and I am trying to extract only the questions using regex search and match. I have tried many ways of doing it, but I keep getting a "TypeError".

The data I am trying to extract is this:

Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and
fees, any remaining funds will be sent to you as a refund that will
either be directly deposited (which can be set up through your
account) or mailed to you as a paper check. You can then use the
refund to pay your rent. It is important to note that financial aid may
not be available when rent is due, so make sure to have a plan in
place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?

If there is also an easier way to export the Word document to a different file type, that would be great to know. Thank you.

I am using regex101, and I have tried the following regular expressions to match only the sentences that end in a question mark:

".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"

import re
import sys
from docx import Document

wordDoc = Document('botDoc.docx')

result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
    print(result.group(0))
for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print("test")

I expect to save the matches into directories so that I can export the data to a CSV file.

Your error:

result = re.search('.*[?=?]$', wordDoc)

I believe that this line is the cause of the problem. search() is expecting a string as a second parameter, but is receiving a Document object.
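To illustrate the point above, here is a minimal sketch that reproduces the error without needing a .docx file. FakeDocument is a hypothetical stand-in for any non-string object (such as the Document returned by docx), and the helper name is made up for this example.

```python
import re


class FakeDocument:
    """Hypothetical stand-in for a non-string object such as docx's Document."""


def search_raises_type_error(obj):
    """Return True if re.search() rejects obj with a TypeError."""
    try:
        re.search(r".*\?$", obj)
    except TypeError:
        return True
    return False


# A non-string object is rejected; a plain string is accepted.
print(search_raises_type_error(FakeDocument()))
print(search_raises_type_error("Is this a string?"))
```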

What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
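A short sketch of the difference between the two functions. The sample text and the simplified pattern are made up for illustration and are not taken from your document.

```python
import re

# Hypothetical sample text standing in for the document's contents.
text = "How do I pay? See the office. Where is it? Room 2."

# Deliberately simple pattern: a capital letter, then anything up to the next '?'
# that does not cross a full stop.
pattern = r"[A-Z][^.?]*\?"

first = re.search(pattern, text)          # a Match object for the first hit (or None)
all_matches = re.findall(pattern, text)   # a list with one string per hit

print(first.group(0))
print(all_matches)
```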

Since you are working with docx, you have to extract the text of the document and pass that text as the second argument of findall(). If I remember correctly, this is done by first getting all the paragraphs and then reading the text of each individual paragraph. Refer to this question.
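The paragraph-by-paragraph approach described above could look roughly like this. It is a sketch, not a tested solution for your file: it assumes the python-docx package, and the function names questions_from_lines and questions_from_docx are invented for this example.

```python
import re

# Start at an uppercase letter and run to a '?' at the end of the line.
QUESTION_RE = re.compile(r"[A-Z].*\?$")


def questions_from_lines(lines):
    """Return the question found in each line of text, skipping lines without one."""
    found = []
    for line in lines:
        match = QUESTION_RE.search(line.strip())
        if match:
            found.append(match.group(0))
    return found


def questions_from_docx(path):
    """Read the paragraph text of a .docx file and return the questions in it."""
    # python-docx is imported here so that questions_from_lines() works without it.
    from docx import Document

    document = Document(path)
    lines = []
    for paragraph in document.paragraphs:
        # A paragraph can contain manual line breaks, so split it into lines.
        lines.extend(paragraph.text.splitlines())
    return questions_from_lines(lines)
```

Calling questions_from_docx('botDoc.docx') would then return a list of strings. Note that document.paragraphs only covers body paragraphs; text inside tables would need the table/row/cell loop from your code, reading cell.text for each cell.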

FYI, the way you would do this for a simple text file is the following:

import re

# Open the file and read its whole text
with open('test.txt', 'r') as f:
    text = f.read()
# Feed the text into findall(); it returns a list of all the found strings
strings = re.findall(r'your pattern', text)

Your Regex:

Unfortunately, your regex is not quite correct. Logically it makes sense to match only lines that end in a ?, but one of your matches is then "place to pay your rent. Will my financial aid pay for housing?", for example. Only the second part of that line is an actual question. So skip any leading lowercase text and start the match at an uppercase letter. Your regex should be something like:

[A-Z].*\?$
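A quick check of this pattern against a few lines from your sample. One assumption worth stating: when the whole text is passed to findall() in one go, the re.MULTILINE flag is needed so that $ matches at the end of every line rather than only at the end of the text.

```python
import re

# A few lines copied from the sample data in the question.
sample = """\
place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
What are the requirements to receive a room and grant?
How do I pay for my housing?
"""

# With re.MULTILINE, $ matches at the end of each line.
questions = re.findall(r"[A-Z].*\?$", sample, flags=re.MULTILINE)

# Without the flag, $ only matches at the very end of the text.
without_flag = re.findall(r"[A-Z].*\?$", sample)

for question in questions:
    print(question)
```

The pattern starts at the first uppercase letter on a line, so a line that holds a capitalised statement followed by a question would still be matched as a whole.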
