
How to remove a custom word pattern from a text using NLTK with Python

I am currently working on a project that analyzes the quality of examination paper questions. I am using Python 3.4 with NLTK.
First, I want to extract each question separately from the text. The question paper format is given below.

 (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........

Now I want to extract the questions one by one without the question number (the question number format is always the same as shown above). My result should look something like this.

 What is web 3.0?
 Explain about blogs.
 What is mean by semantic web?

How can I tackle this problem with Python 3.4 and NLTK?
Thank you

You'll probably need to detect lines containing a question, then extract the question and drop the question number. The regexp for detecting a question label is

qnum_pattern = r"^\s*\(Q\d+\)\.\s+"

You can use it to pull out the questions like this:

import re

questions = [re.sub(qnum_pattern, "", line) for line in text
             if re.search(qnum_pattern, line)]

Obviously, text must be a list of lines or a file open for reading.
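
Putting the pieces together, a minimal self-contained sketch might look like the following. Only the pattern comes from this answer; the sample text and the helper name extract_questions are assumptions made for illustration.

```python
import re

# Pattern from the answer: optional leading whitespace, "(Q<digits>).", then whitespace.
QNUM_PATTERN = re.compile(r"^\s*\(Q\d+\)\.\s+")

def extract_questions(lines):
    """Return the question text of every line that starts with a (Qn). label."""
    questions = []
    for line in lines:
        if QNUM_PATTERN.search(line):
            # Drop the label, then trim any trailing newline or spaces.
            questions.append(QNUM_PATTERN.sub("", line).strip())
    return questions

# Hypothetical sample input, modelled on the format in the question.
sample = """ (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........"""

for question in extract_questions(sample.splitlines()):
    print(question)
```

Lines without a question label, such as the trailing "and so on" line, are skipped rather than passed through.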

But if you had no idea how to approach this, you have your work cut out for you with the rest of the assignment. I recommend spending some time on the Python tutorial or other introductory materials.

If every sentence starts with this pattern, what you are asking for is easy to parse. You can use split to remove the prefix:

sentences = [ "(Q1). What is web 3.0?",
              "(Q2). Explain about blogs.",
              "(Q3). What is mean by semantic web?"]
for sen in sentences:
    print(sen.split('). ', 1)[1])

This will print:

What is web 3.0?
Explain about blogs.
What is mean by semantic web?
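
One caveat with indexing the result of split is that a line without the "). " separator raises an IndexError. A hedged variant, assuming you would rather pass such lines through unchanged, can use str.partition instead; the helper name strip_label and the third sample sentence are hypothetical.

```python
def strip_label(sentence, separator="). "):
    """Remove everything up to and including the first separator, if present."""
    head, sep, tail = sentence.partition(separator)
    # partition returns an empty sep when the separator is missing,
    # so an unlabeled line is returned unchanged instead of raising IndexError.
    return tail if sep else sentence

sentences = ["(Q1). What is web 3.0?",
             "(Q2). Explain about blogs.",
             "No label here"]
for sen in sentences:
    print(strip_label(sen))
```

This still assumes that the first "). " in a line belongs to the question label, so an unlabeled line that happens to contain "). " would be truncated.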

If the (QX) label is always separated from the question text by a space, you can do this:

>>> text = """(Q1). What is web 3.0?
...  (Q2). Explain about blogs.
...  (Q3). What is mean by semantic web?"""
>>> for line in text.split('\n'):
...     print(line.strip().partition(' ')[2])
... 
What is web 3.0?
Explain about blogs.
What is mean by semantic web?
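
Note that partition(' ') removes the first word of any line, whether or not it is a question label. As a sketch of an alternative for the same multi-line string, assuming the labels always look like (Q1)., the whole text can be cleaned in one pass with re.sub and the re.MULTILINE flag; the names label and cleaned are illustrative.

```python
import re

text = """(Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?"""

# re.MULTILINE makes ^ match at the start of every line, not just the string.
# [ \t]* is used instead of \s* so the pattern never swallows a newline.
label = re.compile(r"^[ \t]*\(Q\d+\)\.[ \t]+", re.MULTILINE)

cleaned = label.sub("", text)
print(cleaned)
```

Lines that do not start with a label are left untouched by this approach.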


 