How to remove custom word patterns from text using NLTK and Python

Question

I am currently working on a project that analyzes the quality of examination paper questions. I am using Python 3.4 with NLTK.
First I want to extract each question separately from the text. The question paper format is given below.

 (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........

Now I want to extract the questions one by one without the question number (the question number format is always the same as shown above). My result should look something like this.

 What is web 3.0?
 Explain about blogs.
 What is mean by semantic web?

How can I tackle this problem with Python 3.4 and NLTK?
Thank you

Answer 1

You'll probably need to detect lines containing a question, then extract the question and drop the question number. The regexp for detecting a question label is:

qnum_pattern = r"^\s*\(Q\d+\)\.\s+"

You can use it to pull out the questions like this:

import re

questions = [re.sub(qnum_pattern, "", line) for line in text
             if re.search(qnum_pattern, line)]

Obviously, text must be a list of lines or a file open for reading.
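For reference, here is a minimal self-contained sketch of the approach above, assuming the question paper is available as a single string. The function name `extract_questions` and the `sample` text are illustrative and not part of the original answer; only the standard `re` module is used.

```python
import re

# Matches a leading label such as "(Q1). " or "  (Q12). " at the start of a line.
QNUM_PATTERN = re.compile(r"^\s*\(Q\d+\)\.\s+")

def extract_questions(lines):
    """Return each labelled line with its "(Qn)." prefix removed.

    Lines that do not start with a question label are skipped.
    """
    questions = []
    for line in lines:
        if QNUM_PATTERN.search(line):
            questions.append(QNUM_PATTERN.sub("", line).strip())
    return questions

# Hypothetical sample input, following the format shown in the question.
sample = """ (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........"""

questions = extract_questions(sample.splitlines())
print(questions)
```

Compiling the pattern once avoids repeating the raw string, and unlabelled lines such as "and so on" are dropped rather than passed through.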

But if you had no idea how to approach this, you have your work cut out for you with the rest of the assignment. I recommend spending some time on the Python tutorial or other introductory materials.

Answer 2

If every sentence starts with this pattern, what you ask for is easy to parse: you can use split to remove the prefix.

sentences = [ "(Q1). What is web 3.0?",
              "(Q2). Explain about blogs.",
              "(Q3). What is mean by semantic web?"]
for sen in sentences:
    print(sen.split('). ', 1)[1])

This will print:

What is web 3.0?
Explain about blogs.
What is mean by semantic web?
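One caveat with the split approach: a line that does not contain "). " yields a one-element list, so indexing with [1] raises IndexError. A hedged variant, assuming you would rather keep such lines unchanged, can use str.partition instead. The helper name `strip_label` is illustrative only.

```python
def strip_label(sentence):
    """Drop everything up to and including the first "). " separator.

    If the separator is missing, return the sentence unchanged
    instead of raising an IndexError.
    """
    head, sep, tail = sentence.partition("). ")
    return tail if sep else sentence

sentences = ["(Q1). What is web 3.0?",
             "(Q2). Explain about blogs.",
             "and so on ........"]

for sen in sentences:
    print(strip_label(sen))
```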

Answer 3

If the (QX) label is always separated from the text by a space, you can do this:

>>> text = """(Q1). What is web 3.0?
...  (Q2). Explain about blogs.
...  (Q3). What is mean by semantic web?"""
>>> for line in text.split('\n'):
...     print(line.strip().partition(' ')[2])
... 
What is web 3.0?
Explain about blogs.
What is mean by semantic web?
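Note that partition(' ') removes the first word of any line, labelled or not, so a trailing line like "and so on ........" would come out as "so on ........". If the text is held as one multi-line string and may contain unlabelled lines, one possible alternative is a single re.findall call with the re.MULTILINE flag. This is a sketch that reuses the label format from the question; the variable names are illustrative.

```python
import re

text = """(Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........"""

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# so the capture group picks up the text that follows each "(Qn)." label.
question_re = re.compile(r"^\s*\(Q\d+\)\.\s+(.*)$", re.MULTILINE)

found = question_re.findall(text)
for question in found:
    print(question)
```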

3 answers

Answer 1: score 2, accepted, 2015-06-07 13:25:56
Answer 2: score 1, 2015-06-07 13:17:40
Answer 3: score 1, 2015-06-07 13:38:40
