How to remove custom word patterns from text using NLTK and Python

Question

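I am currently working on a project that analyzes the quality of exam paper questions. I am using Python 3.4 with NLTK.
First I want to pull each question out of the text separately. The exam paper format is shown below.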

 (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........

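Now I want to extract the questions one by one, without the question number (the question number format is always the same as shown above), so my result should look like this.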

 What is web 3.0?
 Explain about blogs.
 What is mean by semantic web?

So how can I solve this using Python 3.4 with NLTK?
Thanks

Answer 1

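You probably need to detect the lines that contain a question, then extract the question and drop the question number. The regexp for detecting the question label is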

qnum_pattern = r"^\s*\(Q\d+\)\.\s+"

You can use it to pull out the questions like this:

import re

questions = [re.sub(qnum_pattern, "", line) for line in text
             if re.search(qnum_pattern, line)]

Obviously, text must be a list of lines or a file opened for reading.
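To make the steps above concrete, here is a minimal, self-contained sketch that applies the same pattern to the sample paper from the question. The names `QNUM_PATTERN`, `extract_questions`, and `paper` are made up for this illustration; lines that do not start with a `(Qn).` label are simply skipped.

```python
import re

# Same pattern as above, compiled once: optional leading whitespace,
# "(Q<digits>).", then at least one whitespace character.
QNUM_PATTERN = re.compile(r"^\s*\(Q\d+\)\.\s+")

def extract_questions(lines):
    """Return the question text of every line that starts with a (Qn). label."""
    questions = []
    for line in lines:
        if QNUM_PATTERN.search(line):
            # Drop the label and any trailing whitespace or newline.
            questions.append(QNUM_PATTERN.sub("", line).rstrip())
    return questions

# Sample input copied from the question; in practice this could be an open file.
paper = """ (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........
"""

questions = extract_questions(paper.splitlines())
print(questions)
```

Because the function only iterates over its argument, passing an open file object instead of `paper.splitlines()` should work the same way.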

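That said, if you did not know how to solve this part, you have your work cut out for you with the rest of the project. I recommend spending some time with the Python tutorial or other introductory material.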

Answer 2

What you are asking for is easy to parse if every sentence begins with this pattern; you can use split to remove the prefix:

sentences = [ "(Q1). What is web 3.0?",
              "(Q2). Explain about blogs.",
              "(Q3). What is mean by semantic web?"]
for sen in sentences:
    print(sen.split('). ', 1)[1])

This will print:

What is web 3.0?
Explain about blogs.
What is mean by semantic web?
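One caveat with the split approach: indexing `[1]` raises an IndexError on any line that does not contain `'). '`. A slightly more defensive sketch, assuming you would rather leave such lines unchanged, could use `str.partition` and check that the prefix really looks like a question label. The helper name `strip_label` is hypothetical.

```python
def strip_label(sentence, sep="). "):
    """Remove a leading '(Qn). ' label; return the sentence unchanged otherwise."""
    head, found, tail = sentence.partition(sep)
    label = head.strip()
    # Only treat the prefix as a label if it looks like "(Q" followed by digits.
    if found and label.startswith("(Q") and label[2:].isdigit():
        return tail
    return sentence

sentences = ["(Q1). What is web 3.0?",
             "(Q2). Explain about blogs.",
             "Explain (see notes). Then summarize."]
results = [strip_label(sen) for sen in sentences]
for res in results:
    print(res)
```

The third sentence contains `'). '` but no question label, so it passes through untouched instead of losing its first half.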

Answer 3

If (QX) is always separated from the text that follows it by a space, you can do the following:

>>> text = """(Q1). What is web 3.0?
...  (Q2). Explain about blogs.
...  (Q3). What is mean by semantic web?"""
>>> for line in text.split('\n'):
...     print(line.strip().partition(' ')[2])
... 
What is web 3.0?
Explain about blogs.
What is mean by semantic web?
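Note that `partition(' ')` drops the first word of any line, whether or not it is a label. If the text is held in a single string, as in this answer, one alternative sketch is to remove only real labels in one pass with `re.sub` and the `re.MULTILINE` flag. The pattern below is an adaptation for illustration, not part of the original answer; it uses `[ \t]` instead of `\s` so that a match cannot swallow newlines.

```python
import re

text = """(Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?"""

# With re.MULTILINE, ^ matches at the start of every line, not just the string.
label = re.compile(r"^[ \t]*\(Q\d+\)\.[ \t]+", flags=re.MULTILINE)

cleaned = label.sub("", text)
print(cleaned)
```

Lines without a `(Qn).` label are left as they are, and `cleaned.split('\n')` then gives the individual questions.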

3 answers

Answer 1: score 2, accepted, 2015-06-07 13:25:56
Answer 2: score 1, 2015-06-07 13:17:40
Answer 3: score 1, 2015-06-07 13:38:40
