Extracting sentences from a POS-tagged corpus by a specific word and tag combination

Question

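I am playing with the Brown corpus, specifically the tagged sentences in the "news" category. I found that "to" is the word with the most ambiguous tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I am trying to write code that prints one sentence from the corpus for each tag associated with "to". The sentences do not need to be "cleaned" of their tags; they only need to contain "to" with each of the associated POS tags.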

import nltk

brown_sents = nltk.corpus.brown.tagged_sents(categories="news")
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == "IN"):
            print(sent)

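I tried the code above with just one of the POS tags to see whether it works, but it prints every matching sentence. I need it to print only the first sentence that matches the word and tag, and then stop. I tried this: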

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'IN'):
            print(sent)
        if (word != 'to' and tag != 'IN'):
            break

This worked with that POS tag, because it is the first tag associated with "to", but if I use:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print(sent)
        if (word != 'to' and tag != 'TO-HL'):
            break

It returns nothing. I think I am very close. Would you mind helping?

Answer 1

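You can keep adding to your current code, but it does not account for the following:

What happens if "to" occurs more than once in a sentence, with the same or different POS tags?
If "to" occurs twice in a sentence with the same POS tag, do you want the sentence printed twice?
What happens if "to" is the first word of the sentence and is capitalized?

If you want to stick with your code, try the following: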

from nltk.corpus import brown

brown_sents = brown.tagged_sents(categories="news")

def to_pos_sent(pos):
    for sent in brown_sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent

for sent in to_pos_sent('TO'):
    print(sent)

for sent in to_pos_sent('IN'):
    print(sent)
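
Since the question asks for only the first matching sentence per tag, a generator like the one above can be combined with the built-in next(). The following is a minimal sketch, not the answer's original code: toy_sents is a made-up stand-in for brown.tagged_sents(categories="news"), and sents_with and first_sent_with are hypothetical helper names.

```python
# Toy stand-in for brown.tagged_sents(categories="news"); the sentences
# and tags below are made up for illustration.
toy_sents = [
    [("He", "PPS"), ("went", "VBD"), ("to", "IN"), ("town", "NN")],
    [("She", "PPS"), ("wants", "VBZ"), ("to", "TO"), ("go", "VB")],
    [("Back", "RB"), ("to", "IN"), ("work", "NN")],
]

def sents_with(tagged_sents, target_word, target_tag):
    """Yield each sentence containing target_word tagged target_tag, once."""
    for sent in tagged_sents:
        if any(word == target_word and tag == target_tag for word, tag in sent):
            yield sent

def first_sent_with(tagged_sents, target_word, target_tag):
    """Return the first matching sentence, or None if there is none."""
    return next(sents_with(tagged_sents, target_word, target_tag), None)

for tag in ("TO", "IN", "TO-HL"):
    print(tag, first_sent_with(toy_sents, "to", tag))
```

Because next() stops the generator after the first hit, the rest of the corpus is never scanned for that tag, and a tag with no match gives None instead of raising an error.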

I suggest storing the sentences in a defaultdict(list), so that you can retrieve them at any time.

from nltk.corpus import brown
from collections import Counter, defaultdict

sents_with_to = defaultdict(list)

to_counts = Counter()

for i, sent in enumerate(brown.tagged_sents(categories='news')):
    # Check if 'to' is in sentence.
    uniq_words = dict(sent)
    if 'to' in uniq_words or 'To' in uniq_words:
        # Iterate through the sentence to find 'to'
        for word, pos in sent:
            if word.lower()=='to':
                # Record the tagged sentence under this POS tag
                sents_with_to[pos].append(sent)
                to_counts[pos]+=1


for pos in sents_with_to:
    for sent in sents_with_to[pos]:
        print(pos, sent)

To access the sentences for a specific POS tag, do the following:

for sent in sents_with_to['TO']:
    print(sent)
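
If all you need is one example sentence for each tag that "to" carries, a single pass with dict.setdefault may be enough. This is only a sketch under assumptions: toy_sents is invented data standing in for the Brown "news" sentences, and first_example_per_tag is a hypothetical helper, not part of NLTK.

```python
def first_example_per_tag(tagged_sents, target_word):
    """Map each tag seen on target_word to the first sentence showing it."""
    examples = {}
    for sent in tagged_sents:
        for word, tag in sent:
            # Lowercase the token so a sentence-initial "To" also counts.
            if word.lower() == target_word:
                # setdefault keeps the earliest sentence for each tag.
                examples.setdefault(tag, sent)
    return examples

# Made-up tagged sentences for illustration only.
toy_sents = [
    [("To", "IN-HL"), ("the", "AT"), ("editor", "NN")],
    [("He", "PPS"), ("went", "VBD"), ("to", "IN"), ("town", "NN"),
     ("to", "TO"), ("shop", "VB")],
    [("Back", "RB"), ("to", "IN"), ("work", "NN")],
]

examples = first_example_per_tag(toy_sents, "to")
for tag, sent in examples.items():
    print(tag, sent)
```

With the real corpus you would pass brown.tagged_sents(categories="news") instead of toy_sents.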

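You will notice that if "to" occurs twice in a sentence with the same POS tag, the sentence is recorded twice in sents_with_to[pos]. If you want to remove the duplicates, try: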

sents_with_to_and_TO = set(" ".join("#".join((word, pos)) for word, pos in sent) for sent in sents_with_to['TO'])
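
The set above flattens each sentence into a "word#pos" string and loses the original order. If you would rather keep the tagged sentences themselves, in first-seen order, something like the following sketch could work. It assumes each token is a (word, tag) tuple, as in NLTK's tagged sentences; unique_sents is a hypothetical helper name and the data is made up.

```python
def unique_sents(tagged_sents):
    """Drop repeated sentences while keeping first-seen order."""
    seen = set()
    unique = []
    for sent in tagged_sents:
        # Lists are unhashable, so use a tuple of (word, tag) pairs as the key.
        key = tuple(sent)
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique

# A sentence with "to"/TO twice is recorded twice by the loop in this answer.
twice = [("to", "TO"), ("go", "VB"), ("or", "CC"), ("to", "TO"), ("stay", "VB")]
recorded = [twice, twice, [("to", "TO"), ("eat", "VB")]]

deduped = unique_sents(recorded)
print(len(recorded), len(deduped))
```

Note that this compares sentences by content, so two distinct corpus sentences with identical words and tags would also be merged, just as with the string-based set.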

Answer 2

As for why this does not work:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print(sent)
        if (word != 'to' and tag != 'TO-HL'):
            break

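Before the explanation: your code is not really close to the output you want yet, because your if statements do not do what you need.

First, you need to understand what the multiple conditions (i.e. the "if" statements) are doing.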

# Loop through the sentences
for sent in brown_sents:
    # Loop through each word with its POS tag
    for word, tag in sent:
        # Check whether this word is 'to' and its tag is 'TO-HL'
        if word == 'to' and tag == 'TO-HL':
            print(sent)  # If the condition is true, print the sentence
        # After the first if, the second if is checked in the same
        # iteration, on the same (word, tag) pair: if the word is not
        # 'to' and the tag is not 'TO-HL', break out of the sentence.
        if word != 'to' and tag != 'TO-HL':
            break

Now let's start with some basic if-else statements:

>>> from nltk.corpus import brown
>>> first_sent = brown.tagged_sents()[0]
>>> first_sent
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> for word, pos in first_sent:
...     if word != 'to' and pos != 'TO-HL':
...             break
...     else:
...             print('say hi')
... 
>>>

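In the example above, we loop through each (word, POS) pair in the sentence. For each pair, the if condition checks whether the word is not 'to' and the POS is not 'TO-HL'. That is already true for the first pair, so the loop breaks and never gets to say hi.

So if you keep the if conditions in your code as they are, the loop will always break instead of continuing, because "to" is not the first word of the sentence and the first word's POS tag is not the one you are matching either.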
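
To make the early break concrete, here is a small sketch on a made-up sentence. buggy_match only mimics the break logic of the question's inner loop, and fixed_match shows one possible alternative; both function names are hypothetical.

```python
# Made-up tagged sentence for illustration.
sent = [("The", "AT"), ("jury", "NN"), ("wants", "VBZ"), ("to", "TO"), ("go", "VB")]

def buggy_match(sent, target_word, target_tag):
    """Mimic the question's loop: True only if a match is reached before a break."""
    for word, tag in sent:
        if word == target_word and tag == target_tag:
            return True
        if word != target_word and tag != target_tag:
            break  # fires on the first non-matching pair, here ('The', 'AT')
    return False

def fixed_match(sent, target_word, target_tag):
    """Scan the whole sentence; any() stops as soon as a match is found."""
    return any(word == target_word and tag == target_tag for word, tag in sent)

print(buggy_match(sent, "to", "TO"))  # False
print(fixed_match(sent, "to", "TO"))  # True
```

The sentence clearly contains "to" tagged TO, yet the buggy version gives up on the very first token.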

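In effect, your if condition is trying to check whether every word is "to" and whether its POS tag is 'TO-HL'.

What you want to do is check:

(1) whether "to" is in the sentence, rather than whether every word is "to", and then
(2) whether the "to" in the sentence carries the POS tag you are looking for

So the if condition you need for condition (1) is: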

>>> from nltk.corpus import brown
>>> three_sents = brown.tagged_sents()[:3]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             print(sent)
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]

Now you know that if 'to' in dict(sent) checks whether "to" is in the sentence.

Then, to check condition (2):

>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             if dict(sent)['to'] == 'TO':
...                     print(sent)
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             if dict(sent)['to'] == 'TO-HL':
...                     print(sent)
... 
>>>

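Now you can see that if dict(sent)['to'] == 'TO-HL', placed after the if 'to' in dict(sent) check, is the condition that enforces the POS restriction.

But you will realize that if a sentence contains two occurrences of "to", dict(sent)['to'] captures only the POS of the last one. That is why you need the defaultdict(list) suggested in the previous answer.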
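
The overwrite is easy to see on a made-up sentence where "to" appears with two different tags. This is just an illustrative sketch; the sentence is invented and tags_by_word is a hypothetical name.

```python
from collections import defaultdict

# Made-up sentence in which "to" appears with two different tags.
sent = [("He", "PPS"), ("went", "VBD"), ("to", "IN"), ("town", "NN"),
        ("to", "TO"), ("shop", "VB")]

# dict() keeps only the last value seen for a repeated key.
last_tag_only = dict(sent)
print(last_tag_only["to"])  # TO

# Collecting the tags in a list per word keeps every occurrence.
tags_by_word = defaultdict(list)
for word, tag in sent:
    tags_by_word[word].append(tag)
print(tags_by_word["to"])  # ['IN', 'TO']
```

A check such as dict(sent)['to'] == 'IN' would therefore miss this sentence, while 'IN' in tags_by_word['to'] would find it.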

There really is no clean way to perform this check; the most efficient approach is the one in the previous answer, sigh.

Answer 1
2 2014-11-20 22:20:45

Answer 2
1 2014-11-21 16:44:47
