Extracting sentences from a POS-tagged corpus that contain a specific word and tag combination

Question

I'm playing with the Brown corpus, specifically the tagged sentences in the "news" category. I've found that "to" is the word with the most distinct tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I'm trying to write code that prints one sentence from the corpus for each tag associated with "to". The sentences do not need to be "cleaned" of their tags; each one just needs to contain "to" with one of the associated POS tags.

import nltk

brown_sents = nltk.corpus.brown.tagged_sents(categories="news")
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == "IN"):
            print(sent)

I tried the above code with just one of the POS tags to see if it worked, but it prints every matching sentence. I need it to print just the first sentence that matches the word and tag, and then stop. I tried this:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'IN'):
            print(sent)
        if (word != 'to' and tag != 'IN'):
            break

This works for this POS tag because it's the first one related to "to", but if I use:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print(sent)
        if (word != 'to' and tag != 'TO-HL'):
            break

It returns nothing. I think I am SO close. Care to help?

Answer 1

You can keep building on your current code, but it doesn't account for the following:

What happens if 'to' occurs more than once in a sentence, with the same or different POS tags?
Do you want the sentence printed twice if 'to' appears in it twice with the same POS tag?
What happens if 'to' is the first word of the sentence and is capitalized?

If you want to stick with your code, try this:

from nltk.corpus import brown

brown_sents = brown.tagged_sents(categories="news")

def to_pos_sent(pos):
    # Yield a sentence each time 'to' appears in it with the requested tag.
    for sent in brown_sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent

for sent in to_pos_sent('TO'):
    print(sent)

for sent in to_pos_sent('IN'):
    print(sent)
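
Since the question asks for only the first matching sentence, one option is to take a single item from such a generator with next(). The following is a minimal sketch, not the answerer's code: it assumes a variant of to_pos_sent that takes the sentences as a parameter, and it runs on a small hypothetical toy_sents list shaped like brown.tagged_sents() so that it works without the corpus installed.

```python
# Hypothetical toy data shaped like brown.tagged_sents():
# each sentence is a list of (word, tag) pairs.
toy_sents = [
    [("He", "PPS"), ("went", "VBD"), ("to", "IN"), ("town", "NN"), (".", ".")],
    [("She", "PPS"), ("wants", "VBZ"), ("to", "TO"), ("go", "VB"), (".", ".")],
    [("Back", "RB"), ("to", "IN"), ("work", "NN"), (".", ".")],
]

def to_pos_sent(sents, pos, target="to"):
    """Yield each sentence that contains `target` tagged as `pos`."""
    for sent in sents:
        # any() stops at the first matching pair, so a sentence is
        # yielded at most once even if `target` occurs several times.
        if any(word == target and tag == pos for word, tag in sent):
            yield sent

# next() pulls only the first match; the default None covers the
# case where no sentence carries that tag.
first_in = next(to_pos_sent(toy_sents, "IN"), None)
first_to_hl = next(to_pos_sent(toy_sents, "TO-HL"), None)

print(first_in)
print(first_to_hl)
```

With NLTK and the Brown corpus available, passing brown.tagged_sents(categories="news") in place of toy_sents should behave the same way.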

I suggest storing the sentences in a defaultdict(list), so you can retrieve them at any time.

from nltk.corpus import brown
from collections import Counter, defaultdict

sents_with_to = defaultdict(list)

to_counts = Counter()

for sent in brown.tagged_sents(categories='news'):
    # Check if 'to' is in sentence.
    uniq_words = dict(sent)
    if 'to' in uniq_words or 'To' in uniq_words:
        # Iterate through the sentence to find 'to'
        for word, pos in sent:
            if word.lower()=='to':
                # Store the tagged sentence under this POS tag
                sents_with_to[pos].append(sent)
                to_counts[pos]+=1


for pos in sents_with_to:
    for sent in sents_with_to[pos]:
        print(pos, sent)

To access the sentences for a specific POS tag:

for sent in sents_with_to['TO']:
    print(sent)

You'll notice that if 'to' appears twice in a sentence with the same POS tag, the sentence is recorded twice in sents_with_to[pos]. If you want to remove the duplicates, try:

sents_with_to_and_TO = set(" ".join("#".join((word, pos)) for word, pos in sent) for sent in sents_with_to['TO'])
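
If the sentences should stay as lists of (word, tag) pairs and keep their original order, another way to drop the duplicates is to track tuples in a set. This is a sketch under assumptions: unique_sents is a hypothetical helper name, and the sample list stands in for sents_with_to['TO'].

```python
def unique_sents(sents):
    """Return the sentences with duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for sent in sents:
        # A list cannot go into a set, but a tuple of (word, tag) pairs can.
        key = tuple(sent)
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique

# Hypothetical stand-in for sents_with_to['TO']: the first sentence was
# recorded twice because it contains 'to' tagged TO two times.
double_to = [("to", "TO"), ("be", "BE"), ("or", "CC"), ("not", "*"), ("to", "TO"), ("be", "BE")]
single_to = [("I", "PPSS"), ("want", "VB"), ("to", "TO"), ("go", "VB")]
recorded = [double_to, double_to, single_to]

deduped = unique_sents(recorded)
print(len(recorded), len(deduped))
```

Unlike the set of joined strings, this keeps the tagged structure, so the result can still be iterated as (word, tag) pairs.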

Answer 2

As for why this isn't working:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print(sent)
        if (word != 'to' and tag != 'TO-HL'):
            break

Before the explanation: your code is not really close to the output you want, because your if statements are not doing what you need.

First you need to understand what the multiple conditions (i.e. the 'if' statements) are doing.

# Loop through each sentence
for sent in brown_sents:
  # Loop through each word with its POS
  for word, tag in sent:
    # For each (word, tag) pair, check whether it is 'to' tagged 'TO-HL':
    if word == 'to' and tag == 'TO-HL':
      print(sent) # If the condition is true, print sent
    # After checking the first if, you continue to check the second if:
    # if word is not 'to' and tag is not 'TO-HL',
    # you break out of the sentence. Note that you are still
    # in the same iteration as the previous condition.
    if word != 'to' and tag != 'TO-HL':
      break

Now let's start with a basic if-else statement:

>>> from nltk.corpus import brown
>>> first_sent = brown.tagged_sents()[0]
>>> first_sent
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> for word, pos in first_sent:
...     if word != 'to' and pos != 'TO-HL':
...             break
...     else:
...             print('say hi')
... 
>>>

In the example above we loop through each word+POS pair in the sentence, and at EVERY pair the if condition checks whether the word is not 'to' and the POS is not 'TO-HL'. If that is the case, it breaks and never says hi to you.

So if you keep your code with these conditions, the loop will break at the first pair that is neither the word 'to' nor tagged with the POS you want, which is usually the very first word of the sentence.

In fact, your if conditions end up checking whether EVERY word is 'to' and whether its POS tag is 'TO-HL'.
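
To make the behaviour of the two conditions concrete, here is a small sketch that evaluates them on a few hypothetical (word, tag) pairs. The function names are made up for illustration. It also shows that the second condition from the question is not the logical negation of the first: by De Morgan's law, the negation of "A and B" is "not A or not B".

```python
def is_match(word, tag):
    # The first condition from the question.
    return word == "to" and tag == "TO-HL"

def question_break_test(word, tag):
    # The second condition from the question: true only when BOTH parts differ.
    return word != "to" and tag != "TO-HL"

def true_negation(word, tag):
    # De Morgan: not (A and B) is equivalent to (not A) or (not B).
    return word != "to" or tag != "TO-HL"

# Hypothetical pairs covering the four combinations.
pairs = [("to", "TO-HL"), ("to", "IN"), ("To", "TO-HL"), ("The", "AT")]

for word, tag in pairs:
    print("{0}/{1}: match={2} break={3} negation={4}".format(
        word, tag,
        is_match(word, tag),
        question_break_test(word, tag),
        true_negation(word, tag),
    ))
```

Only the last pair triggers the break in the question's code, but a pair like that is usually the first one in a sentence, so the inner loop tends to end before it ever reaches 'to'. Note that swapping in the true negation would not help either; the break needs to come after a successful match, not after a failed one.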

What you want to check is:

1. whether 'to' is in the sentence (rather than whether every word is 'to'), and thereafter
2. whether the 'to' in the sentence holds the POS tag you're looking for

So the if condition you need for condition (1) is:

>>> from nltk.corpus import brown
>>> three_sents = brown.tagged_sents()[:3]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             print(sent)
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]

Now you know that the condition 'to' in dict(sent) checks whether 'to' is in the sentence.

Then, to check for condition (2):

>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             if dict(sent)['to'] == 'TO':
...                     print(sent)
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             if dict(sent)['to'] == 'TO-HL':
...                     print(sent)
... 
>>>

Now you see that checking dict(sent)['to'] == 'TO-HL' AFTER you have checked 'to' in dict(sent) applies the POS restriction.

But notice that if there are two instances of 'to' in the sentence, dict(sent)['to'] only captures the POS of the last one. That is why you need the defaultdict(list) suggested in the previous answer.
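
A short illustration of that limitation, using a hypothetical sentence in which 'to' appears with two different tags. The variable names are made up for this sketch.

```python
from collections import defaultdict

# Hypothetical tagged sentence: 'to' occurs once as TO and once as IN.
sent = [
    ("She", "PPS"), ("wants", "VBZ"), ("to", "TO"), ("go", "VB"),
    ("to", "IN"), ("Rome", "NP"), (".", "."),
]

# dict() keeps one value per key, so the later ('to', 'IN') pair
# overwrites the earlier ('to', 'TO') pair.
last_tag_only = dict(sent)["to"]

# A defaultdict(list) keeps every tag seen for each word instead.
tags_by_word = defaultdict(list)
for word, tag in sent:
    tags_by_word[word.lower()].append(tag)

print(last_tag_only)       # IN
print(tags_by_word["to"])  # ['TO', 'IN']
```

So a check such as dict(sent)['to'] == 'TO' would miss this sentence, while the list-based lookup still finds both tags.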

There is really no clean way to perform the checks, and the most efficient way is the one described in the previous answer, sigh.
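
For the original goal of one example sentence per tag, a single pass that remembers the first sentence seen for each tag may be enough. The sketch below is one possible approach rather than the method from either answer: first_sentence_per_tag and its wanted parameter are hypothetical names, and the toy data stands in for brown.tagged_sents(categories="news").

```python
def first_sentence_per_tag(sents, target="to", wanted=None):
    """Map each tag of `target` to the first sentence in which it occurs.

    If `wanted` (a set of tags) is given, only those tags are collected
    and the scan stops as soon as all of them have been found.
    """
    found = {}
    for sent in sents:
        for word, tag in sent:
            # Skip other words, and tags that already have an example.
            if word.lower() != target or tag in found:
                continue
            if wanted is None or tag in wanted:
                found[tag] = sent
        if wanted is not None and len(found) == len(wanted):
            break
    return found

# Hypothetical toy data shaped like brown.tagged_sents().
toy_sents = [
    [("He", "PPS"), ("went", "VBD"), ("to", "IN"), ("town", "NN"), (".", ".")],
    [("She", "PPS"), ("wants", "VBZ"), ("to", "TO"), ("go", "VB"), ("to", "IN"), ("Rome", "NP"), (".", ".")],
    [("To", "TO"), ("err", "VB"), ("is", "BEZ"), ("human", "JJ"), (".", ".")],
]

examples = first_sentence_per_tag(toy_sents)
for tag in sorted(examples):
    print(tag, examples[tag])
```

Lowercasing the word also catches a sentence-initial 'To', which the earlier answer raises as an open question; whether that is desirable depends on what you want to count.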

Answer 1
2 2014-11-20 22:20:45

Answer 2
1 2014-11-21 16:44:47