extracting sentences from pos-tagged corpus with certain word, tag combos

Question

I'm playing with the Brown corpus, specifically the tagged sentences in "news". I've found that "to" is the word with the most ambiguous tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I'm trying to write code that will print one sentence from the corpus for each tag associated with "to". The sentences do not need to be "cleaned" of the tags; each one just needs to contain "to" with one of the associated POS tags.

import nltk

brown_sents = nltk.corpus.brown.tagged_sents(categories="news")
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == "IN"):
            print sent

I tried the above code with just one of the pos-tags to see if it worked, but it prints all the instances of this. I need it to print just the first found sentence that matches the word, tag and then stop. I tried this:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'IN'):
            print sent
        if (word != 'to' and tag != 'IN'):
            break

This works with this pos-tag because it's the first one related to "to", but if I use:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print sent
        if (word != 'to' and tag != 'TO-HL'):
            break

It returns nothing. I think I am SO close -- care to help?

Answer 1

You can keep building on your current code, but it doesn't consider these cases:

What happens if 'to' occurs more than once in the sentence, with the same or different POS tags?
Do you want the sentence printed twice if 'to' with the same POS tag appears twice in it?
What happens if 'to' is the first word of the sentence and is capitalized?

If you want to stick with your code try this:

from nltk.corpus import brown

brown_sents = brown.tagged_sents(categories="news")

def to_pos_sent(pos):
    for sent in brown_sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent

for sent in to_pos_sent('TO'):
    print sent

for sent in to_pos_sent('IN'):
    print sent
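
Since the question asks for only the first matching sentence, one option is to pull a single item from the generator with next(). The sketch below is an illustration, not the code from the thread: sample_sents is a tiny hand-made stand-in for brown.tagged_sents(categories="news"), to_pos_sent takes the sentences as an extra argument, and print() is used so it also runs under Python 3.

```python
# Hypothetical stand-in for brown.tagged_sents(categories="news").
sample_sents = [
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN'), ('.', '.')],
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('leave', 'VB'), ('.', '.')],
    [('They', 'PPSS'), ('hope', 'VB'), ('to', 'TO'), ('win', 'VB'), ('.', '.')],
]

def to_pos_sent(sents, pos):
    """Yield each sentence that contains 'to' tagged as `pos`."""
    for sent in sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent
                break  # one hit is enough; move on to the next sentence

# next() consumes only the first match; the second argument is the
# fallback returned when no sentence matches.
first_to = next(to_pos_sent(sample_sents, 'TO'), None)
first_hl = next(to_pos_sent(sample_sents, 'TO-HL'), None)

print(first_to)
print(first_hl)
```

Because the generator is lazy, the remaining sentences are never scanned once the first match has been found.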

I suggest that you store the sentences in a defaultdict(list); then you can retrieve them at any time.

from nltk.corpus import brown
from collections import Counter, defaultdict

sents_with_to = defaultdict(list)

to_counts = Counter()

for sent in brown.tagged_sents(categories='news'):
    # Quick check: does 'to' (or sentence-initial 'To') occur at all?
    uniq_words = dict(sent)
    if 'to' in uniq_words or 'To' in uniq_words:
        # Iterate through the sentence to find every 'to'
        for word, pos in sent:
            if word.lower() == 'to':
                # Store the tagged sentence under this POS tag
                sents_with_to[pos].append(sent)
                to_counts[pos] += 1


for pos in sents_with_to:
    for sent in sents_with_to[pos]:
        print pos, sent

To access the sentences of a specific POS:

for sent in sents_with_to['TO']:
    print sent

You'll notice that if 'to' with a specific POS tag appears twice in a sentence, the sentence is recorded twice in sents_with_to[pos]. If you want to remove the duplicates, try:

sents_with_to_and_TO = set(" ".join("#".join([word, pos]) for word, pos in sent) for sent in sents_with_to['TO'])
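
An alternative that avoids the duplicates in the first place, and keeps the sentences as lists of (word, tag) tuples rather than flattened strings, is to collect the set of tags that 'to' carries in each sentence before appending. This is only a sketch under assumptions: sample_sents and sents_by_tag are hypothetical names, and the two sentences are made up rather than taken from the Brown corpus.

```python
from collections import defaultdict

# Hypothetical tagged sentences; 'to' appears twice in each.
sample_sents = [
    [('To', 'TO'), ('err', 'VB'), ('is', 'BEZ'), ('to', 'TO'), ('learn', 'VB')],
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN'),
     ('to', 'TO'), ('shop', 'VB')],
]

sents_by_tag = defaultdict(list)

for sent in sample_sents:
    # Distinct tags that 'to' (in any capitalization) has in this sentence.
    tags_of_to = {tag for word, tag in sent if word.lower() == 'to'}
    # Append the sentence once per distinct tag, never once per occurrence.
    for tag in tags_of_to:
        sents_by_tag[tag].append(sent)

for tag in sorted(sents_by_tag):
    print(tag, len(sents_by_tag[tag]))
```

The first sentence contains 'to' tagged TO twice but is stored under 'TO' only once.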

Answer 2

With regards to why this isn't working:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print sent
        if (word != 'to' and tag != 'TO-HL'):
            break

Before the explanation: your code is not really close to the output you want, because your if statements are not doing what you need.

First you need to understand what the multiple conditions (i.e. the two 'if' statements) are doing.

# Loop through each sentence
for sent in brown_sents:
  # Loop through each word with its POS tag
  for word, tag in sent:
    # For each (word, tag) pair, check whether it is 'to' tagged 'TO-HL'
    if word == 'to' and tag == 'TO-HL':
      print sent # If the condition is true, print sent
    # After checking the first if, you continue to check the second if:
    # if the word is not 'to' and the tag is not 'TO-HL',
    # you break out of the sentence. Note that you are still
    # in the same iteration as the previous condition.
    if word != 'to' and tag != 'TO-HL':
      break

Now let's start with a basic if-else statement:

>>> from nltk.corpus import brown
>>> first_sent = brown.tagged_sents()[0]
>>> first_sent
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> for word, pos in first_sent:
...     if word != 'to' and pos != 'TO-HL':
...             break
...     else:
...             print 'say hi'
... 
>>>

In the example above we loop through each (word, POS) pair in the sentence. At EVERY pair, the if condition checks whether the word is not 'to' and the POS tag is not 'TO-HL'; if so, the loop breaks and never says hi to you. Here the very first pair, ('The', 'AT'), already triggers the break.

So if you keep your code with these conditions, you will almost ALWAYS break at the start of the sentence, because the first word is usually not 'to' and its tag is not the one you are matching.

In effect, your second if condition only lets the loop continue while EVERY word so far is either 'to' or tagged 'TO-HL'.
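
To make the behaviour of that second condition concrete, here is a small sketch that evaluates it on a few (word, tag) pairs. The helper name breaks and the list pairs are made up for illustration. By De Morgan's law, word != 'to' and tag != 'TO-HL' is the same as not (word == 'to' or tag == 'TO-HL').

```python
# Hypothetical (word, tag) pairs covering all four combinations.
pairs = [('to', 'TO-HL'), ('to', 'IN'), ('The', 'TO-HL'), ('The', 'AT')]

def breaks(word, tag, wanted='TO-HL'):
    """Return True when the question's second `if` would hit `break`."""
    return word != 'to' and tag != wanted

results = {pair: breaks(*pair) for pair in pairs}

for (word, tag), would_break in results.items():
    # The De Morgan form gives the same answer for every pair.
    assert would_break == (not (word == 'to' or tag == 'TO-HL'))
    print(word, tag, would_break)
```

Only ('The', 'AT') triggers the break, but since most sentences start with such a pair, the inner loop usually stops before it ever reaches a 'to'.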

What you want to check is:

1. whether 'to' is in the sentence at all, rather than whether every word is 'to', and then
2. whether the 'to' in the sentence has the POS tag you're looking for

So the if condition you need for condition (1) is:

>>> from nltk.corpus import brown
>>> three_sents = brown.tagged_sents()[:3]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             print sent
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]

Now you know that if 'to' in dict(sent) checks whether 'to' is in the sentence.

Then to check for condition (2):

>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             if dict(sent)['to'] == 'TO':
...                     print sent
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             if dict(sent)['to'] == 'TO-HL':
...                     print sent
... 
>>>

Now you see that if dict(sent)['to'] == 'TO-HL', placed AFTER the if 'to' in dict(sent) check, is what applies the POS restriction.

But note that if there are two occurrences of 'to' in the sentence, dict(sent)['to'] only captures the POS tag of the final 'to'. That is why you need the defaultdict(list) suggested in the previous answer.
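
A quick way to see this is with a made-up sentence in which 'to' carries two different tags (the sentence below is hypothetical, not from the Brown corpus). Building a dict from (word, tag) pairs keeps only the last value for a repeated key:

```python
# Hypothetical sentence where 'to' is tagged IN first and TO second.
sent = [('went', 'VBD'), ('to', 'IN'), ('town', 'NN'),
        ('to', 'TO'), ('shop', 'VB')]

# dict() overwrites earlier values for a repeated key.
last_tag = dict(sent)['to']

# A list comprehension keeps every tag, in order of appearance.
all_tags = [tag for word, tag in sent if word == 'to']

print(last_tag)   # TO
print(all_tags)   # ['IN', 'TO']
```

So a check such as dict(sent)['to'] == 'IN' would miss this sentence even though it does contain 'to' tagged IN.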

There is really no cleaner way to perform the checks, and the most efficient way is the one described in the previous answer.
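
For the original goal of one sentence per tag, a possible sketch in the same spirit is to keep only the first sentence seen for each tag, using dict.setdefault. The function name first_sentence_per_tag and the sample data are assumptions for illustration; with NLTK installed, brown.tagged_sents(categories="news") could be passed in place of sample_sents.

```python
def first_sentence_per_tag(sents, target='to'):
    """Map each tag of `target` to the first sentence where it occurs."""
    first = {}
    for sent in sents:
        for word, tag in sent:
            if word.lower() == target:
                # setdefault stores the sentence only if the tag is new.
                first.setdefault(tag, sent)
    return first

# Hypothetical stand-in for the tagged news sentences.
sample_sents = [
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('leave', 'VB')],
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN')],
    [('They', 'PPSS'), ('hope', 'VB'), ('to', 'TO'), ('win', 'VB')],
]

examples = first_sentence_per_tag(sample_sents)
for tag, sent in examples.items():
    print(tag, sent)
```

This makes a single pass over the corpus and stores at most one sentence per tag, so memory use stays small even on the full category.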

solution1
2 2014-11-20 22:20:45

solution2
1 2014-11-21 16:44:47
