Extracting sentences from a POS-tagged corpus with certain word, tag combos
I'm playing with the Brown corpus, specifically the tagged sentences in "news." I've found that "to" is the word with the most ambiguous word tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I'm trying to write code that will print one sentence from the corpus for each tag associated with "to". The sentences do not need to be "cleaned" of the tags; they just need to contain "to" with one of the associated POS tags.
import nltk

brown_sents = nltk.corpus.brown.tagged_sents(categories="news")
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == "IN"):
            print(sent)
I tried the above code with just one of the POS tags to see if it worked, but it prints every matching sentence. I need it to print just the first sentence found that matches the word and tag, and then stop. I tried this:
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'IN'):
            print(sent)
        if (word != 'to' and tag != 'IN'):
            break
This works with this POS tag because it's the first one related to "to", but if I use:
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print(sent)
        if (word != 'to' and tag != 'TO-HL'):
            break
It returns nothing. I think I am SO close -- care to help?
You can continue to add to your current code but your code didn't consider these things:
If you want to stick with your code, try this:
from nltk.corpus import brown

brown_sents = brown.tagged_sents(categories="news")

def to_pos_sent(pos):
    # Yield every sentence in which 'to' carries the given POS tag.
    for sent in brown_sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent

for sent in to_pos_sent('TO'):
    print(sent)

for sent in to_pos_sent('IN'):
    print(sent)
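If only the first matching sentence per tag is wanted, one option is to record a sentence the first time each tag is seen and ignore later ones. The following is a minimal sketch rather than the answer's own method: the helper name first_sent_per_tag and the toy sentences are assumptions made for illustration, and with the real corpus one would pass brown.tagged_sents(categories='news') instead.

```python
def first_sent_per_tag(tagged_sents, target='to'):
    """Map each tag seen on `target` to the first sentence containing it."""
    first = {}
    for sent in tagged_sents:
        for word, tag in sent:
            # lower() also matches a sentence-initial 'To'.
            # Record a sentence only the first time this tag shows up.
            if word.lower() == target and tag not in first:
                first[tag] = sent
    return first


# Hypothetical toy data standing in for the Brown corpus.
toy_sents = [
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN')],
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')],
    [('Back', 'RB'), ('to', 'IN'), ('work', 'NN')],
]

for tag, sent in first_sent_per_tag(toy_sents).items():
    print(tag, sent)
```

With the generator above, next(to_pos_sent('IN'), None) would likewise return just the first match for a single tag, or None if there is none.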
I suggest that you store the sentences in a defaultdict(list); then you can retrieve them anytime.
from nltk.corpus import brown
from collections import Counter, defaultdict

sents_with_to = defaultdict(list)
to_counts = Counter()

for sent in brown.tagged_sents(categories='news'):
    # Check if 'to' is in sentence.
    uniq_words = dict(sent)
    if 'to' in uniq_words or 'To' in uniq_words:
        # Iterate through the sentence to find 'to'
        for word, pos in sent:
            if word.lower() == 'to':
                # Store the sentence under this POS tag
                sents_with_to[pos].append(sent)
                to_counts[pos] += 1

for pos in sents_with_to:
    for sent in sents_with_to[pos]:
        print(pos, sent)
To access the sentences of a specific POS:
for sent in sents_with_to['TO']:
    print(sent)
You'll notice that if 'to' with a specific POS appears twice in a sentence, the sentence is recorded twice in sents_with_to[pos]. If you want to remove the duplicates, try:
# Flatten each sentence into a "word#POS word#POS ..." string so it can go in a set
sents_with_to_and_TO = set(" ".join("#".join((word, pos)) for word, pos in sent) for sent in sents_with_to['TO'])
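Another way to drop the repeats, while keeping the sentences as lists of (word, tag) pairs and in their original order, is to track which sentences have already been seen. This is only a sketch; dedupe_sents is a hypothetical helper and the sample sentence is made up.

```python
def dedupe_sents(sents):
    """Return sents without repeats, preserving first-seen order."""
    seen = set()
    unique = []
    for sent in sents:
        key = tuple(sent)  # lists are unhashable; a tuple of pairs is fine
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique


# Hypothetical sentence with two 'to'/TO tokens, so it was stored twice.
s = [('to', 'TO'), ('be', 'BE'), ('or', 'CC'), ('not', '*'), ('to', 'TO'), ('be', 'BE')]
print(len(dedupe_sents([s, s])))  # prints 1
```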
With regard to why this isn't working:
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print(sent)
        if (word != 'to' and tag != 'TO-HL'):
            break
Before the explanation: your code is not really close to the output that you want, because your if statements are not doing what you need.
First you need to understand what the multiple conditions (i.e. the 'if's) are doing.
# Loop through the sentences
for sent in brown_sents:
    # Loop through each word with its POS
    for word, tag in sent:
        # For each (word, tag) pair, check whether it is 'to' tagged 'TO-HL':
        if word == 'to' and tag == 'TO-HL':
            print(sent)  # If the condition is true, print sent
        # After checking the first if, you continue to check the second if:
        # if word is not 'to' and tag is not 'TO-HL',
        # you break out of the sentence. Note that you are still
        # in the same iteration as the previous condition.
        if word != 'to' and tag != 'TO-HL':
            break
Now let's start with a basic if-else statement:
>>> from nltk.corpus import brown
>>> first_sent = brown.tagged_sents()[0]
>>> first_sent
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> for word, pos in first_sent:
...     if word != 'to' and pos != 'TO-HL':
...         break
...     else:
...         print('say hi')
...
>>>
In the example above we loop through each (word, POS) pair in the sentence, and at EVERY pair the if condition checks whether the word is not 'to' and the POS is not 'TO-HL'. If that is the case, it breaks and never says hi to you.
So if you keep your code with these if conditions, you will ALWAYS break without continuing the loop whenever the first word in the sentence is not 'to' and its POS is not the one you are matching.
In fact, your second if condition lets the loop continue only while EVERY word so far is 'to' or is tagged 'TO-HL'.
What you want to do is to check (1) whether 'to' is in the sentence, and then (2) whether that 'to' has the POS tag you are looking for.
So the if condition you need for condition (1) is:
>>> from nltk.corpus import brown
>>> three_sents = brown.tagged_sents()[:3]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         print(sent)
...
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
Now you know that if 'to' in dict(sent) checks whether 'to' is in the sentence.
Then to check for condition (2):
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         if dict(sent)['to'] == 'TO':
...             print(sent)
...
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         if dict(sent)['to'] == 'TO-HL':
...             print(sent)
...
>>>
Now you see that if dict(sent)['to'] == 'TO-HL', placed AFTER the check if 'to' in dict(sent), is what applies the POS restriction.
But note that if there are two occurrences of 'to' in the sentence, dict(sent)['to'] only captures the POS of the final 'to'. That is why you need the defaultdict(list) suggested in the previous answer.
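A quick illustration of that last point, using a made-up sentence in which 'to' carries two different tags (the sentence and the name tags_by_word are assumptions for this sketch):

```python
from collections import defaultdict

# Hypothetical sentence in which 'to' carries two different tags.
sent = [('went', 'VBD'), ('to', 'IN'), ('town', 'NN'), ('to', 'TO'), ('shop', 'VB')]

# dict() keeps one value per key, so the later ('to', 'TO') pair wins.
print(dict(sent)['to'])  # prints TO

# Collect every tag per word instead of only the last one.
tags_by_word = defaultdict(list)
for word, tag in sent:
    tags_by_word[word].append(tag)
print(tags_by_word['to'])  # prints ['IN', 'TO']
```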
There is really no clean way to perform the checks, and the most efficient way is the one described in the previous answer, sigh.