Extracting sentences from a POS-tagged corpus with certain word, tag combos

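I'm playing with the Brown corpus, specifically the tagged sentences in "news". I found that "to" is the word with the most ambiguous tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I'm trying to write code that prints one sentence from the corpus for each tag associated with "to". The sentences don't need to be "cleaned" of tags; they just need to contain "to" with each of the associated POS tags.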

import nltk

brown_sents = nltk.corpus.brown.tagged_sents(categories="news")
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == "IN"):
            print sent

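I tried the code above with just one of the POS tags to see whether it works, but it prints every matching sentence. I need it to print only the first sentence that matches the word, tag pair and then stop. I tried this: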

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'IN'):
            print sent
        if (word != 'to' and tag != 'IN'):
            break

This worked with the IN tag, because it's the first tag related to "to", but if I use:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print sent
        if (word != 'to' and tag != 'TO-HL'):
            break

it returns nothing. I think I'm so close - care to help?

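You can keep adding to your current code, but it doesn't consider the following:

  • What happens if "to" occurs more than once in a sentence, with the same or a different POS?
  • If "to" with the same POS appears twice in a sentence, do you want that sentence printed twice?
  • What happens if "to" is the first word of the sentence and is capitalized?

If you want to stick with your code, try this: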

from nltk.corpus import brown

brown_sents = brown.tagged_sents(categories="news")

def to_pos_sent(pos):
    for sent in brown_sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent

for sent in to_pos_sent('TO'):
    print sent

for sent in to_pos_sent('IN'):
    print sent
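
Because to_pos_sent is a generator, the "print only the first match and stop" part of the question can be handled by taking a single item from it. The following is a minimal sketch of that idea, not a definitive implementation. It is written for Python 3 (the snippets in this thread use Python 2 print statements), toy_sents is a made-up stand-in for brown.tagged_sents(categories="news"), and first_to_sent is a hypothetical helper name.

```python
# Hypothetical stand-in data; with NLTK installed you would pass
# brown.tagged_sents(categories="news") instead.
toy_sents = [
    [("He", "PPS"), ("went", "VBD"), ("to", "IN"), ("town", "NN")],
    [("She", "PPS"), ("wants", "VBZ"), ("to", "TO"), ("go", "VB")],
    [("Back", "RB"), ("to", "IN"), ("work", "NN")],
]

def to_pos_sent(sents, pos):
    """Yield each sentence in which 'to' carries the tag `pos`."""
    for sent in sents:
        for word, tag in sent:
            if word == "to" and tag == pos:
                yield sent
                break  # stop scanning this sentence so it is yielded once

def first_to_sent(sents, pos):
    """Return the first matching sentence, or None if nothing matches."""
    return next(to_pos_sent(sents, pos), None)

first_in = first_to_sent(toy_sents, "IN")
first_to = first_to_sent(toy_sents, "TO")
first_hl = first_to_sent(toy_sents, "TO-HL")  # no such tag in the toy data
```

The generator is lazy, so next() stops scanning the corpus as soon as one sentence has been found; the None default avoids a StopIteration when a tag never occurs.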

I suggest storing the sentences in a defaultdict(list), so you can retrieve them at any time.

from nltk.corpus import brown
from collections import Counter, defaultdict

sents_with_to = defaultdict(list)

to_counts = Counter()

for i, sent in enumerate(brown.tagged_sents(categories='news')):
    # Check if 'to' is in sentence.
    uniq_words = dict(sent)
    if 'to' in uniq_words or 'To' in uniq_words:
        # Iterate through the sentence to find 'to'
        for word, pos in sent:
            if word.lower()=='to':
                # Record the sentence under this POS tag and count it
                sents_with_to[pos].append(sent)
                to_counts[pos]+=1


for pos in sents_with_to:
    for sent in sents_with_to[pos]:
        print pos, sent

To access the sentences for a specific POS:

for sent in sents_with_to['TO']:
    print sent

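You'll notice that if "to" with a specific POS appears twice in a sentence, the sentence is recorded twice in sents_with_to[pos]. If you want to remove the duplicates, try: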

# Flatten each sentence into a "word#tag word#tag ..." string so it is hashable
sents_with_to_and_TO = set(" ".join("#".join((word, pos)) for word, pos in sent) for sent in sents_with_to['TO'])
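
If you would rather keep the sentences as lists of (word, tag) tuples instead of flattened strings, and keep them in corpus order, an alternative is to track which sentences have already been seen. This is a sketch under assumptions: it is Python 3, unique_sents is a hypothetical helper, and the sample sentences are made up rather than taken from the Brown corpus.

```python
def unique_sents(sents):
    """Drop repeated sentences while keeping the original order."""
    seen = set()
    unique = []
    for sent in sents:
        key = tuple(sent)  # a list is unhashable; a tuple of tuples is fine
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique

# Made-up data: `twice` contains 'to'/TO two times, so the collection loop
# above would have appended it to sents_with_to['TO'] two times.
twice = [("to", "TO"), ("be", "BE"), ("or", "CC"),
         ("not", "*"), ("to", "TO"), ("be", "BE")]
other = [("to", "TO"), ("go", "VB")]
recorded = [twice, twice, other]

deduped = unique_sents(recorded)
```

Unlike the set of strings, this keeps the tagged structure intact, at the cost of holding one extra tuple per distinct sentence.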

As to why this doesn't work:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print sent
        if (word != 'to' and tag != 'TO-HL'):
            break

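Before the explanation: your code isn't really close to the output you want, because your if conditions don't do what you need.

First, you need to understand what the multiple conditions (i.e. the "if"s) are doing.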

# Loop through the sentences
for sent in brown_sents:
  # Loop through each word with its POS
  for word, tag in sent:
    # For each (word, tag) pair, check whether it is 'to' tagged 'TO-HL'
    if word == 'to' and tag == 'TO-HL':
      print sent # If the condition is true, print sent
    # After checking the first if, you continue to check the second if:
    # if the word is not 'to' and the tag is not 'TO-HL',
    # you break out of the sentence. Note that you are still
    # in the same iteration as the previous condition.
    if word != 'to' and tag != 'TO-HL':
      break

Now let's start with a basic if-else statement:

>>> from nltk.corpus import brown
>>> first_sent = brown.tagged_sents()[0]
>>> first_sent
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> for word, pos in first_sent:
...     if word != 'to' and pos != 'TO-HL':
...             break
...     else:
...             print 'say hi'
... 
>>> 

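In the example above, we loop through each word + POS pair in the sentence. For the very first pair, the if condition checks that the word is not 'to' and the pos is not 'TO-HL'; since both are true for ('The', 'AT'), the loop breaks and never says hi to you.

So if you keep those conditions in your code, it will ALWAYS break without continuing the loop, because "to" is not the first word of the sentence and the first word's POS is not the one you're matching.

In effect, your if condition tries to check whether EVERY word is "to" and whether its POS tag is "TO-HL".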
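
To make the early break concrete, here is a small sketch that replays the same two conditions and reports how many tokens the inner loop looks at before it stops. It is Python 3, scan is a hypothetical helper, and both sentences are made up for illustration. The second sentence suggests one way the original attempt could appear to work for 'IN': the loop only survives tokens that are either "to" or already tagged 'IN'.

```python
def scan(sent, target_tag):
    """Replay the question's inner loop.

    Returns (tokens_examined, matched) instead of printing.
    """
    examined = 0
    matched = False
    for word, tag in sent:
        examined += 1
        if word == "to" and tag == target_tag:
            matched = True  # the original code printed the sentence here
        if word != "to" and tag != target_tag:
            break  # fires on the first token that is neither 'to' nor target_tag
    return examined, matched

# Made-up sentences, not quoted from the Brown corpus.
sent_a = [("The", "AT"), ("jury", "NN"), ("tried", "VBD"),
          ("to", "TO"), ("investigate", "VB")]
sent_b = [("According", "IN"), ("to", "IN"), ("him", "PPO")]

result_a = scan(sent_a, "TO")  # breaks on ('The', 'AT'), never reaches 'to'
result_b = scan(sent_b, "IN")  # 'According' is tagged IN, so the loop goes on
```

In sent_a the loop gives up after one token even though the sentence contains 'to'/TO; in sent_b it reaches "to" only because every earlier token happens to carry the target tag.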


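What you want to check is:

  1. whether "to" is in the sentence, rather than whether every word is "to", and then
  2. whether the "to" in the sentence has the POS tag you're looking for

So the if condition you need for check (1) is: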

>>> from nltk.corpus import brown
>>> three_sents = brown.tagged_sents()[:3]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             print sent
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]

Now you know that if 'to' in dict(sent) checks whether "to" is in the sentence.

Then, to check condition (2):

>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             if dict(sent)['to'] == 'TO':
...                     print sent
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             if dict(sent)['to'] == 'TO-HL':
...                     print sent
... 
>>> 

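Now you can see that if dict(sent)['to'] == 'TO-HL', placed after the if 'to' in dict(sent) check, is the condition that controls the POS restriction.

But you'll realize that if there are two "to"s in the sentence, dict(sent)['to'] only captures the POS of the last "to". That's why you need the defaultdict(list) suggested in the previous answer.

There's really no clean way to perform the check, and the most efficient way is the one in the previous answer, sigh.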
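
The overwriting behaviour is easy to see on a sentence with two differently tagged "to"s. The sketch below is Python 3 with a made-up sentence; to_tags and has_to_with_tag are hypothetical helper names showing one way to look at every occurrence instead of only the last one.

```python
# Made-up sentence in which 'to' appears twice with different tags.
sent = [("He", "PPS"), ("went", "VBD"), ("to", "IN"),
        ("town", "NN"), ("to", "TO"), ("shop", "VB")]

# dict() keeps one value per key, so the later ('to', 'TO') pair
# overwrites the earlier ('to', 'IN') pair.
last_tag = dict(sent)["to"]

def to_tags(sent):
    """Return the tag of every 'to' in the sentence, ignoring case."""
    return [tag for word, tag in sent if word.lower() == "to"]

def has_to_with_tag(sent, target_tag):
    """True if any 'to' in the sentence carries target_tag."""
    return target_tag in to_tags(sent)

all_tags = to_tags(sent)
found_in = has_to_with_tag(sent, "IN")
```

Here the dict lookup alone would miss the IN reading, while scanning the pairs directly finds both tags.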
