How to filter specific POS tags from a list of lists into separate lists?

I have a large set of product descriptions and need to separate the product names and the intent from the descriptions. After tagging the text with POS tags, I found that pulling out the NNP-tagged tokens is somewhat helpful for further cleansing.

My data looks like the following. I want to keep only the NNP-tagged words, with each sentence's matches in its own list, but I have not been able to do so.

data = [[('User', 'NNP'),
  ('is', 'VBZ'),
  ('not', 'RB'),
  ('able', 'JJ'),
  ('to', 'TO'),
  ('order', 'NN'),
  ('products', 'NNS'),
  ('from', 'IN'),
  ('iShopCatalog', 'NN'),
  ('Coala', 'NNP'),
  ('excluding', 'VBG'),
  ('articles', 'NNS'),
  ('from', 'IN'),
  ('VWR', 'NNP')],
 [('Arfter', 'NNP'),
  ('transferring', 'VBG'),
  ('the', 'DT'),
  ('articles', 'NNS'),
  ('from', 'IN'),
  ('COALA', 'NNP'),
  ('to', 'TO'),
  ('SRM', 'VB'),
  ('the', 'DT'),
  ('Category', 'NNP'),
  ('S9901', 'NNP'),
  ('Dummy', 'NNP'),
  ('is', 'VBZ'),
  ('maintained', 'VBN')],
 [('Due', 'JJ'),
  ('to', 'TO'),
  ('this', 'DT'),
  ('the', 'DT'),
  ('user', 'NN'),
  ('is', 'VBZ'),
  ('not', 'RB'),
  ('able', 'JJ'),
  ('to', 'TO'),
  ('order', 'NN'),
  ('the', 'DT'),
  ('product', 'NN')],
 [('All', 'DT'),
  ('other', 'JJ'),
  ('users', 'NNS'),
  ('can', 'MD'),
  ('order', 'NN'),
  ('these', 'DT'),
  ('articles', 'NNS')],
 [('She', 'PRP'),
  ('can', 'MD'),
  ('order', 'NN'),
  ('other', 'JJ'),
  ('products', 'NNS'),
  ('from', 'IN'),
  ('a', 'DT'),
  ('POETcatalog', 'NNP'),
  ('without', 'IN'),
  ('any', 'DT'),
  ('problems', 'NNS')],
 [('Furtheremore', 'IN'),
  ('she', 'PRP'),
  ('is', 'VBZ'),
  ('able', 'JJ'),
  ('to', 'TO'),
  ('order', 'NN'),
  ('products', 'NNS'),
  ('from', 'IN'),
  ('the', 'DT'),
  ('Vendor', 'NNP'),
  ('VWR', 'NNP'),
  ('through', 'IN'),
  ('COALA', 'NNP')],
 [('But', 'CC'),
  ('articles', 'NNS'),
  ('from', 'IN'),
  ('all', 'DT'),
  ('other', 'JJ'),
  ('suppliers', 'NNS'),
  ('are', 'VBP'),
  ('not', 'RB'),
  ('orderable', 'JJ')],
 [('I', 'PRP'),
  ('already', 'RB'),
  ('spoke', 'VBD'),
  ('to', 'TO'),
  ('anic', 'VB'),
  ('who', 'WP'),
  ('maintain', 'VBP'),
  ('the', 'DT'),
  ('catalog', 'NN'),
  ('COALA', 'NNP'),
  ('and', 'CC'),
  ('they', 'PRP'),
  ('said', 'VBD'),
  ('that', 'IN'),
  ('the', 'DT'),
  ('reason', 'NN'),
  ('should', 'MD'),
  ('be', 'VB'),
  ('the', 'DT'),
  ('assignment', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('plant', 'NN')],
 [('User', 'NNP'),
  ('is', 'VBZ'),
  ('a', 'DT'),
  ('assinged', 'JJ'),
  ('to', 'TO'),
  ('Universitaet', 'NNP'),
  ('Regensburg', 'NNP'),
  ('in', 'IN'),
  ('Scout', 'NNP'),
  ('but', 'CC'),
  ('in', 'IN'),
  ('P17', 'NNP'),
  ('table', 'NN'),
  ('YESRMCDMUSER01', 'NNP'),
  ('she', 'PRP'),
  ('is', 'VBZ'),
  ('assigned', 'VBN'),
  ('to', 'TO'),
  ('company', 'NN'),
  ('001500', 'CD'),
  ('Merck', 'NNP'),
  ('KGaA', 'NNP')],
 [('Please', 'NNP'),
  ('find', 'VB'),
  ('attached', 'JJ'),
  ('some', 'DT'),
  ('screenshots', 'NNS')]]

I wrote the following code:

def prodname(a):
    p = []
    for i in a:
        for j in range(len(i)):
            if i[j][1]=='NNP':
                p.append(i[j][0])
    return p

which gives the following output:

    ['User',
     'Coala',
     'VWR',
     'Arfter',
     'COALA',
     'Category',
     'S9901',
     'Dummy',
     'POETcatalog',
     'Vendor',
     'VWR',
     'COALA',
     'COALA',
     'User',
     'Universitaet',
     'Regensburg',
     'Scout',
     'P17',
     'YESRMCDMUSER01',
     'Merck',
     'KGaA',
     'Please']

The output I would like to get is:

[['User',
  'Coala',
  'VWR'],
 ['Arfter',
  'COALA',
  'Category',
  'S9901',
  'Dummy'],
 [],
 [],
 ['POETcatalog'],
 ['Vendor',
  'VWR',
  'COALA'],
 [],
 ['COALA'],
 ['User',
  'Universitaet',
  'Regensburg',
  'Scout',
  'P17',
  'YESRMCDMUSER01',
  'Merck',
  'KGaA'],
 ['Please']]

I also tried using [[] for i in range(len(data))] to create one list per sentence and append to the respective lists, but couldn't get it to work.

You can just use this list comprehension:

[[j[0] for j in i if j[-1]=="NNP"] for i in data]

Output:

[['User', 'Coala', 'VWR'], ['Arfter', 'COALA', 'Category', 'S9901', 'Dummy'], [], [], ['POETcatalog'], ['Vendor', 'VWR', 'COALA'], [], ['COALA'], ['User', 'Universitaet', 'Regensburg', 'Scout', 'P17', 'YESRMCDMUSER01', 'Merck', 'KGaA'], ['Please']]
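For readers who prefer an explicit loop, the same idea can be sketched as a small change to the function from the question: start a fresh list for every sentence and append that list to the result, so sentences without any NNP word still produce an empty list. This is only an illustrative sketch; the `tag` parameter and the shortened `sample` data are assumptions added here, not part of the original post.

```python
def prodname(tagged_sentences, tag="NNP"):
    """Return one list per sentence containing only the words tagged `tag`."""
    result = []
    for sentence in tagged_sentences:
        # A fresh list per sentence keeps the matches grouped by sentence.
        kept = []
        for word, pos in sentence:
            if pos == tag:
                kept.append(word)
        # Append even when empty so positions line up with the input sentences.
        result.append(kept)
    return result


# Shortened, hypothetical sample built from a few tokens of the question's data.
sample = [
    [("User", "NNP"), ("is", "VBZ"), ("VWR", "NNP")],
    [("Due", "JJ"), ("to", "TO"), ("this", "DT")],
]
print(prodname(sample))  # [['User', 'VWR'], []]
```

The inner loop plus `append` is exactly what the inner comprehension does; the outer loop corresponds to the outer comprehension.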

A list comprehension is the way to go, but @McGrady's answer might be a little hard to read.

Here's an easier-to-read solution:

document = [[('User', 'NNP'), ('is', 'VBZ'), ('not', 'RB'), ('able', 'JJ'), ('to', 'TO'), ('order', 'NN'), ('products', 'NNS'), ('from', 'IN'), ('iShopCatalog', 'NN'), ('Coala', 'NNP'), ('excluding', 'VBG'), ('articles', 'NNS'), ('from', 'IN'), ('VWR', 'NNP')], [('Arfter', 'NNP'), ('transferring', 'VBG'), ('the', 'DT'), ('articles', 'NNS'), ('from', 'IN'), ('COALA', 'NNP'), ('to', 'TO'), ('SRM', 'VB'), ('the', 'DT'), ('Category', 'NNP'), ('S9901', 'NNP'), ('Dummy', 'NNP'), ('is', 'VBZ'), ('maintained', 'VBN')], [('Due', 'JJ'), ('to', 'TO'), ('this', 'DT'), ('the', 'DT'), ('user', 'NN'), ('is', 'VBZ'), ('not', 'RB'), ('able', 'JJ'), ('to', 'TO'), ('order', 'NN'), ('the', 'DT'), ('product', 'NN')], [('All', 'DT'), ('other', 'JJ'), ('users', 'NNS'), ('can', 'MD'), ('order', 'NN'), ('these', 'DT'), ('articles', 'NNS')], [('She', 'PRP'), ('can', 'MD'), ('order', 'NN'), ('other', 'JJ'), ('products', 'NNS'), ('from', 'IN'), ('a', 'DT'), ('POETcatalog', 'NNP'), ('without', 'IN'), ('any', 'DT'), ('problems', 'NNS')], [('Furtheremore', 'IN'), ('she', 'PRP'), ('is', 'VBZ'), ('able', 'JJ'), ('to', 'TO'), ('order', 'NN'), ('products', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('Vendor', 'NNP'), ('VWR', 'NNP'), ('through', 'IN'), ('COALA', 'NNP')], [('But', 'CC'), ('articles', 'NNS'), ('from', 'IN'), ('all', 'DT'), ('other', 'JJ'), ('suppliers', 'NNS'), ('are', 'VBP'), ('not', 'RB'), ('orderable', 'JJ')], [('I', 'PRP'), ('already', 'RB'), ('spoke', 'VBD'), ('to', 'TO'), ('anic', 'VB'), ('who', 'WP'), ('maintain', 'VBP'), ('the', 'DT'), ('catalog', 'NN'), ('COALA', 'NNP'), ('and', 'CC'), ('they', 'PRP'), ('said', 'VBD'), ('that', 'IN'), ('the', 'DT'), ('reason', 'NN'), ('should', 'MD'), ('be', 'VB'), ('the', 'DT'), ('assignment', 'NN'), ('of', 'IN'), ('the', 'DT'), ('plant', 'NN')], [('User', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('assinged', 'JJ'), ('to', 'TO'), ('Universitaet', 'NNP'), ('Regensburg', 'NNP'), ('in', 'IN'), ('Scout', 'NNP'), ('but', 'CC'), ('in', 'IN'), ('P17', 
'NNP'), ('table', 'NN'), ('YESRMCDMUSER01', 'NNP'), ('she', 'PRP'), ('is', 'VBZ'), ('assigned', 'VBN'), ('to', 'TO'), ('company', 'NN'), ('001500', 'CD'), ('Merck', 'NNP'), ('KGaA', 'NNP')], [('Please', 'NNP'), ('find', 'VB'), ('attached', 'JJ'), ('some', 'DT'), ('screenshots', 'NNS')]]
output = [[word for word, pos in sentence if pos=='NNP'] for sentence in document]

If you like cleaner code and can wrap your head around nested list comprehensions (see https://stackoverflow.com/a/3633145/610569), the comprehension can also be flattened. Note that this form returns a single flat list of all NNP words rather than one list per sentence:

output = [word for sentence in document for word, pos in sentence if pos=='NNP']
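The two comprehension shapes in this answer return different structures, which is easy to miss. A minimal sketch on a shortened, hypothetical sample (a few tokens taken from the question's data) shows the difference:

```python
# Shortened sample; the real `document` is the full tagged data from the question.
document = [
    [("Vendor", "NNP"), ("VWR", "NNP"), ("through", "IN")],
    [("But", "CC"), ("articles", "NNS")],
    [("catalog", "NN"), ("COALA", "NNP")],
]

# Comprehension inside a comprehension: one list per sentence.
nested = [[word for word, pos in sentence if pos == "NNP"] for sentence in document]

# Single comprehension with two `for` clauses: everything in one flat list.
flat = [word for sentence in document for word, pos in sentence if pos == "NNP"]

print(nested)  # [['Vendor', 'VWR'], [], ['COALA']]
print(flat)    # ['Vendor', 'VWR', 'COALA']
```

Only the nested form matches the output requested in the question; the flat form is equivalent to the original `prodname` function.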

