n-grams from text in python
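An update to my previous post, with some changes:
Say I have 100 tweets. In those tweets, I need to extract: 1) food names, and 2) beverage names. I also need to attach the type (drink or food) and an id number (each item has a unique id) to each extraction.
I already have a lexicon with names, types and id numbers: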
lexicon = {
'dr pepper': {'type': 'drink', 'id': 'd_123'},
'coca cola': {'type': 'drink', 'id': 'd_234'},
'cola': {'type': 'drink', 'id': 'd_345'},
'banana': {'type': 'food', 'id': 'f_456'},
'banana split': {'type': 'food', 'id': 'f_567'},
'cream': {'type': 'food', 'id': 'f_678'},
'ice cream': {'type': 'food', 'id': 'f_789'}}
Example tweet:
After various processing of "tweet_1", I end up with the following sentences:
sentences = [
'dr pepper is better than coca cola and suits banana split with ice cream',
'coca cola and banana is not a good combo']
My requested output (it can be a type other than list):
["tweet_id_1",
[[["dr pepper"], ["drink", "d_123"]],
[["coca cola"], ["drink", "d_234"]],
[["banana split"], ["food", "f_567"]],
[["ice cream"], ["food", "f_789"]]],
"tweet_id_1",
[[["coca cola"], ["drink", "d_234"]],
[["banana"], ["food", "f_456"]]]]
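It is important that the output does NOT extract unigrams that sit inside ngrams (n > 1). In other words, this is the output I do not want: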
["tweet_id_1",
[[["dr pepper"], ["drink", "d_123"]],
[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana split"], ["food", "f_567"]],
[["banana"], ["food", "f_456"]],
[["ice cream"], ["food", "f_789"]],
[["cream"], ["food", "f_678"]]],
"tweet_id_1",
[[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana"], ["food", "f_456"]]]]
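Ideally, I would like to be able to run my sentences through various nltk filters, such as lemmatize() and pos_tag(), before the extraction, to get output like the following. However, with this regex solution, if I do that, either all words are split into unigrams, or 1 unigram and 1 bigram are generated from the string "coca cola", which produces output I do not want (as in the example above). The ideal output (again, the type of the output is not important):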
["tweet_id_1",
[[[("dr pepper", "NN")], ["drink", "d_123"]],
[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana split", "NN")], ["food", "f_567"]],
[[("ice cream", "NN")], ["food", "f_789"]]],
"tweet_id_1",
[[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana", "NN")], ["food", "f_456"]]]]
This might not be the most efficient solution, but it will definitely get you started:
sentences = [
'dr pepper is better than coca cola and suits banana split with ice cream',
'coca cola and banana is not a good combo']
lexicon = {
'dr pepper': {'type': 'drink', 'id': 'd_123'},
'coca cola': {'type': 'drink', 'id': 'd_234'},
'cola': {'type': 'drink', 'id': 'd_345'},
'banana': {'type': 'food', 'id': 'f_456'},
'banana split': {'type': 'food', 'id': 'f_567'},
'cream': {'type': 'food', 'id': 'f_678'},
'ice cream': {'type': 'food', 'id': 'f_789'}}
# Sort the phrases so that those with more words are tried first
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)
chunks = []
for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            chunks.append({lex: list(lexicon[lex].values())})
            # Remove the matched phrase so the unigrams inside it are not matched again
            sentence = sentence.replace(lex, '')
print(chunks)
Output
[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]
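Explanation
lexicon_list = list(lexicon.keys())
takes the list of phrases that need to be searched for; the sort() call that follows orders them by number of words (so that bigger chunks are found first).
The output is a list of dicts, where each dict has a list as its value.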
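Two caveats with the substring test above: `lex in sentence` also fires inside longer words ('cola' is a substring of 'chocolate'), and the hits come out in lexicon order rather than sentence order. A minimal sketch of one way around both, assuming whitespace tokenization is good enough for the cleaned sentences; the function name `extract_entities` is made up for illustration:

```python
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}


def extract_entities(sentence, lexicon):
    """Greedy longest-match lookup of lexicon phrases over whitespace tokens."""
    tokens = sentence.split()
    # Longest phrase (counted in words) that could possibly match
    max_len = max((len(phrase.split()) for phrase in lexicon), default=0)
    found = []
    i = 0
    while i < len(tokens):
        # Try the widest window first so "coca cola" wins over "cola"
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                entry = lexicon[candidate]
                found.append((candidate, entry["type"], entry["id"]))
                i += n  # skip the tokens consumed by this phrase
                break
        else:
            i += 1  # no phrase starts at this token, move on
    return found


sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']
for sentence in sentences:
    print(extract_entities(sentence, lexicon))
```

Because matching happens on whole tokens, a merged phrase could also be passed on as a single token to later steps, which is close to what the question asks for.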
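Unfortunately I cannot comment because of my low reputation, but Vivek's answer can be improved with: 1) a regex, 2) including the pos_tag tokens as NN, 3) a dictionary structure from which you can select the results tweet by tweet: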
import re
import nltk
from collections import OrderedDict
tweets = {"tweet_1": ['dr pepper is better than coca cola and suits banana split with ice cream', 'coca cola and banana is not a good combo']}
lexicon = {
'dr pepper': {'type': 'drink', 'id': 'd_123'},
'coca cola': {'type': 'drink', 'id': 'd_234'},
'cola': {'type': 'drink', 'id': 'd_345'},
'banana': {'type': 'food', 'id': 'f_456'},
'banana split': {'type': 'food', 'id': 'f_567'},
'cream': {'type': 'food', 'id': 'f_678'},
'ice cream': {'type': 'food', 'id': 'f_789'}}
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key = lambda s: len(s.split()), reverse=True)
# A single compiled regex scans each sentence once, instead of once per phrase with the "in" operator.
# re.escape guards against phrases that contain regex metacharacters.
pattern = "(" + "|".join(re.escape(lex) for lex in lexicon_list) + ")"
pattern = re.compile(pattern)
# Build a dictionary mapping each phrase to its POS-tagged tokens
lexicon_pos_tag = {word: nltk.pos_tag(nltk.word_tokenize(word)) for word in lexicon_list}
# If you train a model that recognizes e.g. "banana split" as ("banana split", "NN")
# rather than ("banana", "NN") and ("split", "NN"), you could tag each phrase as one token:
# lexicon_pos_tag = {word: nltk.pos_tag([word]) for word in lexicon_list}
# chunks maps each tweet id to its list of per-sentence results
chunks = OrderedDict()
for tweet in tweets:
    chunks[tweet] = []
    for sentence in tweets[tweet]:
        temp = OrderedDict()
        for word in pattern.findall(sentence):
            temp[word] = [lexicon_pos_tag[word], [lexicon[word]["type"], lexicon[word]["id"]]]
        chunks[tweet].append(temp)
The final output is:
OrderedDict([('tweet_1',
[OrderedDict([('dr pepper',
[[('dr', 'NN'), ('pepper', 'NN')],
['drink', 'd_123']]),
('coca cola',
[[('coca', 'NN'), ('cola', 'NN')],
['drink', 'd_234']]),
('banana split',
[[('banana', 'NN'), ('split', 'NN')],
['food', 'f_567']]),
('ice cream',
[[('ice', 'NN'), ('cream', 'NN')],
['food', 'f_789']])]),
OrderedDict([('coca cola',
[[('coca', 'NN'), ('cola', 'NN')],
['drink', 'd_234']]),
('banana',
[[('banana', 'NN')], ['food', 'f_456']])])])])
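One thing a plain alternation does not guard against is a hit inside a longer word, such as 'cola' inside 'chocolate'. A minimal sketch of a stricter pattern, assuming word boundaries (`\b`) are the right notion of a token edge for these tweets; the helper name `find_phrases` is made up for illustration, and nltk is left out so that the snippet runs on the standard library alone:

```python
import re

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

# Phrases with more words go first: an alternation takes the first
# alternative that matches, so "banana split" must precede "banana".
phrases = sorted(lexicon, key=lambda p: len(p.split()), reverse=True)
# re.escape protects against regex metacharacters in a phrase;
# \b keeps every match aligned to word boundaries.
pattern = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b")


def find_phrases(sentence):
    """Return (phrase, start, end) for every lexicon hit, in sentence order."""
    return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(sentence)]


print(find_phrases('coca cola and chocolate ice cream'))
```

The character offsets returned by `finditer` make it possible to map each hit back onto a token sequence later, for example when attaching POS tags.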
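I would filter with a for loop, using an if statement to look for each key as a substring of the sentence. If you want to include unigrams as well, remove
len(k.split()) > 1
If you only want to include unigrams, change it to:
len(k.split()) == 1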
filtered_list = ['tweet_id_1']
for k, v in lexicon.items():
    for s in sentences:
        if k in s and len(k.split()) > 1:
            filtered_list.extend((k, v))
print(filtered_list)
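The three variants described above (multi-word only, unigrams only, or everything) can be folded into one function by passing the word-count condition in as a parameter. This is only a sketch: `filter_matches` and its `keep` argument are names invented here, and unlike the loop above it reports each phrase once, no matter how many sentences contain it.

```python
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']


def filter_matches(sentences, lexicon, keep=lambda n: n > 1):
    """Collect lexicon entries that occur in any sentence and whose word count passes `keep`."""
    hits = []
    for phrase, info in lexicon.items():
        n_words = len(phrase.split())
        # Same substring test as the loop above, applied across all sentences
        if keep(n_words) and any(phrase in s for s in sentences):
            hits.append((phrase, info))
    return hits


print(filter_matches(sentences, lexicon))                          # multi-word phrases only
print(filter_matches(sentences, lexicon, keep=lambda n: n == 1))   # unigrams only
print(filter_matches(sentences, lexicon, keep=lambda n: True))     # everything
```

Note that the unigram-only call still reports 'cola' and 'cream', because they occur as substrings of 'coca cola' and 'ice cream'. Filtering by word count alone does not suppress unigrams nested in longer phrases.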