Text classification with two-word tokens

Question

I'm trying to extract key information from advertisements using NLTK and word tokenization.

For example: "The room has max capacity of 800 people no smoking allowed no children above 12 yr old ..."

My question is:

- "max capacity" has a different meaning from "capacity".
- "no smoking" is different from "smoking".

How can I tokenize composite words for analysis? I don't want to split the phrase into ["no", "smoking"]; I want a single token ["no smoking"].

word_tokenize(text)

Also, when I tokenize and remove stop words, I lose the negative meaning of the words.

Answer 1

I think what you're looking for is NLTK's ngrams:

from nltk import ngrams

text = "The room has max capacity of 800 people no smoking allowed no children above 12 yr old ..."

pairs = ngrams(text.split(), 2) # change the 2 here to however many words you want in each group

for pair in pairs:
    print(pair)

> ('The', 'room')
('room', 'has')
('has', 'max')
('max', 'capacity')
('capacity', 'of')
('of', '800')
('800', 'people')
('people', 'no')
('no', 'smoking')
('smoking', 'allowed')
('allowed', 'no')
('no', 'children')
('children', 'above')
('above', '12')
('12', 'yr')
('yr', 'old')
('old', '...')

Hope this helps.
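On the stop-word part of the question: the following is not from the answer itself, just a minimal sketch of one possible approach, which is to keep negation words out of the stop list and glue each one to the word that follows it. The stop list, the NEGATIONS set, and both helper functions are hypothetical names made up for illustration; a real stop list (such as NLTK's) is much longer.

```python
# Tiny hand-written stop list for illustration only (an assumption, not NLTK's list).
STOP_WORDS = {"the", "has", "of", "no", "not"}
# Words that flip meaning and should survive stop-word removal (assumed set).
NEGATIONS = {"no", "not"}


def content_tokens(text):
    """Lowercase, split on whitespace, and drop stop words except negations."""
    to_drop = STOP_WORDS - NEGATIONS
    return [word for word in text.lower().split() if word not in to_drop]


def merge_negations(tokens):
    """Join each negation with the word after it: 'no', 'smoking' -> 'no smoking'."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


text = "The room has max capacity of 800 people no smoking allowed no children above 12 yr old"
tokens = merge_negations(content_tokens(text))
print(tokens)
```

With this, "no smoking" and "no children" each come out as a single token, while ordinary stop words such as "the" and "of" are still removed.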

Edit:

If you're then going to use TF-IDF, may I recommend sklearn.feature_extraction.text.TfidfVectorizer, which takes ngram_range as a parameter. Setting ngram_range=(2, 2) would give you the pairs you're after, so you don't need to run the code above beforehand.

Answer 1: accepted on 2019-03-11 21:05:55