
Feature extraction NLP

Original post: 2018-12-29 09:09:09. Tags: python, machine-learning, nlp, doc2vec

I'm working on a reviews dataset. The problem is to extract the important positive and negative features of a specific product from its reviews, where importance is the number of times the same feature is mentioned across reviews.

Example, for some car xyz:

Positive: great mileage, good looking, spacious, etc.

Negative: poor power, bad performance, software problems, etc.

The goal is to extract the best and worst things about the product.

So far I've used gensim's doc2vec to find the top positive and negative sentences. The results are not very good, because it retrieves sentences with similar structure rather than sentences that mention similar features.

2 answers

Some write-ups of the "Word Mover's Distance" (WMD) calculation, which identifies similar sentences or phrases, use reviews as their dataset and seem to extract common themes and representative phrases well.

See, for example:

"Navigating themes in restaurant reviews with Word Mover's Distance" http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/

"Finding similar documents with Word2Vec and WMD" https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html

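To make the idea concrete, here is a minimal sketch of the relaxed lower bound of Word Mover's Distance, in which every word simply travels to its nearest word in the other text. This is not the full WMD (which solves an optimal transport problem; gensim exposes it as `KeyedVectors.wmdistance`), and the 2-d vectors in `VECTORS` are invented for illustration. Real use would load pretrained word embeddings.

```python
import math

# Toy 2-d word vectors, invented for illustration only; real use would load
# pretrained embeddings (e.g. word2vec) instead.
VECTORS = {
    "great": (0.9, 0.1),
    "good": (0.8, 0.2),
    "mileage": (0.1, 0.9),
    "economy": (0.2, 0.8),
    "fuel": (0.15, 0.85),
    "poor": (-0.9, 0.1),
    "power": (0.5, -0.7),
}


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst):
    """Average distance from each word in src to its nearest word in dst."""
    return sum(
        min(euclidean(VECTORS[s], VECTORS[d]) for d in dst) for s in src
    ) / len(src)


def relaxed_wmd(doc_a, doc_b):
    """Relaxed WMD: the larger of the two one-sided nearest-word costs."""
    a = [w for w in doc_a if w in VECTORS]
    b = [w for w in doc_b if w in VECTORS]
    if not a or not b:
        return float("inf")  # nothing comparable in one of the texts
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


query = ["great", "mileage"]
candidates = {
    "good fuel economy": ["good", "fuel", "economy"],
    "poor power": ["poor", "power"],
}
# Smaller distance means the candidate phrase is closer in meaning to the query.
ranked = sorted(candidates, key=lambda name: relaxed_wmd(query, candidates[name]))
print(ranked)
```

The point of the sketch is that "good fuel economy" ranks closer to "great mileage" than "poor power" does even though the phrases share no words, which is the property that makes WMD attractive for grouping review phrases by the feature they describe.
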
It looks like you want to extract the product features that are mentioned most often in your reviews. This is a typical topic clustering problem. You could use a Latent Dirichlet Allocation (LDA) model to do the topic clustering.
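
As a rough illustration of LDA topic clustering, the following is a small collapsed Gibbs sampler written with the standard library only. The reviews, the stop-word list, and the hyperparameters `alpha` and `beta` are all assumptions chosen for the sketch; on a real dataset a library implementation such as gensim's `LdaModel` would be the more practical choice.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return the two count tables."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # times word w is in topic k
    topic_total = [0] * num_topics                       # tokens in topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Made-up reviews for illustration.
reviews = [
    "great mileage and fuel economy",
    "mileage is great fuel cost is low",
    "software crashes and screen freezes",
    "screen software is buggy and freezes",
]
STOPWORDS = {"and", "is"}
docs = [[w for w in r.split() if w not in STOPWORDS] for r in reviews]

doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = [w for w, c in counts.most_common(3) if c > 0]
    print(f"topic {k}: {top}")
```

The top words of each topic are candidate product features. With a corpus this small the split is not guaranteed to be clean, so the number of topics and the hyperparameters need tuning on real data.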

This approach gives you the features. You can then run a sentiment analysis model to determine whether the sentiment towards each feature is positive or negative.
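
A minimal sketch of that second step, assuming the feature terms are already known (for example, taken from topic-model output) and using a tiny hand-made sentiment lexicon in place of a trained sentiment model. `FEATURES`, `POSITIVE`, and `NEGATIVE` are hypothetical placeholders, and scoring whole sentences is a simplification.

```python
from collections import Counter

# Hypothetical inputs: feature terms and a toy sentiment lexicon.
FEATURES = {"mileage", "power", "software"}
POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"poor", "bad", "buggy"}


def feature_sentiment(sentences):
    """Count positive and negative mentions of each feature, one vote per sentence."""
    pos, neg = Counter(), Counter()
    for sentence in sentences:
        cleaned = sentence.lower().replace(",", " ").replace(".", " ")
        words = set(cleaned.split())
        # Sentence polarity: positive lexicon hits minus negative lexicon hits.
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        for feature in words & FEATURES:
            if score > 0:
                pos[feature] += 1
            elif score < 0:
                neg[feature] += 1
    return pos, neg


# Made-up review sentences for illustration.
sentences = [
    "Great mileage.",
    "The mileage is good",
    "Poor power",
    "Software is buggy",
    "Bad software, poor power",
]
pos, neg = feature_sentiment(sentences)
print("positive:", pos.most_common())
print("negative:", neg.most_common())
```

Sorting features by these counts gives the "most frequently praised" and "most frequently criticized" lists the question asks for. A sentence that mixes praise for one feature with criticism of another would be mis-scored by this sentence-level shortcut.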

If you happen to know the features already and want to group them into clusters, look at this Q&A and the paper mentioned in the question.