

How to proceed with feature extraction in NLP

I am building a multi-class text classifier that uses a data set from a job portal. The data set consists of organisation names mapped to an actual (canonical) name (see below). I want to build an ML model that can predict the actual organisation name.
My data set looks like this:

Flipkart.com        flipkart

FlipKart pvt ltd    flipkart

flipkart.com        flipkart

My question is this:

A.) What kind of features can I extract?
B.) Should my feature extractor use the labels of the training set too?
C.) What should my features look like, since they are supposed to be a dict for the NB classifier? Which key maps to which value?

I'm new to NLP; any help would be appreciated. Source code is on GitHub.

I would leave machine learning out of the equation. What you're trying to do is fuzzy matching, potentially with some synonym handling.

An expensive technique is the Levenshtein distance formula; a cheaper technique, though just as effective in some cases, is token/n-gram chunking and indexing.
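As a concrete illustration, the Levenshtein distance mentioned above can be computed with the standard dynamic-programming recurrence. This is a minimal sketch, not code from the asker's repository:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
print(levenshtein("flipkart.com", "flipkart"))
```

For this data set you would lowercase both strings first and pick the candidate with the smallest distance; the quadratic cost per pair is why the answer calls it expensive.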

Make a dictionary of n-grams, where n is the length of the gram. With n = 3, the grams for 'Flipkart.com' are 'Fli', 'lip', 'ipk', 'pka', etc., with the key being the n-gram and the value being a list of matches that contain that n-gram. For each n-gram in the input string, look it up in the dict (roughly O(m log n), where n is the number of indexed n-grams and m is the number of n-grams in the input string), and tally the results until you have a 'score' for each match according to how many n-grams it shares with the input string.
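The indexing-and-tallying scheme could be sketched like this (the names build_index and score are illustrative, not from the question's code):

```python
from collections import defaultdict

def ngrams(text: str, n: int = 3) -> list:
    """All contiguous lowercase character n-grams of the text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(names, n: int = 3) -> dict:
    """Map each n-gram to the set of candidate names containing it."""
    index = defaultdict(set)
    for name in names:
        for gram in ngrams(name, n):
            index[gram].add(name)
    return index

def score(query: str, index: dict, n: int = 3) -> list:
    """Tally, per candidate, how many of the query's n-grams it shares."""
    tally = defaultdict(int)
    for gram in ngrams(query, n):
        for name in index.get(gram, ()):
            tally[name] += 1
    return sorted(tally.items(), key=lambda kv: -kv[1])

names = ["Flipkart.com", "FlipKart pvt ltd", "Amazon India"]
index = build_index(names)
print(score("flipkart online", index))
```

The highest-scoring candidates share the most n-grams with the query; candidates sharing none (here, 'Amazon India') never appear in the tally at all, which is what makes the lookup cheap.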

The 'chunking' I mentioned is indexing 'chunks', i.e. sets of n-grams, and performing the same task. For example, ['Fli', 'lip', 'ipk'] would be indexed and used to tally results.

These techniques can be performed using 'tokens' as well, rather than or in addition to n-grams, to capture entire words that match.

None of this requires statistics; instead, it leverages an understanding of language.

Alternatively, you can try to derive a meaningful set of features from a list of short strings and map it to an extremely large set of classes. This would be an extremely difficult task, so I recommend the fuzzy-matching approach.

First of all, you have to convert all the text data into a machine-readable form, as machine learning algorithms only understand vectors.

1) Find the vocabulary of the dataset

2) Use CountVectorizer() or TfidfVectorizer() to convert the text into vectors

3) Now train a Naive Bayes classifier on the pre-processed dataset
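Steps 2) and 3) could be sketched with scikit-learn as follows. The toy data mirrors the question; using character n-grams rather than words is my own assumption here, since the inputs are short, noisy names:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Raw organisation names and their canonical labels, as in the question.
X = ["Flipkart.com", "FlipKart pvt ltd", "flipkart.com", "Amazon India", "amazon.in"]
y = ["flipkart", "flipkart", "flipkart", "amazon", "amazon"]

model = make_pipeline(
    # char_wb n-grams capture partial matches inside noisy names.
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True),
    MultinomialNB(),
)
model.fit(X, y)
print(model.predict(["flipkart private limited"]))
```

Swap CountVectorizer for TfidfVectorizer with the same parameters to try the TF-IDF variant of step 2).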

For more detail, check out https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

Since you are trying to work with text, I think you should train your model using GloVe. GloVe is a word-to-vector model that provides pre-trained vectors for a large vocabulary. GloVe model: https://nlp.stanford.edu/projects/glove/
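GloVe is distributed as plain-text files with one word per line followed by its vector components. A minimal loader sketch (the file name glove.6B.50d.txt and the cosine helper are illustrative, not part of GloVe itself):

```python
import math

def load_glove(path: str) -> dict:
    """Parse GloVe's plain-text format: one 'word v1 v2 ... vd' entry per line."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *values = line.rstrip().split(" ")
            vectors[word] = [float(v) for v in values]
    return vectors

def cosine(u: list, v: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Usage, assuming you have downloaded a pre-trained file such as glove.6B.50d.txt:
# vecs = load_glove("glove.6B.50d.txt")
# print(cosine(vecs["company"], vecs["organisation"]))
```

Note that for the asker's problem (brand names like 'Flipkart.com'), many tokens will be out of GloVe's vocabulary, so embeddings may help less here than for ordinary prose.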

A more advanced option is the Universal Sentence Encoder: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf

I believe you should study word embeddings to get a basic idea of how to proceed with text processing. You can see the details of text processing here: https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/

I hope this helps. All the best.
