Are there correct steps in preprocessing text for linear regression?

I've combined two different datasets so that one column has the text and another column has the sentiment score (binary 0 or 1).

I'm trying to make a linear regression model that predicts sentiment based on the words used in the text. So far, to preprocess the text, I've converted all of it to lowercase.

I'm wondering what the next step is after this? I've read up a bit, but I think I may not have the steps in the correct order:

Option A:
1. lowercase
2. remove punctuation
3. tokenize

Option B:
1. lowercase
2. tokenize
3. remove punctuation

Which way is more correct? If I remove the punctuation first, I might lose details such as "don't" and "can't".
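For illustration, here is a minimal sketch of the two orderings (using NLTK's word_tokenize as one possible tokenizer; the sample sentence is made up):

```python
import string
from nltk.tokenize import word_tokenize  # requires: nltk.download('punkt')

text = "I don't like it. Can't recommend!".lower()

# Order A: remove punctuation first, then split -> contractions get mangled
no_punct = text.translate(str.maketrans("", "", string.punctuation))
print(no_punct.split())
# ['i', 'dont', 'like', 'it', 'cant', 'recommend']

# Order B: tokenize first, then drop tokens that are pure punctuation
tokens = [t for t in word_tokenize(text) if t not in string.punctuation]
print(tokens)
# ['i', 'do', "n't", 'like', 'it', 'ca', "n't", 'recommend']
```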

Preprocessing text for machine learning usually involves two steps: (I) cleaning the text, and (II) transforming the text into numbers (aka embedding). The choice of techniques for these two steps depends heavily on the task, and the two steps are related to each other.

(I) Cleaning text: this usually involves (i) handling the case of the text, (ii) handling punctuation, and (iii) handling stopwords.

(i) Handling the case of the text: if your text is an English corpus and the selected embedding technique is for a similarity-related task, then it's better to convert all of the text/corpus to lowercase. However, if your task (e.g., tagging, machine translation) uses word embeddings as the input representation of words in a sequence model, then text casing might matter. For your regression task, it's better to convert the text to lowercase before embedding.
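A minimal sketch of the lowercasing step (assuming the combined data lives in a pandas DataFrame; the column names and toy rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"text": ["Great Movie!", "Terrible PLOT..."],
                   "sentiment": [1, 0]})

# Lowercase the whole corpus before embedding
df["text"] = df["text"].str.lower()
print(df["text"].tolist())  # ['great movie!', 'terrible plot...']
```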

(ii) Handling punctuation (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~): if you use word embedding techniques for similarity-related tasks, then you can clean/eliminate punctuation from your text corpus by substitution (e.g., replacing each punctuation character with ' '). The word embeddings for those tasks can be Bag of Words (BoW), Word2Vec, etc. For your specific task here (regression), it's good to clean punctuation by substituting ' '. For some applications (e.g., multilingual machine translation), punctuation might be important.
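A sketch of the substitution approach described above, replacing every punctuation character with a space and then collapsing the resulting runs of whitespace (the sample text is made up):

```python
import re
import string

text = "it's good -- 5/5, would watch again!"

# Replace each punctuation character with ' ', then collapse whitespace
cleaned = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)  # it s good 5 5 would watch again
```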

(iii) Handling stopwords: a stopword (e.g., the, i, he, ...) is a word that appears with very high frequency in a corpus. Stopwords usually don't provide useful information about the context or the true meaning of a sentence. Common NLP libraries such as NLTK, gensim, spaCy, and sklearn provide stopword lists for some languages. For similarity-related tasks, it's better to remove stopwords before doing embeddings, and that applies to your regression task as well. In some other tasks (e.g., machine translation), stopwords can be useful and should not be removed before learning embeddings.
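A sketch of stopword removal using NLTK's English stopword list (the token list is illustrative):

```python
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

stop_words = set(stopwords.words("english"))
tokens = ["i", "really", "liked", "the", "movie"]

# Keep only tokens that are not stopwords
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['really', 'liked', 'movie']
```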

(II) Transforming text to numbers (embedding): to be able to fit text data to a machine learning model (e.g., your regression model), you need to transform the text data into vectors of numbers. Tokenization is required before this transformation. In NLP/ML, this transformation process is called embedding. There are many different approaches to word embeddings in NLP, based on, e.g., term frequency (BoW), co-occurrence statistics (GloVe), probabilistic models (LDA2Vec), or neural networks (Word2Vec, FastText, BERT, ...). Each technique has its pros and cons, and selecting a word embedding technique depends heavily on your application/task. There isn't enough space to describe each word embedding approach here.
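Putting the pieces together, a minimal sketch with a BoW embedding (sklearn's CountVectorizer) feeding a linear model. Since the labels are binary 0/1, LogisticRegression (a linear classifier) is the usual choice here rather than plain least-squares regression; the toy data is made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["really liked the movie",
         "terrible plot and bad acting",
         "great film would watch again",
         "boring waste of time"]
labels = [1, 0, 1, 0]

# BoW embedding: each text becomes a vector of token counts.
# Note that CountVectorizer lowercases and tokenizes (dropping
# punctuation) by default, so it covers several cleaning steps above.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["liked the acting"])))
```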

The following are some online tutorials for working with text that can help you get going quickly and apply these steps to your problem:

1) Sklearn - Working with text data ( https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html )

2) NLTK tutorial ( https://www.nltk.org/book/ch01.html )

3) spaCy - Language processing pipelines tutorial ( https://spacy.io/usage/processing-pipelines )

4) How to Clean Text for Machine Learning with Python ( https://machinelearningmastery.com/clean-text-machine-learning-python/ )
