How to handle numbers embedded in text during NLP preprocessing?

Question

I am trying to run the LDA algorithm on a data set of news articles. I understand that numbers must be removed during the pre-processing step, and I have written a simple regex that replaces them with an empty string.

df['number_removed'] = df['text'].str.replace(r'\d+', '', regex=True)

However, I would like to retain some numbers, since removing them can change the context or topic. For example:

[Desired] 'The fourth industrial revolution also referred to as Industry 40 is starting to change the way goods are produced'

[Wrong] 'The fourth industrial revolution also referred to as Industry is starting to change the way goods are produced'

Note: Punctuation has already been removed from the example as part of pre-processing.

So, I was wondering:

Can essential numbers be retained before running LDA?
How can I selectively remove numbers, or otherwise handle the situation above?

Answer 1

What is sometimes done in similar situations is to replace numbers with a dummy token, such as <NUMBER>, so that the fact that there was a number in the original text is preserved without disturbing the syntactic context. The actual value is usually not that important for generalisations.
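A minimal sketch of the dummy-token idea, using only Python's standard `re` module. The helper name `mask_numbers` is made up for illustration, and the token string is just the one mentioned above; any string that cannot collide with a real word would do. Note that a later punctuation-stripping or tokenization step may mangle the angle brackets, so the masking would probably have to run after that step or use a plain token instead.

```python
import re

# Placeholder taken from the answer text; the exact string is an assumption.
NUMBER_TOKEN = "<NUMBER>"

def mask_numbers(text: str) -> str:
    """Replace every run of digits with a single placeholder token."""
    return re.sub(r"\d+", NUMBER_TOKEN, text)

print(mask_numbers("Sales rose 15 percent in 2021"))
# prints: Sales rose <NUMBER> percent in <NUMBER>
```

On a DataFrame like the one in the question, the same function could be applied with df['text'].apply(mask_numbers).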

If you want to retain concrete numbers (like "Industry 40"), then I guess you need to adjust your regex to keep those patterns.
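One possible way to keep such patterns, sketched here with a token loop rather than a single regex (Python's `re` does not support variable-length lookbehind, which makes a pure-regex whitelist awkward). The whitelist `KEEP_AFTER` and the function name are hypothetical; the sketch assumes punctuation has already been stripped and only handles numbers that stand alone as whitespace-separated tokens, so digits glued to letters are left untouched.

```python
import re

# Hypothetical whitelist: words after which a number carries meaning.
KEEP_AFTER = {"industry"}

def remove_numbers_selectively(text: str, keep_after=KEEP_AFTER) -> str:
    """Drop standalone numeric tokens unless the preceding word is whitelisted."""
    kept = []
    for token in text.split():
        if re.fullmatch(r"\d+", token):
            # Keep the number only when it directly follows a whitelisted word.
            if kept and kept[-1].lower() in keep_after:
                kept.append(token)
            continue
        kept.append(token)
    return " ".join(kept)

sentence = ("The fourth industrial revolution also referred to as "
            "Industry 40 is starting to change the way goods are produced in 2022")
print(remove_numbers_selectively(sentence))
```

Here "40" survives because it follows "Industry", while the trailing "2022" is dropped. The whitelist would have to be built by hand for the corpus, which is the main cost of this approach.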
