How to handle numbers embedded in text during NLP pre-processing?

I am trying to run the LDA algorithm on a data set of news articles. I understand that numbers should be removed during the pre-processing step, and I have written a simple regex that replaces them with an empty string.

df['number_removed'] = df['text'].str.replace(r'\d+', '', regex=True)

However, I would like to retain some numbers, since removing them can change the context or topic. For example:

[Desired] 'The fourth industrial revolution also referred to as Industry 40 is starting to change the way goods are produced'

[Wrong] 'The fourth industrial revolution also referred to as Industry is starting to change the way goods are produced'

Note: Punctuation has been removed from the examples as part of pre-processing.

So, I was wondering:

  1. Can essential numbers be retained before running LDA?
  2. How can I selectively remove numbers, or otherwise handle the situation above?

A common approach in similar situations is to replace numbers with a dummy token, such as <NUMBER>, so that the fact that the original text contained a number is preserved without disturbing the syntactic context. The actual value is usually not that important for generalisation.
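As a rough illustration of the dummy-token idea, here is a minimal sketch using Python's standard `re` module. The names `NUMBER_TOKEN` and `mask_numbers` are made up for this example, and the pattern `\d+` is the same one used in the question.

```python
import re

# Hypothetical placeholder; any string that cannot collide with a real word works.
NUMBER_TOKEN = "<NUMBER>"


def mask_numbers(text: str) -> str:
    """Replace every run of digits with the placeholder token."""
    return re.sub(r"\d+", NUMBER_TOKEN, text)


masked = mask_numbers("Sales rose 15 percent in 2020")
print(masked)
```

If a later pre-processing step strips punctuation, the angle brackets would be lost, so it may be safer to mask numbers after that step or to choose a purely alphabetic token. On a DataFrame, the same pattern should work with `df['text'].str.replace(r'\d+', '<NUMBER>', regex=True)`.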

If you want to retain specific numbers (like "Industry 40"), then I guess you need to adjust your regex to keep those patterns.
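One possible way to adjust the regex is sketched below: numbers are removed unless they directly follow a word from an allowlist. The allowlist `KEEP_AFTER` and both function names are assumptions made up for this example; which words deserve protection depends on your corpus.

```python
import re

# Hypothetical allowlist: words whose following number should be kept.
KEEP_AFTER = ["Industry"]


def build_pattern(keep_after):
    """Build a regex matching digit runs NOT preceded by an allowlisted word."""
    # One fixed-width negative lookbehind per word (Python's re requires fixed width).
    lookbehinds = "".join(f"(?<!{re.escape(word)} )" for word in keep_after)
    # (?<!\d) stops a match from starting in the middle of a protected number,
    # which would otherwise turn "Industry 40" into "Industry 4".
    return re.compile(r"(?<!\d)" + lookbehinds + r"\d+")


def remove_numbers_selectively(text, keep_after=KEEP_AFTER):
    """Drop unprotected numbers, then collapse the leftover whitespace."""
    stripped = build_pattern(keep_after).sub("", text)
    return " ".join(stripped.split())


cleaned = remove_numbers_selectively("Industry 40 grew 15 percent in 2020")
print(cleaned)
```

The match is case-sensitive and expects exactly one space between the word and the number, which fits text whose punctuation has already been removed. For a DataFrame column, something like `df['text'].apply(remove_numbers_selectively)` should apply it row by row.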
