How to handle numbers embedded in text during NLP pre-processing?
I am trying to run the LDA algorithm on a data set of news articles. I understand that numbers must be removed during the pre-processing step, and I have written a simple regex to strip them out:
df['number_removed'] = df['text'].str.replace(r'\d+', '', regex=True)
However, I would like to retain some numbers, since removing them can change the context or topic. For example:
[Desired] 'The fourth industrial revolution also referred to as Industry 40 is starting to change the way goods are produced'
[Wrong] 'The fourth industrial revolution also referred to as Industry is starting to change the way goods are produced'
Note: Punctuation has been removed in the example as part of pre-processing.
So, I was wondering:
A common approach in similar situations is to replace each number with a dummy token, such as <NUMBER>. This preserves the fact that a number appeared in the original text without disturbing the syntactic context. The actual value is usually not that important for generalisation.
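A minimal sketch of the dummy-token idea, using only the standard `re` module. The function name `mask_numbers`, the constant `NUMBER_TOKEN`, and the choice to treat digit groups joined by `.` or `,` as a single number are assumptions made for illustration, not part of the original answer.

```python
import re

# Assumed placeholder; any string that cannot collide with a real word works.
NUMBER_TOKEN = "<NUMBER>"

# One or more digits, optionally followed by groups such as ".5" or ",000",
# so that "3.5" and "1,000,000" each become a single token.
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def mask_numbers(text: str) -> str:
    """Replace every number in `text` with NUMBER_TOKEN."""
    return _NUMBER_RE.sub(NUMBER_TOKEN, text)


print(mask_numbers("Sales rose 3.5 percent in 2019"))
```

With a DataFrame like the one in the question, this could be applied as `df['text'].map(mask_numbers)`. If punctuation is stripped later in the pipeline, the angle brackets would be removed as well, so the masking step would have to run after that, or a purely alphabetic token would be needed.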
If you want to retain specific numbers (like "Industry 40"), then I guess you need to adjust your regex to keep those patterns.
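One way such an adjusted regex might look is sketched below. It matches the phrases to keep first and any other run of digits second, then puts back only the kept phrases. The allowlist `KEEP_PATTERNS` and the function name `remove_numbers_except_kept` are hypothetical; which phrases belong on the list depends on the corpus.

```python
import re

# Hypothetical allowlist: phrases whose numbers should survive.
KEEP_PATTERNS = [r"industry\s+\d+"]

# Alternation order matters: a kept phrase is tried first (group 1),
# and a bare run of digits is matched only if no kept phrase starts there.
_KEEP_OR_NUMBER = re.compile(
    r"\b(" + "|".join(KEEP_PATTERNS) + r")\b|\d+",
    flags=re.IGNORECASE,
)


def remove_numbers_except_kept(text: str) -> str:
    """Delete numbers, except those inside an allowlisted phrase."""
    # group(1) is None for a bare number, so it is replaced by "".
    stripped = _KEEP_OR_NUMBER.sub(lambda m: m.group(1) or "", text)
    # Collapse the double spaces left behind by deleted numbers.
    return re.sub(r"\s+", " ", stripped).strip()


print(remove_numbers_except_kept("Industry 40 grew 15 percent in 2020"))
```

The same function could be passed to `df['text'].map(...)` in place of the blanket `\d+` replacement. Extending `KEEP_PATTERNS` (for example with terms found by inspecting frequent word-number pairs in the corpus) keeps the rule set explicit and easy to review.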