Preprocessing text data with different notation in Python

Using Python 3, I am working with a data frame that requires text preprocessing.

The data frame contains historical sales for many different medical products in many different strengths. For simplicity, the code below shows only part of the strength column.

import pandas as pd

df = pd.DataFrame({'Strength': [
    '20 mg / 120 mg', ' 40/320 mg', '20mg/120mg', '150+750mg', '20/120MG',
    '62.5mg/375mg', '100 mg', 'Product1 20 mg, Product2 120 mg', '40mg/320mg',
    'Product 20mg/120mg', 'Product1 20mg Product2 120mg', '100mg/1ml',
    '20 mg./ 120 mg..', '62.5 mg / 375 mg', '40/320mg 9s', '40/320', '50/125',
    '100mg..', '20/120']})  # comma after '100mg..' keeps the last two entries separate
print(df)

                                 Strength
0                          20 mg / 120 mg
1                               40/320 mg
2                              20mg/120mg
3                               150+750mg
4                                20/120MG
5                            62.5mg/375mg
6                                  100 mg
7         Product1 20 mg, Product2 120 mg
8                              40mg/320mg
9                      Product 20mg/120mg
10           Product1 20mg Product2 120mg
11                              100mg/1ml
12                       20 mg./ 120 mg..
13                       62.5 mg / 375 mg
14                            40/320mg 9s
15                                 40/320
16                                 50/125
17                                100mg..
18                                 20/120

As you can see, the same strength is written in several different ways. For example, '20 mg / 120 mg' and 'Artemether 20 mg, Lumefantrine 120 mg' denote the same strength.

Lowercasing the text, removing whitespace, and replacing + with /, as in the following code, brings some standardization, but several rows that clearly denote the same strength still differ.

df['Strength'] = (
    df['Strength']
    .str.lower()                          # unify case: 'MG' -> 'mg'
    .str.replace(' ', '', regex=False)    # drop spaces
    .str.replace('+', '/', regex=False)   # treat '+' as a literal, not a regex quantifier
)
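To make the remaining variation concrete, here is a minimal plain-Python sketch of the same three steps applied to a handful of the values above. It avoids pandas only so that it runs standalone. The helper name `basic_clean` and the choice of sample values are assumptions made for illustration.

```python
# Plain-Python equivalent of the three cleaning steps, applied to a
# sample of the 'Strength' values from the question.
samples = [
    '20 mg / 120 mg', '20mg/120mg', '20/120MG', '20 mg./ 120 mg..',
    'Product1 20 mg, Product2 120 mg', '150+750mg', '40/320', ' 40/320 mg',
]


def basic_clean(text):
    """Lowercase, drop spaces, and turn '+' into '/' (hypothetical helper)."""
    return text.lower().replace(' ', '').replace('+', '/')


cleaned = [basic_clean(s) for s in samples]
distinct = sorted(set(cleaned))

# The 8 sample values collapse to 7 distinct strings.
print(len(samples), '->', len(distinct))
for value in distinct:
    print(value)
```

Only the first two samples merge. The 20/120 strength still appears as '20mg/120mg', '20/120mg', '20mg./120mg..' and 'product120mg,product2120mg'. The last one also shows a side effect of dropping spaces: 'Product1 20 mg' fuses into 'product120mg', so the product number and the dose can no longer be told apart.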

Adding commands like the following further reduces the number of different notations, but this is far too manual.

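# Build the mask once and assign through a single .loc call (avoids chained assignment)
mask = (df['Strength'].str.contains('Product1', case=False)
        & df['Strength'].str.contains('Product2', case=False))
df.loc[mask, 'Strength'] = '20mg/120mg'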

Is there an efficient approach for reducing the number of unique notations?

Add a new column with a fixed label for each strength, train a suitable ML classifier on it, and use it to predict the appropriate strength for each new item.

Whenever a new notation appears, manually assign it a label and retrain.
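As a rough illustration of this label, train, predict, and retrain loop, the sketch below uses a tiny nearest-neighbour classifier over character bigrams, written with the standard library only. It is a stand-in for whatever "suitable ML classifier" is chosen, not a recommendation of this particular model. The class name `NearestNotation`, the label strings, and the hand-labelled examples are all assumptions made for illustration.

```python
def char_ngrams(text, n=2):
    """Set of character n-grams of the lowercased, space-free text."""
    text = text.lower().replace(' ', '')
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class NearestNotation:
    """Hypothetical 1-nearest-neighbour classifier over character bigrams."""

    def fit(self, texts, labels):
        self.examples = [(char_ngrams(t), lab) for t, lab in zip(texts, labels)]
        return self

    def predict(self, text):
        grams = char_ngrams(text)
        # Return the label of the most similar training notation.
        return max(self.examples, key=lambda ex: jaccard(grams, ex[0]))[1]


# Hand-labelled notations (labels are an assumed canonical spelling).
texts = ['20 mg / 120 mg', '20/120MG', ' 40/320 mg', '40mg/320mg',
         '62.5mg/375mg', '100 mg']
labels = ['20mg/120mg', '20mg/120mg', '40mg/320mg', '40mg/320mg',
          '62.5mg/375mg', '100mg']

model = NearestNotation().fit(texts, labels)
print(model.predict('20 mg./ 120 mg..'))  # -> 20mg/120mg
print(model.predict('40/320'))            # -> 40mg/320mg

# A new notation appears: label it by hand, then retrain on the larger set.
texts.append('150+750mg')
labels.append('150mg/750mg')
model.fit(texts, labels)
print(model.predict('150mg+750mg'))       # -> 150mg/750mg
```

A nearest-neighbour model always returns one of the known labels, so an unseen strength is silently mapped to the closest existing one until someone labels it. In practice it is worth checking the similarity score, or reviewing low-confidence predictions by hand, before trusting the result.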
