Replacing strings within the same column of a pandas Series

Question

This may not be easy to understand at first sight, so I'll go step by step.

This is the beginning of my roughly 50,000-row DataFrame df_answers_clean.

After running len(set(df_answers_clean['Race'])) I get 98 unique values, which is too many for my later classification. A sample of them is below:

['Native American, Pacific Islander, or Indigenous Australian; South Asian; White or of European descent', 'Hispanic or Latino/Latina; South Asian', 'East Asian; Hispanic or Latino/Latina',
 'East Asian', 'Black or of African descent; East Asian; South Asian; White or of European descent',
 'Black or of African descent; East Asian; Hispanic or Latino/Latina; Middle Eastern; Native American, Pacific Islander, or Indigenous Australian; South Asian; White or of European descent',
 'Hispanic or Latino/Latina; White or of European descent', 'White or of European descent; I prefer not to say', 'South Asian; White or of European descent', 'White or of European descent',
 'Hispanic or Latino/Latina', 'Black or of African descent; I don’t know; I prefer not to say',
 'Native American, Pacific Islander, or Indigenous Australian; White or of European descent; I don’t know', 'East Asian; White or of European descent; I don’t know', 'Native American, Pacific Islander, or Indigenous Australian', 'South Asian; White or of European descent; I don’t know',
 'Black or of African descent; Middle Eastern; White or of European descent; I don’t know',
 'Hispanic or Latino/Latina; Middle Eastern; White or of European descent',
 'Middle Eastern; White or of European descent',
 'Middle Eastern; South Asian']

I cleaned up this mess with many lines of code:

df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Black or of African descent[\s\S]*', 'Black or of African descent', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^East Asian[\s\S]*', 'East Asian', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Hispanic or Latino/Latina[\s\S]*', 'Hispanic or Latino/Latina', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Middle Eastern[\s\S]*', 'Middle Eastern', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Native American, Pacific Islander, or Indigenous Australian[\s\S]*', 'Native American, Pacific Islander, or Indigenous Australian', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^South Asian[\s\S]*', 'South Asian', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^White or of European descent[\s\S]*', 'White or of European descent', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^I don’t know[\s\S]*', 'No data', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^I prefer not to say[\s\S]*', 'No data', regex=True)

The result is a small set of unique groups, which is finally usable for the later classification task:

{'Black or of African descent',
 'East Asian',
 'Hispanic or Latino/Latina',
 'Middle Eastern',
 'Native American, Pacific Islander, or Indigenous Australian',
 'No data',
 'South Asian',
 'White or of European descent'}

As I said, it works, but so many lines of duplicated code are not practical.

My other idea was to make a list of the final values (race_names_change) and run everything through a for loop:

race_names_change = ['Black or of African descent', 'East Asian', 'Hispanic or Latino/Latina', 'Middle Eastern', 'South Asian', 'Native American, Pacific Islander, or Indigenous Australian', 'White or of European descent']

for i in race_names_change:
    replace_string = str('^'+ i +'[\s\S]*')
    df_answers_clean['Race'].str.replace('replace_string', i, regex=True)

Unfortunately, it does not work: the set of values is the same as at the beginning (98 unique values).

Is something wrong in the loop code, or is there another way of doing this (map, apply)?

Thanks for any advice.

Answer 1

If you can create a dictionary that maps the required regex patterns to their corresponding outputs, then you can simply use pd.Series.replace():

d = {
    'pattern1':'output1',
    'pattern2':'output2'
    }

df_answers_clean['Race'].replace(d, regex=True)

Note that pd.Series.str.replace() is different from pd.Series.replace().

Try this:
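To make that difference concrete, here is a small sketch on a made-up two-row Series (the data and the variable names `whole`, `inner` and `by_dict` are illustrative only, not taken from the question's DataFrame):

```python
import pandas as pd

s = pd.Series(["East Asian; South Asian", "East Asian"])

# Series.replace without regex=True only swaps cells whose whole value matches.
whole = s.replace("East Asian", "EA")

# Series.str.replace substitutes inside every string, one pattern per call.
inner = s.str.replace("East Asian", "EA", regex=False)

# Series.replace with regex=True does regex substitution as well,
# and it accepts a dict of {pattern: replacement} in a single call.
by_dict = s.replace({r"^East Asian[\s\S]*": "East Asian"}, regex=True)

print(whole.tolist())    # ['East Asian; South Asian', 'EA']
print(inner.tolist())    # ['EA; South Asian', 'EA']
print(by_dict.tolist())  # ['East Asian', 'East Asian']
```

The dict form is what makes pd.Series.replace() convenient here: all patterns go into one call instead of one line per category.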

race_names_change = ['Black or of African descent', 'East Asian', 'Hispanic or Latino/Latina', 'Middle Eastern', 'South Asian', 'Native American, Pacific Islander, or Indigenous Australian', 'White or of European descent']

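# Build a {regex pattern: replacement} dictionary, one entry per leading category
d = {}
for i in race_names_change:
    replace_string = r'^' + i + r'[\s\S]*'
    d[replace_string] = i

# replace() returns a new Series, so assign the result back to the column
df_answers_clean['Race'] = df_answers_clean['Race'].replace(d, regex=True)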
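As for why the loop in the question changed nothing, two things seem to be responsible: the pattern was passed as the literal string 'replace_string' rather than the variable, and the result of str.replace() was never assigned back to the column. A minimal sketch of a corrected loop follows, assuming a hypothetical four-row stand-in `df` for df_answers_clean and a hypothetical `targets` mapping that also folds the two "no answer" responses into 'No data'; re.escape is an extra precaution, not something the question used:

```python
import re
import pandas as pd

# Hypothetical miniature of df_answers_clean, for illustration only.
df = pd.DataFrame({"Race": [
    "East Asian; Hispanic or Latino/Latina",
    "White or of European descent; I prefer not to say",
    "I don’t know",
    "Hispanic or Latino/Latina; South Asian",
]})

# Leading answer -> label to keep (hypothetical subset of the full list).
targets = {
    "East Asian": "East Asian",
    "Hispanic or Latino/Latina": "Hispanic or Latino/Latina",
    "White or of European descent": "White or of European descent",
    "I don’t know": "No data",
    "I prefer not to say": "No data",
}

for prefix, label in targets.items():
    # re.escape guards against regex metacharacters in the category name.
    pattern = rf"^{re.escape(prefix)}[\s\S]*"
    # Pass the variable itself and assign the returned Series back.
    df["Race"] = df["Race"].str.replace(pattern, label, regex=True)

print(sorted(set(df["Race"])))
```

Either form should give the same result; the dictionary version just moves the loop inside pandas.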
