
Replace strings in the same column of a pandas Series

This may not be easy to follow at first glance, so I will go through it step by step.

My DataFrame df_answers_clean has roughly 50,000 rows.

Running len(set(df_answers_clean['Race'])) gives 98 unique values, which is too many for the classification I want to do later. A sample of them is listed below:

['Native American, Pacific Islander, or Indigenous Australian; South Asian; White or of European descent', 'Hispanic or Latino/Latina; South Asian', 'East Asian; Hispanic or Latino/Latina',
 'East Asian', 'Black or of African descent; East Asian; South Asian; White or of European descent',
 'Black or of African descent; East Asian; Hispanic or Latino/Latina; Middle Eastern; Native American, Pacific Islander, or Indigenous Australian; South Asian; White or of European descent',
 'Hispanic or Latino/Latina; White or of European descent', 'White or of European descent; I prefer not to say', 'South Asian; White or of European descent', 'White or of European descent',
 'Hispanic or Latino/Latina', 'Black or of African descent; I don’t know; I prefer not to say',
 'Native American, Pacific Islander, or Indigenous Australian; White or of European descent; I don’t know', 'East Asian; White or of European descent; I don’t know', 'Native American, Pacific Islander, or Indigenous Australian', 'South Asian; White or of European descent; I don’t know',
 'Black or of African descent; Middle Eastern; White or of European descent; I don’t know',
 'Hispanic or Latino/Latina; Middle Eastern; White or of European descent',
 'Middle Eastern; White or of European descent',
 'Middle Eastern; South Asian']

I cleaned up this mess with many lines of code:

# regex=True is passed explicitly (newer pandas versions default to literal matching); raw strings keep \s and \S warning-free
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Black or of African descent[\s\S]*', 'Black or of African descent', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^East Asian[\s\S]*', 'East Asian', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Hispanic or Latino/Latina[\s\S]*', 'Hispanic or Latino/Latina', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Middle Eastern[\s\S]*', 'Middle Eastern', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Native American, Pacific Islander, or Indigenous Australian[\s\S]*', 'Native American, Pacific Islander, or Indigenous Australian', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^South Asian[\s\S]*', 'South Asian', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^White or of European descent[\s\S]*', 'White or of European descent', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^I don’t know[\s\S]*', 'No data', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^I prefer not to say[\s\S]*', 'No data', regex=True)

The result is a set of unique groups that is finally usable for the later classification task:

{'Black or of African descent',
 'East Asian',
 'Hispanic or Latino/Latina',
 'Middle Eastern',
 'Native American, Pacific Islander, or Indigenous Australian',
 'No data',
 'South Asian',
 'White or of European descent'}

This works, but so many lines of duplicated code are not practical.

My other idea was to make a list of the final values (race_names_change) and do everything in a for loop:

race_names_change = ['Black or of African descent', 'East Asian', 'Hispanic or Latino/Latina', 'Middle Eastern', 'South Asian', 'Native American, Pacific Islander, or Indigenous Australian', 'White or of European descent']

for i in race_names_change:
    replace_string = str('^'+ i +'[\s\S]*')
    df_answers_clean['Race'].str.replace('replace_string', i, regex=True)

Unfortunately, it does not work: the column still has the same 98 unique values as at the beginning.

Is something wrong in the loop, or is there a better way of doing this (map, apply)?

Thanks for any advice.

If you can create a dictionary that maps the required regex patterns to their replacements, you can simply use pd.Series.replace():

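d = {
    'pattern1': 'output1',  # regex pattern -> replacement (placeholders)
    'pattern2': 'output2',
}

# replace() returns a new Series, so assign the result back to the column
df_answers_clean['Race'] = df_answers_clean['Race'].replace(d, regex=True)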

Note that pd.Series.str.replace() is different from pd.Series.replace(): the latter accepts a dictionary of patterns when regex=True.
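To make the dictionary approach concrete, here is a minimal sketch on a handful of values copied from the question. The names race, patterns and cleaned are made up for illustration; only three of the categories are included to keep it short.

```python
import pandas as pd

# Small sample in the same format as the 'Race' column (values from the question)
race = pd.Series([
    'East Asian; Hispanic or Latino/Latina',
    'South Asian; White or of European descent',
    'White or of European descent; I prefer not to say',
    'East Asian',
])

# Each key is a regex; each value replaces whatever that regex matches
patterns = {
    r'^East Asian[\s\S]*': 'East Asian',
    r'^South Asian[\s\S]*': 'South Asian',
    r'^White or of European descent[\s\S]*': 'White or of European descent',
}

# Series.replace returns a new Series; the original is left unchanged
cleaned = race.replace(patterns, regex=True)
print(cleaned.tolist())
```

Because each pattern is anchored with ^ and swallows the rest of the string, every value collapses to the first category it lists.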


Try this:

race_names_change = ['Black or of African descent', 'East Asian', 'Hispanic or Latino/Latina', 'Middle Eastern', 'South Asian', 'Native American, Pacific Islander, or Indigenous Australian', 'White or of European descent']

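d = {}
for i in race_names_change:
    # Pattern: the category at the start of the string, then anything after it
    replace_string = '^' + i + r'[\s\S]*'
    d[replace_string] = i

# replace() returns a new Series, so assign the result back to the column
df_answers_clean['Race'] = df_answers_clean['Race'].replace(d, regex=True)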
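For comparison, the for loop from the question appears to fail for two reasons visible in its code: it passes the literal string 'replace_string' instead of the variable, and it never assigns the result of str.replace() back to the column. The sketch below is one possible corrected version, not the only way to do it. The small DataFrame is a hypothetical stand-in built from values in the question, the targets dictionary is an illustrative name, and re.escape is an added precaution in case a category name ever contains regex metacharacters.

```python
import re
import pandas as pd

# Hypothetical stand-in for df_answers_clean, using values from the question
df_answers_clean = pd.DataFrame({'Race': [
    'Hispanic or Latino/Latina; South Asian',
    'Middle Eastern; White or of European descent',
    'Black or of African descent; I don’t know; I prefer not to say',
    'I prefer not to say',
    'East Asian',
]})

race_names_change = [
    'Black or of African descent', 'East Asian', 'Hispanic or Latino/Latina',
    'Middle Eastern', 'South Asian',
    'Native American, Pacific Islander, or Indigenous Australian',
    'White or of European descent',
]

# Each category maps to itself; the two non-answers map to 'No data'
targets = {name: name for name in race_names_change}
targets['I don’t know'] = 'No data'
targets['I prefer not to say'] = 'No data'

for name, label in targets.items():
    # Build the pattern from the variable itself, not from a quoted name
    pattern = '^' + re.escape(name) + r'[\s\S]*'
    # str.replace returns a new Series, so the result must be assigned back
    df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(
        pattern, label, regex=True
    )

print(sorted(set(df_answers_clean['Race'])))
```

The loop and the dictionary version should give the same result here; the dictionary version simply hands all patterns to pandas in one call.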
