Replace string in the same column in Series pandas
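This may not be easy to follow at first glance, so let me go step by step.
I have a DataFrame, df_answers_clean, with roughly 50,000 rows.
Running len(set(df_answers_clean['Race'])) gives me 98 unique values, which is too many for the classification I want to do later. Here is a sample: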
['Native American, Pacific Islander, or Indigenous Australian; South Asian; White or of European descent', 'Hispanic or Latino/Latina; South Asian', 'East Asian; Hispanic or Latino/Latina',
'East Asian', 'Black or of African descent; East Asian; South Asian; White or of European descent',
'Black or of African descent; East Asian; Hispanic or Latino/Latina; Middle Eastern; Native American, Pacific Islander, or Indigenous Australian; South Asian; White or of European descent',
'Hispanic or Latino/Latina; White or of European descent', 'White or of European descent; I prefer not to say', 'South Asian; White or of European descent', 'White or of European descent',
'Hispanic or Latino/Latina', 'Black or of African descent; I don’t know; I prefer not to say',
'Native American, Pacific Islander, or Indigenous Australian; White or of European descent; I don’t know', 'East Asian; White or of European descent; I don’t know', 'Native American, Pacific Islander, or Indigenous Australian', 'South Asian; White or of European descent; I don’t know',
'Black or of African descent; Middle Eastern; White or of European descent; I don’t know',
'Hispanic or Latino/Latina; Middle Eastern; White or of European descent',
'Middle Eastern; White or of European descent',
'Middle Eastern; South Asian']
I cleaned up this mess with several lines of code:
# regex=True is passed explicitly: recent pandas versions treat the pattern as a literal string by default
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Black or of African descent[\s\S]*', 'Black or of African descent', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^East Asian[\s\S]*', 'East Asian', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Hispanic or Latino/Latina[\s\S]*', 'Hispanic or Latino/Latina', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Middle Eastern[\s\S]*', 'Middle Eastern', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^Native American, Pacific Islander, or Indigenous Australian[\s\S]*', 'Native American, Pacific Islander, or Indigenous Australian', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^South Asian[\s\S]*', 'South Asian', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^White or of European descent[\s\S]*', 'White or of European descent', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^I don’t know[\s\S]*', 'No data', regex=True)
df_answers_clean['Race'] = df_answers_clean['Race'].str.replace(r'^I prefer not to say[\s\S]*', 'No data', regex=True)
The result is a set of unique groups that is finally usable for the classification task:
{'Black or of African descent',
'East Asian',
'Hispanic or Latino/Latina',
'Middle Eastern',
'Native American, Pacific Islander, or Indigenous Australian',
'No data',
'South Asian',
'White or of European descent'}
As I said, it works, but that many repeated lines of code are impractical.
My other idea was to put the final category names in a list (race_names_change) and do everything in a for loop:
race_names_change = ['Black or of African descent', 'East Asian', 'Hispanic or Latino/Latina', 'Middle Eastern', 'South Asian', 'Native American, Pacific Islander, or Indigenous Australian', 'White or of European descent']
for i in race_names_change:
    replace_string = str('^'+ i +'[\s\S]*')
    df_answers_clean['Race'].str.replace('replace_string', i, regex=True)
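Unfortunately it does not work: the column still has the same 98 unique values as at the start.
Is something wrong with the loop, or is there a better way to do this (map, apply)?
Thanks for any suggestions.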
If you can build a dictionary that maps each regex pattern to its desired output, you can simply use pd.Series.replace():
d = {
'pattern1':'output1',
'pattern2':'output2'
}
df_answers_clean['Race'].replace(d, regex=True)
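Note that pd.Series.str.replace() is different from pd.Series.replace(): the former takes a single pattern and a single replacement, while the latter also accepts a dict of {pattern: replacement} pairs.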
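To make the difference concrete, here is a minimal sketch on a toy Series. The three sample values and the variable names (s, a, b, patterns) are made up for illustration; only the two pandas methods come from the answer. regex=True is passed explicitly in both calls, since recent pandas versions treat the str.replace() pattern as a literal string by default.

```python
import pandas as pd

# Toy data, not the real column.
s = pd.Series([
    "East Asian; South Asian",
    "East Asian",
    "Middle Eastern; South Asian",
])

# Series.str.replace: one pattern and one replacement per call.
a = s.str.replace(r"^East Asian[\s\S]*", "East Asian", regex=True)

# Series.replace: a dict of {pattern: replacement}; with regex=True
# the keys are interpreted as regular expressions.
patterns = {
    r"^East Asian[\s\S]*": "East Asian",
    r"^Middle Eastern[\s\S]*": "Middle Eastern",
}
b = s.replace(patterns, regex=True)

print(a.tolist())
print(b.tolist())
```

With str.replace() the third value is left untouched because only one pattern was given; the dict passed to replace() handles both prefixes in a single call.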
Try this:
race_names_change = ['Black or of African descent', 'East Asian', 'Hispanic or Latino/Latina', 'Middle Eastern', 'South Asian', 'Native American, Pacific Islander, or Indigenous Australian', 'White or of European descent']
d = {}
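for i in race_names_change:
    # anchor each category name at the start and swallow the rest of the string
    replace_string = '^' + i + r'[\s\S]*'
    d[replace_string] = i
# replace() returns a new Series, so assign it back to the column
df_answers_clean['Race'] = df_answers_clean['Race'].replace(d, regex=True)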
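For reference, the loop in the question failed for two reasons visible in its code: it passed the literal string 'replace_string' instead of the variable, and it never assigned the result back to the column. Below is a self-contained sketch of the dictionary approach that can be run as is. The five-row DataFrame is a made-up stand-in for the real data, and the name mapping, the use of re.escape, and the two extra "No data" entries (taken from the question's original cleanup, not from this answer) are my assumptions.

```python
import re
import pandas as pd

# Toy stand-in for df_answers_clean; the real frame has ~50,000 rows.
df_answers_clean = pd.DataFrame({"Race": [
    "East Asian; Hispanic or Latino/Latina",
    "Middle Eastern; South Asian",
    "White or of European descent; I prefer not to say",
    "I prefer not to say",
    "South Asian",
]})

race_names_change = [
    "Black or of African descent", "East Asian", "Hispanic or Latino/Latina",
    "Middle Eastern", "South Asian",
    "Native American, Pacific Islander, or Indigenous Australian",
    "White or of European descent",
]

# re.escape guards against regex metacharacters inside a category name.
mapping = {"^" + re.escape(name) + r"[\s\S]*": name for name in race_names_change}
# Both "no answer" options collapse into one label, as in the question.
mapping[r"^I don’t know[\s\S]*"] = "No data"
mapping[r"^I prefer not to say[\s\S]*"] = "No data"

# replace() returns a new Series, so the result is assigned back.
df_answers_clean["Race"] = df_answers_clean["Race"].replace(mapping, regex=True)
print(sorted(df_answers_clean["Race"].unique()))
```

Because every pattern is anchored with ^, each value is reduced to whichever category it starts with, which matches the behaviour of the nine separate str.replace() calls in the question.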